The Anatomy of an Agentic Benchmark: From GitHub Issue to Evaluation Task
SWE-bench changed how the world evaluates AI coding ability. But turning a real GitHub bug report into a fair, reproducible test for an AI agent is surprisingly complex. This is how it actually works, step by step.

There is a classic interview question for software engineers: "Tell me about a time you debugged a difficult problem."
You would never evaluate a developer by just asking them this question, listening to their answer, and deciding whether they are good at debugging. You would give them an actual bug and watch them work.
That is the idea behind agentic benchmarks. And it turns out that building one properly is far more complicated than most people realize.
The Old Way of Testing AI Was Missing Something
For years, the standard way to evaluate whether an AI was good at coding was to ask it coding questions and check the answers.
Write a function that reverses a string. Implement binary search. Solve this algorithm problem. The AI would produce code, the code would be run against test cases, and a score would come out.
This approach had real value. But it was measuring something closer to "AI that has seen a lot of algorithm tutorials" than "AI that can actually help a developer do their job."
Real software development is nothing like a LeetCode problem. Real software development looks like this: you get a bug report from a user, you open a codebase you did not write, you read through files trying to understand what is going on, you form a hypothesis about what is broken, you make some changes, you run the tests to see if they pass, you find out you were wrong, you go back and look at different files, and eventually, if you are good, you find and fix the actual problem.
That is a completely different skill. It requires navigating a real codebase, not writing something from scratch. It requires forming and testing hypotheses, not just applying known algorithms. It requires using tools like a search command and a file editor, not just producing text.
In 2023, researchers at Princeton introduced SWE-bench, a benchmark built around this observation. Instead of invented coding puzzles, SWE-bench used real bug reports from real GitHub repositories. The AI had to actually fix the bugs.
The results were humbling. Models that scored impressively on traditional coding benchmarks solved only a low single-digit percentage of SWE-bench tasks. The benchmark revealed a massive gap between "AI that knows about code" and "AI that can work with code."
But what most people do not think about is how much work goes into building a benchmark like this. Going from a GitHub issue to a proper, reproducible evaluation task is a process with many steps, each of which can go wrong in subtle ways.
What Makes a Good Evaluation Task
Before you can build anything, you have to decide what counts as a valid task.
Not every GitHub issue works. In fact, most of them do not.
Imagine trying to turn a bug report into an evaluation task for an AI. The bug report says: "the login button is broken on mobile." That might be a valid bug, but it is completely useless as an evaluation task. What does "broken" mean exactly? How would you check whether an AI fixed it? You would need to visually inspect a browser on a mobile device. There is no automated way to verify the solution.
For an agentic benchmark, a task needs three things that most GitHub issues do not have:
A clear, verifiable fix. You need to know what success looks like. This almost always means the repository already has tests that fail in the broken state and pass when the bug is correctly fixed. Issues that were resolved without any automated tests are almost impossible to use.
A self-contained scope. The fix should be achievable by changing files inside the repository. Issues that require database migrations, external API keys, hardware setup, or changes to external services are not suitable. The AI agent needs to be able to do its work in an isolated, contained environment.
An already-confirmed resolution. The issue should have been resolved by a developer who merged a pull request, and the tests should confirm the fix works. If an issue is still open, or if the resolution is disputed, there is no ground truth to evaluate against.
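As a sketch, the three criteria above amount to a simple filter over issue metadata. Every field name here (`fail_to_pass_tests`, `needs_external_services`, `merged_fix`) is invented for illustration, not taken from any real benchmark's schema; an actual pipeline would extract this information from GitHub's API and CI logs.

```python
# Hypothetical filter mirroring the three criteria. All field names are
# illustrative; real pipelines derive them from GitHub data and CI logs.

def is_usable_task(issue: dict) -> bool:
    """Keep an issue only if it can become a fair, verifiable evaluation task."""
    verifiable = bool(issue.get("fail_to_pass_tests"))               # tests flip fail -> pass
    self_contained = not issue.get("needs_external_services", True)  # no DBs, keys, hardware
    confirmed = issue.get("merged_fix", False)                       # a maintainer merged a fix
    return verifiable and self_contained and confirmed
```

Note the conservative default: an issue with unknown external-service requirements is rejected, which matches the spirit of strict filtering.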
When the SWE-bench team filtered real issues from real repositories through these criteria, only a small fraction survived: the original dataset kept 2,294 tasks out of roughly 90,000 candidate pull requests across twelve popular Python repositories. Strict standards produce a much smaller but much more reliable dataset.
Building the Environment
Here is something that most people who have never thought about infrastructure do not realize: you cannot just hand an AI the text of a bug report and expect it to work in your current codebase. You have to recreate the exact state of the codebase at the moment the bug was reported.
Think about what that means. Open-source software changes constantly. A repository that has 50,000 commits might look completely different from how it looked two years ago when a particular bug was filed. Files get renamed. Functions get moved. The bug itself gets fixed by the maintainers. If you just point an AI at the current state of the repository, the "bug" might not even exist anymore.
So for each benchmark task, you need to reconstruct the repository at a specific moment in history: the exact commit before the fix was applied. The AI needs to see the broken code.
This gets complicated fast.
Every software project depends on other software libraries. Your bug from 2022 might have required version 2.3.1 of a certain library, and if you install version 2.5.0 instead, the behavior might be different in subtle ways. The tests might fail or pass for reasons that have nothing to do with the bug you are testing.
This means every task needs its complete dependency tree locked to specific versions. Not just the libraries the project imports directly, but the entire transitive tree of dependencies those libraries themselves pull in. It is the software equivalent of recreating the exact atmospheric conditions needed to reproduce a particular weather pattern.
Most benchmark projects use Docker containers to handle this: each task gets its own isolated virtual environment with everything frozen exactly as it was at the relevant moment in time. When the AI runs tests inside this container, the results are determined only by the AI's changes, not by any environmental variation.
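One way to freeze everything is to generate a per-task Dockerfile. The sketch below is an assumption about how such a generator might look: the repository URL, commit SHA, and pinned requirements are placeholders, and a real harness would also install system packages, build tools, and test runners.

```python
# Illustrative generator for a per-task Dockerfile. Every input is a
# placeholder; real harnesses also install system packages and test tooling.

def make_task_dockerfile(repo_url: str, pre_fix_commit: str,
                         python_version: str, pinned_requirements: list) -> str:
    """Emit a Dockerfile that freezes the interpreter version, the repository
    state (the commit *before* the fix), and the pinned dependency tree."""
    return "\n".join([
        f"FROM python:{python_version}",                     # exact interpreter version
        f"RUN git clone {repo_url} /repo",
        "WORKDIR /repo",
        f"RUN git checkout {pre_fix_commit}",                # the broken, pre-fix state
        f"RUN pip install {' '.join(pinned_requirements)}",  # e.g. numpy==1.21.2
    ])
```

Building one image per task is wasteful but safe: nothing an agent does in one container can leak into another task's environment.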
Writing Tests That Actually Verify the Right Thing
When a task already has existing tests that confirm the fix, the work here is to verify those tests are reliable. But sometimes the existing tests are not quite right for evaluation purposes, and occasionally there are no useful tests at all.
This is where oracle test design comes in. An oracle test is a test specifically designed to check whether the bug is fixed. It needs to do exactly two things reliably:
- Fail when the bug is present (before any fix is applied)
- Pass when the correct fix is applied
This sounds simple. It is surprisingly hard.
The test should not be too broad. A test that checks a large, complex function might pass for several different reasons, not all of which mean the actual bug was fixed. You want a test that specifically exercises the code path that was broken.
The test should not rely on implementation details. If your test checks that the fix uses a specific variable name or calls a specific helper function, it will fail for an AI that solves the problem a different but equally correct way. The test should check the behavior, not the approach.
The test should not have timing or randomness issues. A test that sometimes passes and sometimes fails on the same code is useless. Any non-deterministic behavior in the test invalidates everything.
The best oracle tests are ones that were already written by the developers who fixed the bug, specifically to confirm the fix worked. When those tests exist and are reliable, they provide a clean, unambiguous signal. When they do not exist, writing them requires deep understanding of the code and the bug.
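To make the two requirements concrete, here is a toy oracle test for an invented bug: suppose a function `parse_price` used to crash on price strings without a decimal point. Both the function and the bug are hypothetical; the point is that the test asserts observable behavior and never peeks at how the fix was implemented.

```python
# Everything here is invented for illustration. The fixed `parse_price` is
# included so the example runs; assume the buggy version raised ValueError
# on inputs like "$7".

def parse_price(text: str) -> float:
    """Post-fix behavior: accept prices with or without a decimal point."""
    return float(text.strip().lstrip("$"))

def test_parse_price_without_decimal_point():
    # Fails on the buggy version, passes on any correct fix, and makes no
    # assumptions about variable names, helpers, or the shape of the patch.
    assert parse_price("$7") == 7.0
    assert parse_price("$7.50") == 7.5
```

Because the test only checks inputs and outputs, an agent that fixes the bug with a regex, a conditional, or a refactor all score the same: correct.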
The Contamination Problem
Here is a challenge that is unique to AI benchmarking: the AI you are testing might have already seen the answers.
Every large AI model is trained on enormous amounts of text scraped from the internet, and GitHub is a significant part of that training data. The bug reports, the discussions, the pull requests, the patches, all of it might already be in the model's memory.
If an AI has essentially memorized a specific bug fix during training, it is not solving the problem during evaluation. It is recalling the answer. The benchmark score becomes a measurement of memory, not capability.
This is called training data contamination, and it is one of the harder problems in AI evaluation to fully solve.
Several approaches help. Date filtering is the most straightforward: only include issues that were filed after the model's training data cutoff. If the model was trained on data up to a certain date, issues from after that date could not have been seen during training.
The limitation is that you need to know the cutoff accurately, and many AI companies do not publish exact details. And even with date filtering, a model might have seen similar enough code in other repositories that it can recognize the pattern.
Another approach is to keep a portion of the evaluation set private. Only the team running the benchmark knows these tasks. This prevents anyone from training specifically to score well on the public leaderboard, and it means the private results are a cleaner measure of genuine capability.
No approach eliminates contamination entirely. It is a spectrum, not a binary. What matters is being aware of it and taking reasonable steps to limit its impact on the results.
Running the Evaluation
With environments built, tests written, and contamination addressed, you can run an actual evaluation.
For each task, the process goes like this:
Start the environment in the broken state. The tests are failing. The bug exists. The AI begins with access to the issue description and the codebase.
The AI can take actions: read files, search for patterns, edit code, run the test suite, inspect the output, read more files, make more edits. It works until it either decides it is done or hits a limit on how many actions it can take.
When the AI signals that it is finished, or when it hits the limit, its changes are collected and the oracle test suite is run against them. Pass or fail.
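The loop above can be sketched abstractly. The `Action` shape and agent interface here are invented for illustration; real harnesses give the agent much richer tooling, including file viewers, search, editors, and a shell.

```python
from dataclasses import dataclass

# All interfaces here are invented for illustration; real harnesses expose
# file viewers, search commands, editors, and a shell to the agent.

@dataclass
class Action:
    kind: str            # "read", "search", "edit", "run_tests", or "finish"
    detail: str = ""

class ScriptedAgent:
    """Toy agent that replays a fixed action script, then signals done."""
    def __init__(self, script):
        self.script = list(script)

    def next_action(self, observation):
        return self.script.pop(0) if self.script else Action("finish")

def run_episode(agent, max_steps=50):
    """Collect actions until the agent finishes or the step budget runs out."""
    taken = []
    for _ in range(max_steps):
        action = agent.next_action(observation=None)
        if action.kind == "finish":
            break
        taken.append(action)
    return taken

trace = run_episode(ScriptedAgent([
    Action("read", "models.py"),
    Action("edit", "guard against empty queryset"),
    Action("run_tests"),
]))
```

The step budget matters: without it, an agent stuck in a loop of reading the same file forever would never terminate, and evaluation runs would never finish.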
The primary metric is the fraction of tasks where all oracle tests pass. This is a strict measurement. A partial fix, one that fixes the main behavior but misses a related edge case, does not get partial credit. Either the tests all pass or they do not.
This strictness has trade-offs. An agent that got 90% of the way to the right answer receives the same score as one that did not try. But the alternative, assigning partial credit, requires defining what partial credit means, which introduces subjectivity and judgment calls that undermine the benchmark's reliability.
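The all-or-nothing rule is easy to state in code. The result lists below are fabricated to show how a partial fix scores:

```python
# Fabricated oracle results for three tasks: one full fix, one partial fix
# (main behavior fixed, edge case missed), one failure.

def resolved(oracle_results):
    """A task counts as solved only if every oracle test passes."""
    return all(oracle_results)

def pass_rate(per_task_results):
    return sum(resolved(r) for r in per_task_results) / len(per_task_results)

results = [
    [True, True],    # full fix: counts
    [True, False],   # partial fix: no partial credit
    [False, False],  # no fix: no credit
]
rate = pass_rate(results)  # 1/3
```

The partial fix and the complete failure contribute identically to the score, which is exactly the strictness the surrounding text describes.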
Why This Work Matters
When SWE-bench was released, the best models solved only around 2% of the tasks. That number has risen substantially as models have improved, which is exactly what a well-designed benchmark should show: meaningful progress over time.
The benchmark also revealed something important about where AI capability gaps actually live. Models were better at some types of bugs than others. They were more reliable on certain programming patterns. They tended to fail in specific, identifiable ways. That information shapes research priorities in ways that aggregate scores cannot.
That is what a good agentic benchmark does. It does not just say "this model scored higher than that one." It shows you what kinds of tasks the model can and cannot do, with enough precision and reproducibility to make the information actionable.
Building one properly is expensive, slow, and full of details that matter. But the alternative, evaluating AI on tasks that do not reflect real use, tells you very little about whether the AI will actually help developers in the real world.
The gap between "AI that passes a benchmark" and "AI that helps you ship software" is only as small as the benchmark is good.
Writing about AI training, LLMs, and software engineering. Building AI products at Nomos Insights.