AI & ML · March 15, 2026

SWE-Bench Reasoning Annotation: What We Learned from 500+ Trajectories

Pass or fail only tells you if an AI agent solved a problem. It tells you nothing about how it reasoned, where it went wrong, or what made one agent dramatically better than another. Here is what we found when we looked inside the trajectories.

Nomos Insights
12 min read

When you watch a chess match between two grandmasters, the score at the end tells you who won. But if you only look at the final position on the board, you miss almost everything interesting: the opening preparation, the moment one player gained a positional advantage, the inaccuracy in move 34 that changed the game, the precise sequence that converted a small edge into a win.

The same thing happens when you evaluate AI agents on coding benchmarks using only pass/fail results.

You know who "won." You have no idea why.

We spent months annotating agent trajectories on SWE-bench-style tasks, more than 500 in total, with a specific goal: understand what is actually happening inside a trajectory that succeeds versus one that fails. Not just whether the tests passed, but why the agent made the decisions it made, where things went wrong, and what patterns separate the agents that are genuinely getting better at this kind of work from the ones that are just getting luckier.

What we found was surprising in some places and clarifying in others. This is what we learned.

What Is a Trajectory, and Why Does It Matter?

For anyone unfamiliar with SWE-bench: it is a benchmark where an AI agent is given a real bug report from a real GitHub repository and has to fix the bug. The agent can read files, search the codebase, make edits, run tests, and use other tools. It works until it either decides it is done or runs out of its action budget.

A trajectory is the complete record of everything the agent did: every file it opened, every search it ran, every edit it made, every test it ran, and every decision to move forward or change direction.
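In practice, a trajectory can be stored as a typed sequence of actions. The sketch below shows one minimal way to do that; the field names, action vocabulary, and the task ID are illustrative, not SWE-bench's actual schema.

```python
# Minimal sketch of a trajectory record. Field names and the action
# vocabulary are hypothetical, not SWE-bench's own format.
from dataclasses import dataclass, field
from typing import Literal

ActionType = Literal["read_file", "search", "edit", "run_tests"]

@dataclass
class Action:
    kind: ActionType       # what the agent did
    target: str            # file path, search query, or test command
    observation: str = ""  # tool output the agent saw in response

@dataclass
class Trajectory:
    task_id: str
    actions: list[Action] = field(default_factory=list)
    passed: bool = False   # oracle test outcome at the end

# Illustrative trajectory: explore, then edit, then verify.
traj = Trajectory(
    task_id="example-repo-issue-1",
    actions=[
        Action("read_file", "utils/text.py"),
        Action("search", "def slugify"),
        Action("edit", "utils/text.py"),
        Action("run_tests", "tests/test_text.py"),
    ],
    passed=False,
)
print(len(traj.actions))  # 4
```

Everything downstream in this post (exploration fractions, step-level labels, failure tags) can be computed over or attached to a structure like this.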

At the end, the oracle tests run. Pass or fail.

The pass/fail result is what ends up on leaderboards and in papers. The trajectory is what shows you whether the agent is actually reasoning or just guessing.

Finding 1: Most Failures Happen Before Any Code Is Written

This was the most consistent and important thing we found.

When an agent fails a task, the intuitive assumption is that something went wrong in the coding. The agent tried to write a fix and got it wrong. That is the picture most people have.

It is not what we saw.

In the majority of failed trajectories, roughly 60 to 70 percent, the critical mistake happened before the agent wrote a single line of code. The agent either identified the wrong root cause, made an assumption about where the bug was without adequate exploration, or correctly found the relevant code but misread what it was actually doing.

The coding step then executed perfectly, on the wrong problem.

Think about what this means. An agent that can write flawless, beautiful code will still fail most of the time if it is diagnosing the wrong issue. And diagnosis is a skill that is completely invisible in pass/fail evaluations, because a wrong diagnosis followed by correct implementation looks identical to a right diagnosis followed by wrong implementation: both fail the tests.

For AI training, this has a direct implication. If you want to improve agents on these tasks, the highest-leverage training signal is not teaching better code generation. It is teaching better diagnostic reasoning: how to explore a codebase systematically, how to form and test hypotheses, how to update your beliefs when new information contradicts your original guess.
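One way to make this distinction concrete in an annotation schema is to tag each failure with the stage where the critical mistake happened. This is a minimal sketch using the categories described above; the enum names are ours, not a standard vocabulary.

```python
# Hypothetical failure-stage tags matching the categories described above.
from enum import Enum

class FailureStage(Enum):
    WRONG_ROOT_CAUSE = "identified the wrong root cause"
    PREMATURE_ASSUMPTION = "assumed the bug's location without adequate exploration"
    MISREAD_CODE = "found the relevant code but misread what it does"
    IMPLEMENTATION_ERROR = "correct diagnosis, faulty fix"

# Everything except the last stage happens before any code is written,
# which is where the 60-70 percent of critical mistakes concentrated.
pre_coding = [s for s in FailureStage if s is not FailureStage.IMPLEMENTATION_ERROR]
print(len(pre_coding))  # 3
```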

Finding 2: Agents That Explore First Win More

This finding follows naturally from the first one, but it deserves its own discussion because the pattern was so clear.

We looked at how agents distributed their actions across successful versus failed trajectories. In successful trajectories, agents typically spent the first 30 to 40 percent of their action budget on exploration before making any file edits. They read files, ran searches, traced code paths, and built up an understanding of the system before touching anything.

In failed trajectories, the fraction of exploration before the first edit was much smaller. Agents would read a file or two, form a quick hypothesis, and start writing changes. Sometimes the hypothesis was right and they got lucky. More often it was not.
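The exploration fraction we measured is simple to compute from an action log: the share of the trajectory that happens before the first edit. A minimal sketch, assuming each action is recorded by kind:

```python
# Sketch: fraction of the action budget spent before the first edit.
# Assumes a simple per-action log of kinds like "read_file" or "edit".
def exploration_fraction(action_kinds: list[str]) -> float:
    """Share of actions taken before the first 'edit'."""
    for i, kind in enumerate(action_kinds):
        if kind == "edit":
            return i / len(action_kinds)
    return 1.0  # never edited: the entire trajectory was exploration

# Illustrative logs mirroring the pattern we observed.
successful = ["read_file", "search", "read_file", "read_file", "edit",
              "run_tests", "edit", "run_tests", "read_file", "run_tests"]
failed = ["read_file", "edit", "run_tests", "edit", "run_tests"]

print(exploration_fraction(successful))  # 0.4
print(exploration_fraction(failed))      # 0.2
```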

This mirrors how experienced developers actually work. When you encounter an unfamiliar bug in an unfamiliar codebase, the worst thing you can do is start changing things based on your first impression. The first impression is almost always incomplete and often wrong. Good debugging means resisting the urge to act, spending time understanding, and only making changes when you have enough confidence that you know what you are fixing.

The agents that had internalized this approach solved significantly more tasks.

One implication for training: if you want to reward agents for exploring before acting, you need training data that captures this distinction. A trajectory that fails but shows excellent exploration should be treated differently than one that fails because the agent guessed wrong from the start. Both fail, but one is better reasoning.

Finding 3: Test Feedback Is the Most Underused Tool

Every agent in our study had access to the test suite. Running the tests costs one action and immediately tells you whether the current state of the code passes or fails. This is an extraordinarily valuable signal.

Successful agents used it like a feedback loop. They would make a change, run the tests, read the output carefully, and use what they learned to decide what to do next. If the tests showed a failure in an unexpected function, they would go read that function. If an error message pointed to a specific line, they would investigate that line. Test output was treated as new evidence.

Many unsuccessful agents ran tests differently. They would finish writing their fix, run the tests once to check, observe that the tests were still failing, and then make a small adjustment to the fix without going back to reconsider their diagnosis. If the second version also failed, some made another small adjustment. The test output was being treated as a confirmation check, not a diagnostic tool.

This distinction matters a lot. When tests fail after a patch, the most important question is: are they failing because of a different problem than I thought, or because my fix is wrong in a specific way? The answer changes what you should do next. Agents that read and interpreted test output made better next decisions. Agents that just noted "still failing" and tried minor variations were often stuck in a loop.
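The diagnostic pattern can be sketched as a decision rule over the test output: if the failure points somewhere outside the current hypothesis, revisit the diagnosis; if it confirms the hypothesized location, adjust the patch. The helper below is hypothetical and deliberately crude (a real version would parse the traceback), but it captures the branch that separated the two groups.

```python
# Hypothetical sketch of treating test output as evidence rather than
# a confirmation check. Real parsing would inspect the full traceback.
def interpret_failure(output: str, hypothesis_file: str) -> str:
    """Decide the next move from failing test output."""
    # If the failure points at code outside our hypothesis, the
    # diagnosis itself is suspect, not just the patch.
    if hypothesis_file not in output:
        return "revise_hypothesis"
    return "adjust_patch"

print(interpret_failure(
    "FAILED tests/test_text.py - AssertionError raised in utils/html.py:88",
    hypothesis_file="utils/text.py"))  # revise_hypothesis
print(interpret_failure(
    "FAILED tests/test_text.py - AssertionError raised in utils/text.py:42",
    hypothesis_file="utils/text.py"))  # adjust_patch
```

The unsuccessful agents, in effect, always took the `adjust_patch` branch regardless of what the output said.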

For annotation, this means that how an agent uses tool feedback is worth tracking as a quality signal in its own right. It is separate from whether the agent eventually got the right answer.

Finding 4: Looking at Steps Changed What We Understood

Our initial annotation was focused on the final state: did the agent fix the bug? We expanded this later to annotate individual steps within trajectories, and that decision changed everything we understood about agent quality.

When you only look at final outcomes, certain trajectories look identical. Three trajectories all fail. But when you look inside them, they are completely different:

Trajectory A failed because the agent correctly identified the problem in step 4, then made a single implementation mistake in step 11 that was small but caused the oracle tests to fail.

Trajectory B failed because the agent made a reasonable but wrong initial hypothesis, spent most of the trajectory working on the wrong part of the codebase, and never got close to the actual bug.

Trajectory C failed because the problem required understanding a component the agent had no apparent familiarity with, and it essentially made random edits hoping something would work.

In a pass/fail evaluation, these three look the same. As training examples, they should be treated very differently. Trajectory A contains mostly excellent reasoning that happened to fall short at the last step. Using it as a straightforwardly negative example might actually penalize good behavior. Trajectories B and C are negative for genuinely different reasons.
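The three shapes above are what a failure-mode taxonomy has to distinguish. A minimal sketch; the label names are ours, not a standard vocabulary.

```python
# Illustrative failure-mode labels for the three trajectory shapes above.
FAILURE_MODES = {
    "near_miss":      "right diagnosis, small implementation slip (Trajectory A)",
    "wrong_approach": "plausible but wrong hypothesis pursued throughout (Trajectory B)",
    "no_traction":    "no working model of the component; essentially random edits (Trajectory C)",
}

print(len(FAILURE_MODES))  # 3
```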

Step-level annotation is expensive. It requires reviewers to read an entire trajectory action by action and assess the quality of each decision in context, which means understanding the codebase well enough to evaluate what the agent knew at each moment. It takes three to five times longer than final-state evaluation.

But for training data specifically designed to improve reasoning quality, not just outcome quality, it is the right level to annotate at.

Finding 5: The "Almost" Problem

This one frustrated our annotation team more than anything else.

A significant fraction of failed trajectories had a specific structure: correct root cause identification, mostly correct fix, small mistake at the very end.

The agent found the right file. It understood what was broken. It wrote a patch that addressed the actual issue. And then it made a small implementation error: wrong variable name, off-by-one in an index, a missing return statement. The oracle tests fail.

From a final-state perspective, this is indistinguishable from a trajectory that was completely wrong. Both fail.

But treating them identically in training seems wrong. The agent in the first trajectory did most of the hard work correctly. It reasoned well. It navigated the codebase effectively. It identified the problem. The failure was a small mechanical error, not a reasoning failure.

Feeding this as a negative example penalizes the correct reasoning alongside the incorrect implementation.

We started tagging these trajectories separately in our annotation schema, calling them "near misses," distinct from "wrong approach" failures. This tagging feeds into training decisions: near-miss trajectories can be used differently than wrong-approach trajectories, for example by applying a smaller negative weight or by extracting the good reasoning steps as positive examples in a step-level training setup.
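One way the tags feed into training is through differential loss weighting. The sketch below assumes the "near miss" and "wrong approach" labels from our schema; the specific weight values are illustrative, not tuned numbers.

```python
# Sketch of differential negative weighting by failure tag.
# The weight values are illustrative placeholders, not tuned.
def negative_weight(tag: str) -> float:
    """How strongly to penalize a failed trajectory, by failure mode."""
    weights = {
        "near_miss": 0.3,       # penalize lightly: most reasoning was sound
        "wrong_approach": 1.0,  # full negative signal
    }
    return weights.get(tag, 1.0)  # default: treat unknown tags as full failures

print(negative_weight("near_miss"))       # 0.3
print(negative_weight("wrong_approach"))  # 1.0
```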

There is no clean universal solution here. But recognizing the distinction and encoding it in your annotation data is significantly better than treating all failures as equivalent.

Finding 6: Calibrating Annotators Is Harder Than It Looks

For binary pass/fail annotation, calibration is straightforward. The tests either pass or they do not. Two reviewers looking at the same final state will agree completely.

For step-level reasoning quality, calibration is much harder.

Our annotators disagreed most often on questions like these: Is this file read a strategic exploration or is the agent flailing? Is this hypothesis reasonable given what the agent has seen so far? Is this edit a logical response to the evidence, or a guess?

These questions require putting yourself in the agent's position. What did the agent know at this step? Given what it knew, was this decision reasonable? Different reviewers with different debugging backgrounds had different intuitions about what reasonable looks like.

The approach that worked best for us was building a library of annotated step examples. For each type of action (file read, search query, code edit, test run), we collected examples at different quality levels and wrote explanations of why each was rated the way it was. Reviewers could consult these examples when they encountered a difficult judgment call.
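A calibration library like this can be as simple as a flat list of rated examples keyed by action type. A minimal sketch, assuming one record per (action type, quality level) pair with a written rationale; the field names and rating scale are illustrative.

```python
# Sketch of a calibration-library entry. Field names and the 1-5
# rating scale are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CalibrationExample:
    action_type: str   # "file_read" | "search" | "edit" | "test_run"
    quality: int       # 1 (poor) .. 5 (excellent)
    excerpt: str       # the step as the annotator would see it
    rationale: str     # why it was rated the way it was

library = [
    CalibrationExample(
        "file_read", 5,
        "Opened the module named in the traceback before editing anything",
        "Directly motivated by the error evidence; a strategic read."),
    CalibrationExample(
        "edit", 1,
        "Changed a constant in an unrelated config file",
        "Nothing in the trajectory connects this file to the bug."),
]

def lookup(action_type: str) -> list[CalibrationExample]:
    """Reference examples an annotator can consult for a hard call."""
    return [e for e in library if e.action_type == action_type]

print(len(lookup("file_read")))  # 1
```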

This did not eliminate disagreement. But it gave reviewers a common vocabulary and a shared set of reference points, which reduced the variance enough to make step-level annotation tractable at scale.

Even with this work in place, inter-rater agreement on step quality remained lower than agreement on final-state correctness. This is an honest limitation. Step-level annotation of code reasoning is genuinely hard, and teams building annotation pipelines should budget accordingly for calibration time.

What to Take Away

These six findings point toward a few practical changes for teams building training data for coding agents:

Do not treat all failures as equivalent. An agent that got the diagnosis right but the implementation wrong is different from one that never understood the problem. Encoding this difference in your annotation schema produces better training signals.

Track diagnostic quality separately from implementation quality. Most improvement in agent performance will come from better diagnosis, not better coding. If your evaluation data does not distinguish these, you are flying blind on the thing that matters most.

Build step-level annotation for high-value data. It is expensive, but training data with step-level quality signals is substantially more valuable for improving agent reasoning than training data with only final-state labels.

Invest heavily in annotator calibration examples. Abstract rubric language is not enough for reasoning quality evaluation. Concrete examples that annotators can reference when making difficult calls are what actually produce consistent data.

Pass/fail tells you who won. Trajectory annotation tells you why, and why is where the insights for improvement live.


#SWE-Bench #Code Annotation #AI Training #Trajectories
Nomos Insights

Writing about AI training, LLMs, and software engineering. Building AI products at Nomos Insights.
