Why AI Models Need Engineers, Not Annotators, for Code Evaluation
Most teams building AI coding assistants are training them on feedback from people who cannot actually read code. Here is what goes wrong, why it is hard to detect, and what the right approach looks like.

Imagine you are renovating your home and you hire a contractor to rewire the electrical system. The work takes three weeks. When it is done, you bring in an inspector to check the quality.
But the inspector you hired has never actually worked with electrical systems. They have read about them, sure. They know what wires should look like. They can tell if the cables are neatly bundled and labeled. They can see that the junction box is properly closed and the work looks tidy.
What they cannot tell you is whether the circuit will trip under load, whether the grounding is correct, or whether there is a hidden fault that will cause a fire six months from now.
Now replace "electrical system" with "code" and "inspector" with "AI training data annotator."
This is exactly the situation most AI teams are in when they use non-engineers to evaluate the code generated by their models.
The Part That Surprises Most People
Here is the thing that surprises a lot of people who are new to AI development: the quality of an AI model's outputs does not come from the model alone. It comes, in large part, from the human feedback the model was trained on.
The process is called RLHF, which stands for Reinforcement Learning from Human Feedback. In plain terms: you show the AI two different responses to the same question, a human picks the better one, and the AI learns from that choice. Repeat this millions of times and the AI gradually learns what "good" looks like.
If the humans picking "better" actually know what better looks like, the model learns well. If they do not, it learns the wrong lessons, often confidently.
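The pairwise-comparison step described above can be sketched in a few lines. This is an illustrative sketch only; the names (`PreferencePair`, `collect_preference`) are hypothetical, not a real RLHF library API:

```python
# Minimal sketch of collecting one preference pair for RLHF.
# All names here are illustrative, not a real training API.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the response the annotator preferred
    rejected: str  # the response they passed over

def collect_preference(prompt, response_a, response_b, annotator_picks_a):
    """Record a single human judgment as a training example."""
    if annotator_picks_a:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)
```

Millions of records shaped like this become the reward signal. If the annotator's pick is wrong, the "chosen" and "rejected" labels are swapped, and the model is trained toward the worse response.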
For writing, conversation, summarization, and translation tasks, this works reasonably well with careful general annotators. The qualities that matter (clarity, helpfulness, accuracy) are things most educated people can assess.
Code is different. And understanding how it is different changes the way you think about AI training.
Code Has a Property That Prose Does Not
A paragraph either makes sense or it does not. You can read it and judge it.
A function either works or it does not. And you often cannot tell which just by reading it.
Here is a real example. Look at this Python function:
```python
def find_duplicates(items):
    seen = []
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        else:
            seen.append(item)
    return duplicates
```
To someone who does not write code, this looks perfectly reasonable. It is clean. It is readable. The variable names make sense. There is a clear loop and a clear output.
To a software engineer, something is immediately wrong. The line `if item in seen` checks membership in a list, which means every check scans through the list one item at a time. On a list of 10,000 mostly unique items, that adds up to roughly 50 million comparisons in the worst case, instead of the roughly 10,000 constant-time lookups you would get by using a set.
A non-engineer annotator rating this function will likely give it a high score. It looks good. It explains itself. It would probably pass their eye test.
A software engineer would score it much lower and immediately rewrite it using a set.
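The set-based rewrite looks like this. One behavioral note: this version also reports each duplicate only once, which is usually what callers want, whereas the original repeats an item every time it reappears:

```python
def find_duplicates(items):
    seen = set()
    duplicates = set()
    for item in items:
        if item in seen:          # O(1) membership check on a set
            duplicates.add(item)  # a set also deduplicates the output
        else:
            seen.add(item)
    return list(duplicates)
```

Same length, same readability, and it stays fast at 10,000 items or 10 million. Nothing about the original's appearance hinted at the difference.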
That gap is the core problem.
What Non-Engineers Are Actually Evaluating
When someone without engineering experience evaluates code, they are not evaluating code. They are evaluating the appearance of code.
They are asking questions like:
- Does this look organized?
- Are the variable names readable?
- Is there an explanation that makes sense?
- Does the response feel thorough and complete?
- Is it longer and more detailed than the other option?
None of these are worthless questions. Organization and readability matter. A good explanation is valuable.
But they are a tiny fraction of what makes code genuinely good or bad. And when you use them as proxies for quality, the AI learns the wrong thing.
It learns to write code that looks impressive.
It learns that more comments get higher ratings.
It learns that longer explanations feel better to evaluators.
What it does not learn is how to write code that actually works correctly under real conditions.
The Specific Things Engineers Catch
Let me walk through the categories of issues that engineering expertise makes visible, because they are not obvious to people who have not spent years writing and debugging software.
Correctness on edge cases. Most functions that fail do not fail on normal inputs. They fail on the unusual ones: an empty list, a number at the maximum value the system supports, a string with special characters, a file that is unexpectedly empty. A non-engineer testing with a few normal examples will conclude the code works. An engineer immediately asks "what happens when this is empty?" and "what if this number is negative?"
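A concrete illustration of the empty-input question. The obvious one-liner `sum(numbers) / len(numbers)` looks complete and passes every test with normal inputs, but divides by zero the first time it receives an empty list. This is a small invented example, not code from any particular model:

```python
def average(numbers):
    # The naive version, sum(numbers) / len(numbers), raises
    # ZeroDivisionError on an empty list. Guard for it explicitly.
    if not numbers:
        raise ValueError("average() needs at least one number")
    return sum(numbers) / len(numbers)
```

"What happens when this is empty?" is the first question an engineer asks, and the one an eye test never answers.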
Security vulnerabilities. Some of the most dangerous code bugs are completely invisible to untrained eyes. A function that builds a database query by pasting user input directly into a string looks perfectly normal. An engineer recognizes it immediately as a SQL injection risk, one of the oldest and most exploited vulnerabilities in software. To a non-engineer, it looks fine.
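Here is what that invisibility looks like in practice, using Python's built-in `sqlite3` module with a throwaway in-memory table. The two functions below are nearly identical on the page; only one of them is safe:

```python
import sqlite3

# Throwaway in-memory database for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

def find_user_unsafe(name):
    # Looks perfectly normal, but pastes user input into the SQL string.
    # Passing "' OR '1'='1" as the name returns every row in the table.
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_safe(name):
    # Parameterized query: the driver handles escaping.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

A non-engineer comparing these two responses has no basis for preferring the second. An engineer treats the first as an automatic rejection.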
Performance problems that scale badly. The duplicate-finding example above is a clean illustration. Code that works for small examples can be catastrophically slow at real scale. Identifying this requires understanding algorithmic complexity, which is a topic most software engineers learn in school and spend years applying in practice.
Wrong use of libraries and frameworks. Modern software is built on top of massive libraries and frameworks, each with its own rules, best practices, and common mistakes. Using a React hook in the wrong way, calling a database function outside a transaction, or calling an async function without properly handling its return value are all issues that require familiarity with the specific tool. A non-expert evaluator has no frame of reference.
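The async case is a good example of how subtle these mistakes look. In the (invented) sketch below, the broken caller differs from the fixed one by a single missing `await`, yet it returns a coroutine object instead of the value:

```python
import asyncio

async def fetch_count():
    # Stands in for a real network or database call.
    await asyncio.sleep(0)
    return 42

async def broken_caller():
    # Forgot the await: `result` is a coroutine object, not 42.
    result = fetch_count()
    return result

async def fixed_caller():
    result = await fetch_count()
    return result
```

Both versions read almost identically. Only someone who works with async code regularly flags the first one on sight.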
Code that works but will break later. Some code is technically correct right now but is written in a way that will cause problems when the codebase grows. Engineers call this "technical debt." Recognizing it requires experience with what happens to code over time.
What Happens to the Model
Let us trace through exactly what happens when you train a code-generating AI on feedback from non-engineers.
In round one of training, the model generates some responses. Non-engineer annotators rate them. The responses that look clean, have helpful comments, and come with thorough explanations get high ratings. The model updates to produce more of those.
In round two, the model has gotten better at looking good. The responses are cleaner, better formatted, more thoroughly explained. Annotators rate them highly. The model updates again.
After many rounds, the model has become excellent at producing code that reads well and impresses people who are not checking carefully. But it has not been reliably rewarded for correctness, for handling edge cases, or for efficiency, because those things were not being evaluated.
When this model ships, developers start using it. For simple tasks, it works fine. For anything with real complexity, subtle bugs appear. The code works in testing but fails in production. It handles the happy path but crashes on edge cases.
The team is puzzled because the benchmark scores looked good. But the benchmark scores were measuring what the annotators were measuring, which was not what actually mattered.
The Right Way to Think About This
You do not need expert engineers for every annotation task. The solution is more nuanced than that.
Think of it in three layers.
The first layer is filtering out obviously bad responses. Does the code at least attempt to solve the problem? Does it run without syntax errors? Is the response coherent? This does not require deep expertise and can be done quickly.
The second layer is correctness evaluation. Does the code actually do what it claims to do? Does it handle the standard inputs correctly? Does it handle the common edge cases? This requires real programming ability in the language and problem domain being evaluated. This is where engineering expertise is non-negotiable.
The third layer is quality evaluation. Is this code well-written? Would it hold up in a real codebase? Is there a better approach? This requires not just programming ability but engineering judgment, which comes from building real systems over time.
The more of your evaluation that happens in the first layer, the worse your training signal. The more that happens in the second and third layers, the better.
What to Look for in Engineering Evaluators
Not every developer is equally good at this. Experience matters, but so does the type of experience.
Someone who has built production systems has a different instinct for code quality than someone who has only done personal projects or academic exercises. They have seen what breaks in the real world, which shapes their evaluation in ways that are hard to replicate through training alone.
Domain match matters too. A Python data engineering specialist evaluating TypeScript React code is less reliable than someone who works in that stack regularly. The idioms, the standard libraries, and the common gotchas are different enough to matter.
And there is a less obvious quality: calibrated uncertainty. The best evaluators know when a piece of code is outside their expertise and flag it rather than rating it confidently. Someone who rates everything with the same confidence level is usually less reliable than someone who occasionally says "I'm not certain about this one."
The Uncomfortable Truth
The AI coding assistants that developers trust most are the ones trained on high-quality engineering feedback. The ones that frustrate developers, the ones that produce plausible-looking code that does not quite work, have often been trained on a weaker signal.
This is not a problem that better model architecture solves. It is not something that more training data alone fixes. It is a data quality problem, which means the solution is in how the feedback is collected.
The teams that understand this early build better products. The teams that treat annotation as a commodity task, something to be done cheaply and quickly, end up fighting the consequences of their training data for years.
Code evaluation is engineering work. The tools are different, but the judgment required is the same.
Writing about AI training, LLMs, and software engineering. Building AI products at Nomos Insights.