What Makes a Good RLHF Rubric for Coding Tasks
A rubric sounds simple: a list of criteria, a scoring scale, and some instructions. But most rubrics for AI code evaluation are quietly broken in ways that poison the training data. Here is what good rubric design actually looks like.

Think about the last time you got feedback that felt unfair.
Maybe you turned in a project and got docked twice for the same mistake under two overlapping criteria. Or the grading criteria were vague enough that two different reviewers gave you completely different scores for the same work. Or you scored perfectly on a minor detail but got zero credit for the fact that your core work was solid.
That experience, frustrating as it is for a student, becomes a serious problem when it is happening millions of times inside an AI training pipeline.
This is what a bad rubric does to RLHF data. It introduces inconsistency, double-penalizes certain mistakes, and rewards the wrong behaviors. The AI learns from all of it, and it learns the wrong things.
Getting rubric design right is arguably the most underinvested part of building code-generating AI. Teams spend months on model architecture and weeks on infrastructure. The rubric gets a few days, sometimes less. Then everyone wonders why the model behaves unexpectedly.
This post is about what actually makes a rubric work.
First, What Is a Rubric and Why Does It Matter So Much?
RLHF works like this: a human looks at two AI-generated responses, picks the better one, and the AI uses that preference signal to improve. Do this enough times, and the AI learns what "better" means.
But "better" is only as meaningful as the criteria the human is using to judge. If the human is judging by a clear, consistent standard, the AI learns a clear, consistent lesson. If the human is using personal intuition or vague guidelines, the AI learns whatever patterns happen to emerge from a mixture of individual biases.
The rubric is what turns personal intuition into shared, consistent judgment. It is the document that says: here are the dimensions that matter, here is what each score level looks like, here is how to handle tricky cases.
When it is done well, two different evaluators looking at the same code will arrive at the same score. When it is done badly, every evaluator is essentially using a different grading system, and the training data is noise.
The First Principle: One Question Per Criterion
Here is a rubric criterion that seems perfectly reasonable:
"Rate the quality of the code, considering both its correctness and how clearly it is written."
Read it again and notice the problem. It is asking two questions in one.
What does a reviewer do when the code is completely correct but barely readable? Or beautifully structured but has a subtle bug? They have to invent some private weighting system to combine these two things into a single score. Every reviewer invents their own version.
Reviewer A decides correctness is 70% of the score and readability is 30%. Reviewer B does the opposite. Reviewer C just goes with their gut. Three different people, three different grading systems, all feeding into the same training dataset as if they were measuring the same thing.
The fix is simple but requires discipline: one criterion per question.
Correctness gets its own dimension. Readability gets its own dimension. Edge case handling gets its own dimension. Each one is answered independently.
This is called making criteria atomic. Each criterion is a single, indivisible question. If you can imagine a response that scores high on one part and low on another, you have a compound criterion that needs to be split.
The reason this matters beyond consistency: once you have separate dimensions, you can weight them. If correctness matters more than readability in your use case, you can encode that explicitly in how you aggregate scores. When they are bundled together, you lose that control entirely.
The Second Principle: Never Penalize the Same Mistake Twice
This one is sneaky and it shows up constantly in rubrics that were designed without thinking it through.
Imagine a rubric with two criteria: "functional correctness" and "production readiness."
An evaluator sees a function that crashes when passed an empty list. They mark down the functional correctness score because the function is wrong. Then they look at "production readiness" and think: well, code that crashes on empty input would obviously fail in a real production system. So they mark that down too.
Both judgments feel reasonable. But they are penalizing the same bug twice.
In the aggregated training data, this bug is now carrying twice the negative weight that the rubric intended. The AI learns that this category of error is more serious than it actually is, relative to everything else the rubric captures. The reward model's internal sense of what matters gets skewed.
This is called double jeopardy, and it silently corrupts training data in ways that are very hard to detect after the fact.
The fix is to map each type of defect to exactly one criterion. Bugs in logic go under correctness. Missing error handling goes under robustness. Security issues have their own criterion. The production readiness criterion then covers only the things that are not already captured elsewhere, like hardcoded values that would need to change between environments, or architecture decisions that would not scale.
A useful exercise: take a list of the common failure modes you expect to see in code responses, and for each one, write down which single criterion captures it. If you find yourself writing two criteria for the same failure mode, you have a double jeopardy problem to fix.
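That exercise is easy to automate. A minimal sketch: record each anticipated failure mode alongside the criterion you assigned it to, then flag any failure mode that appears under more than one criterion. The failure modes and criterion names below are illustrative, not a canonical taxonomy.

```python
def find_double_jeopardy(assignments):
    """Return failure modes that were assigned to more than one criterion."""
    seen = {}
    for mode, criterion in assignments:
        seen.setdefault(mode, set()).add(criterion)
    # Sort the criterion names so the output is deterministic.
    return {m: sorted(c) for m, c in seen.items() if len(c) > 1}

# A draft mapping where two criteria quietly claim the same bug.
draft = [
    ("logic bug", "correctness"),
    ("crashes on empty input", "correctness"),
    ("crashes on empty input", "production_readiness"),
    ("hardcoded environment values", "production_readiness"),
]

print(find_double_jeopardy(draft))
# {'crashes on empty input': ['correctness', 'production_readiness']}
```

Any entry in the output is a double jeopardy problem: pick one criterion to own that failure mode and explicitly exclude it from the other.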
The Third Principle: Anchors, Not Descriptions
Here is a scoring instruction you will find in many rubrics:
"Rate functional correctness on a scale of 1 to 5, where 1 is very poor and 5 is excellent."
That is not a rubric. That is an invitation to use personal judgment.
One evaluator's "3" is another evaluator's "4." One person is reluctant to give 5s because that feels too perfect. Another gives 5s freely because the code mostly works. Both of them follow the instructions, and both produce scores that are incompatible with each other.
Anchors fix this. An anchor is not a description of a score level; it is an example.
Here is what an anchored scoring guide looks like for functional correctness:
Score 1: The code does not attempt to solve the problem, or produces obviously wrong output for basic inputs. Example: a function supposed to sort a list that returns the list unchanged.
Score 2: The code solves part of the problem but fails on common, expected cases. Example: a sort function that works for positive integers but breaks when given negative numbers or duplicates.
Score 3: The core logic is correct, but the code fails on a meaningful edge case that a reasonable user might encounter. Example: a sort function that handles most inputs but crashes on an empty list.
Score 4: The code works correctly for all normal and most edge cases. Failures only occur on unusual inputs that would require specific documentation to handle. Example: a sort function that handles all numeric inputs but has undefined behavior with NaN values.
Score 5: The code handles all expected inputs correctly, including documented edge cases, and behaves predictably on unusual inputs. Example: a sort function with explicit handling for empty lists, duplicates, negative numbers, and type mismatches.
Now two evaluators looking at the same function have a shared reference point. The examples are the anchor. This does more to reduce disagreement between evaluators than any amount of written description.
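To make the Score 5 anchor concrete, here is a rough sketch of a sort function that would earn it under this guide: empty lists, duplicates, and negatives are handled, and type mismatches or NaN are rejected with a clear error rather than failing unpredictably. This is an illustration of the anchor, not a reference implementation.

```python
import math

def sort_numbers(values):
    """Sort a list of numbers, behaving predictably on edge cases.

    - Empty list: returns a new empty list.
    - Duplicates and negatives: handled by an ordinary comparison sort.
    - Type mismatches and NaN: rejected with a clear error instead of
      silently producing an unpredictable ordering.
    """
    if not isinstance(values, list):
        raise TypeError(f"expected a list, got {type(values).__name__}")
    for v in values:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise TypeError(f"expected numbers, got {v!r}")
        if isinstance(v, float) and math.isnan(v):
            raise ValueError("NaN has no defined ordering")
    return sorted(values)

print(sort_numbers([3, -1, 3, 0]))  # [-1, 0, 3, 3]
print(sort_numbers([]))             # []
```

The same function with the NaN check removed would sit at Score 4 under the anchors above: correct for all normal inputs, undefined behavior on one unusual one.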
The Fourth Principle: State Your Priorities Explicitly
A rubric that treats all dimensions as equally important is telling evaluators a lie.
In almost every code evaluation context, correctness matters more than readability. A function that is hard to read but gets the right answer is better than one that is elegant and wrong. But if your rubric has five criteria all scored 1-5, and you average them together to get a final score, you are implicitly telling evaluators that these things are equal.
Evaluators sense this tension and resolve it inconsistently. Some weight correctness heavily in their own heads. Others take the rubric literally and average everything evenly. The result is more noise in the data.
There are two clean ways to handle this.
The first is a written priority rule. Something like: "If a response scores below 3 on functional correctness, it cannot receive an overall score above 3, regardless of other dimensions." This makes the hierarchy explicit without complicated math.
The second is a weighted formula. Define explicitly that correctness contributes 50% of the total score, edge case handling 20%, readability 15%, and explanation quality 15%. The exact numbers matter less than the fact that they are written down and shared with all evaluators.
Either approach is better than leaving the weighting implicit.
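Both approaches are straightforward to encode, and combining them is common. A minimal sketch using the four dimensions and weights named above (the numbers are illustrative):

```python
# Illustrative weights from the example above; tune these for your use case.
WEIGHTS = {
    "correctness": 0.50,
    "edge_cases": 0.20,
    "readability": 0.15,
    "explanation": 0.15,
}

def overall_score(scores):
    """Weighted 1-5 score with an explicit priority rule: a response
    scoring below 3 on correctness is capped at 3 overall."""
    weighted = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    if scores["correctness"] < 3:
        weighted = min(weighted, 3.0)
    return round(weighted, 2)

# Elegant but wrong: high marks everywhere except correctness.
print(overall_score({"correctness": 2, "edge_cases": 5,
                     "readability": 5, "explanation": 5}))  # 3.0 (capped)
```

Without the cap, this response would average out to 3.5 and look better than a plainer but correct one, which is exactly the inversion the priority rule exists to prevent.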
The Fifth Principle: Test Before You Scale
You would not deploy software without testing it. A rubric is no different.
Before you send a rubric out to a hundred evaluators, test it with five. Have them independently rate the same 50 responses. Then compare their scores.
The metric you want to look at is called inter-rater agreement. The simplest version is just the percentage of tasks where two evaluators gave the same score. More sophisticated versions like Cohen's kappa account for the fact that some agreement happens by chance.
If your agreement rate is high, around 80% or above when adjacent scores count as agreement, your rubric is working. If it is low, you have a problem to diagnose.
Go through the specific cases where evaluators disagreed. Almost always, you will find a pattern: a particular type of response that the rubric does not clearly cover. A type of code error that falls between two criteria. A situation where the examples in the rubric did not anticipate the real-world variation you are seeing.
Fix those gaps and test again. Two rounds of this before scaling usually produces a rubric that holds up at volume.
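The pilot comparison itself takes only a few lines of code. This sketch computes raw agreement with adjacent scores counted as matches, plus Cohen's kappa for exact agreement between two raters; the score lists are made up for illustration.

```python
from collections import Counter

def adjacent_agreement(a, b):
    """Fraction of items where the two raters' scores differ by at most 1."""
    return sum(abs(x - y) <= 1 for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for exact agreement between two raters."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters emit the same score
    # if each followed only their own marginal distribution.
    expected = sum(counts_a[s] / n * counts_b[s] / n
                   for s in set(a) | set(b))
    return (observed - expected) / (1 - expected)

rater_a = [3, 4, 2, 5, 3, 1, 4, 4, 2, 5]
rater_b = [3, 4, 3, 5, 2, 1, 4, 5, 2, 4]

print(adjacent_agreement(rater_a, rater_b))          # 1.0
print(round(cohens_kappa(rater_a, rater_b), 2))      # ~0.49
```

Here exact agreement is only 6 out of 10, so a kappa near 0.5 signals the rubric needs another round, even though every disagreement is within one point.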
The Sixth Principle: Update as You Learn
A rubric is a living document, not a contract carved in stone.
Evaluators will encounter cases you did not anticipate. They will ask questions that reveal ambiguities. Real responses from real AI models will surface edge cases that your rubric design did not account for.
Build a mechanism for this. Give evaluators a way to flag ambiguous cases. Review those flags regularly. When a pattern emerges, update the rubric to address it.
Every update needs documentation: what changed, why, and when. This matters because data collected before and after a rubric change may not be fully comparable. If you combine them in training without accounting for the change, you introduce systematic inconsistency.
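One lightweight way to keep that documentation usable is to version the rubric and tag every collected label with the version it was scored under, so pre- and post-change data can be separated at training time. The field names and changelog entries below are hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class RubricVersion:
    version: str
    effective: date
    change: str       # what changed
    rationale: str    # why it changed

# Hypothetical changelog; real entries come from evaluator flags.
CHANGELOG = [
    RubricVersion("1.0", date(2024, 1, 10), "Initial rubric", "-"),
    RubricVersion("1.1", date(2024, 2, 3),
                  "Moved 'crashes on empty input' from production "
                  "readiness to robustness",
                  "Evaluators were double-penalizing the same bug"),
]

# Each label records which rubric version produced it.
label = {"task_id": "t-0042", "score": 4, "rubric_version": "1.1"}
```

With this in place, a training run can filter or reweight labels by `rubric_version` instead of silently mixing data collected under different standards.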
The teams that treat rubric refinement as an ongoing responsibility produce training data that gets better over time. The teams that write a rubric once and never revisit it produce data that slowly drifts from what they actually want to measure.
Putting It Together
A rubric that works for AI code evaluation has six properties:
- Every criterion measures exactly one thing (atomic)
- Each type of code defect maps to one criterion only (no double jeopardy)
- Score levels are shown with concrete examples, not just described (anchored)
- The relative importance of dimensions is explicit, not assumed (weighted)
- The rubric is tested with a small group before being scaled up (validated)
- It is updated regularly based on what evaluators actually encounter (living)
None of this is complicated in principle. What makes it hard is that it requires discipline and iteration. The first version of any rubric will have gaps. The question is whether you have a process for finding and fixing those gaps before they corrupt your training data.
The reward model you train is only as good as the human signal it learned from. The human signal is only as good as the rubric that shaped it.
Everything flows from getting this right.
Writing about AI training, LLMs, and software engineering. Building AI products at Nomos Insights.