5 Rubric Quality Issues That Silently Degrade Your AI Training Data
Your annotation pipeline is running. The data looks clean. But your model is behaving in ways you cannot explain. The problem might be hiding in your rubric. Here are five specific issues that silently corrupt AI training data, and how to fix each one.

You spent months building your training data pipeline. You wrote guidelines. You hired annotators. You ran calibration sessions. Everything ran smoothly and the data came out looking reasonable.
Then you trained the model and something was off.
Maybe the model was overconfident about certain types of answers. Maybe it consistently preferred longer responses even when shorter ones were better. Maybe it handled correctness and style as if they were equally important, when correctness was supposed to dominate.
You looked at the training data and it all seemed fine on the surface. No obvious errors. Decent inter-rater agreement. Clean formatting.
The problem was in the rubric. It always is.
The frustrating thing about rubric problems is that they are invisible until you know what to look for. The data does not look wrong. The annotation process runs without obvious friction. The model trains to completion. The issues only reveal themselves in behavior, and by that point you have invested enormous resources into data that is subtly broken.
Here are the five rubric problems we see most often, what each one does to your training data, and how to catch them before they cause damage.
Issue 1: Criteria That Ask Two Things at Once
This is the most common rubric mistake, and the one most people do not notice because compound criteria feel natural to write.
Here is an example: "Rate the response based on how correct and readable the code is."
It seems fine. Both things matter. But watch what happens when a reviewer encounters a response that is correct but messy, or elegant but wrong.
They have to make a private decision about how to weight these two dimensions against each other. One reviewer decides correctness is what really matters and scores it a 4. Another reviewer gives equal weight to both and scores the same response a 2 because the readability pulls the number down. A third reviewer sees "readable" first and is anchored by a negative impression of the style.
Three reviewers, one response, three different scores. All of them followed the rubric correctly.
When this data trains a reward model, the model receives contradictory signals. It cannot learn a coherent relationship between code quality and score because the relationship keeps changing based on which reviewer happened to look at each response.
The fix is to split every compound criterion into separate dimensions. Correctness gets its own score. Readability gets its own score. Each question has exactly one answer. A simple test: if you can imagine a response that scores high on one part and low on another part of the same criterion, the criterion needs to be split.
This does more than improve consistency. It also lets you weight dimensions properly. Once correctness is its own dimension, you can say explicitly that it counts for 50% of the total score. When it is bundled with readability, that control is gone.
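As a rough sketch, the split rubric can be as simple as a dictionary of dimensions with explicit weights. The dimension names and weights below are invented for illustration, not from any real rubric:

```python
# Illustrative split rubric: each dimension is scored separately on the
# same 1-5 scale, and the weights are stated explicitly.
RUBRIC_WEIGHTS = {
    "correctness": 0.5,    # correctness dominates, as the team intended
    "readability": 0.3,
    "documentation": 0.2,
}

def total_score(dimension_scores: dict) -> float:
    """Combine per-dimension scores into one weighted total."""
    missing = RUBRIC_WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[d] * dimension_scores[d] for d in RUBRIC_WEIGHTS)

# A response that is correct but messy no longer forces a private
# trade-off: each dimension gets its own number.
score = total_score({"correctness": 5, "readability": 2, "documentation": 3})
print(score)  # 0.5*5 + 0.3*2 + 0.2*3 = 3.7
```

The point of the sketch is that the trade-off between dimensions now lives in the rubric, where it can be reviewed and changed, instead of in each reviewer's head.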
Issue 2: Penalizing the Same Bug Twice
This one is called double jeopardy, and it creates a distortion that is genuinely difficult to trace back after the fact.
Imagine a rubric with a "functional correctness" dimension and a separate "production readiness" dimension.
A reviewer looks at a function that crashes when given an empty input. Under functional correctness, they score it low: the function breaks on a basic edge case. Then they look at production readiness and think, well, any code that crashes on empty input would obviously fail in production, so they score that low too.
Both scores feel justified. But they are penalizing the same bug under two different labels.
In the training data, this bug now carries twice the negative weight that the rubric intended. When you aggregate scores across thousands of responses, certain categories of errors become systematically over-penalized. The reward model develops a warped internal sense of what matters.
The downstream effect shows up in the model's behavior. It will treat certain error types as catastrophic when they are merely serious. It will learn priority orderings that do not match what your team actually cares about.
The fix is to map every category of code problem to exactly one criterion. Write a list of common failure modes: logic bugs, missing edge case handling, security vulnerabilities, performance problems, poor documentation. For each one, assign it to a single criterion. If two criteria both claim ownership of the same type of problem, resolve the overlap in the rubric.
Share this mapping with your reviewers as part of training. When they see an empty-input crash, they should know exactly which dimension to capture it under, and they should know that this means the other dimensions are not affected by this particular issue.
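This mapping is also easy to audit mechanically. Here is a sketch of a double-jeopardy check, with invented criterion and defect names: each criterion declares which defect categories it owns, and the audit flags any category claimed twice.

```python
# Illustrative criterion-to-defect ownership map. The second entry
# deliberately re-claims "missing_edge_case" to show the audit firing.
CRITERION_CLAIMS = {
    "functional_correctness": {"logic_bug", "missing_edge_case"},
    "production_readiness": {"missing_edge_case", "performance_problem"},
    "documentation": {"poor_documentation"},
}

def find_overlaps(claims):
    """Return (defect, first_owner, second_claimant) triples for every
    defect category that two criteria both claim."""
    seen, overlaps = {}, []
    for criterion, defects in claims.items():
        for defect in sorted(defects):
            if defect in seen:
                overlaps.append((defect, seen[defect], criterion))
            else:
                seen[defect] = criterion
    return overlaps

# "missing_edge_case" is claimed by two criteria, so a single
# empty-input crash would be penalized under two labels.
print(find_overlaps(CRITERION_CLAIMS))
```

Running a check like this whenever the rubric changes catches double jeopardy at design time, before any annotation happens.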
Issue 3: Rankings That Contradict Themselves
When you collect preference data through pairwise comparisons (which of these two responses is better?), you are implicitly building a ranking. And rankings have a property that humans consistently violate: they should be transitive.
If response A is better than response B, and response B is better than response C, then A should be better than C. This chain should hold across the entire dataset.
It often does not.
Here is why. When a reviewer compares A to B, certain qualities jump out: maybe correctness is the most salient difference. A is more correct, so A wins. When the same reviewer compares B to C, a different quality comes to the foreground: maybe readability is now the obvious difference, and B is cleaner, so B wins. When they compare A to C, something else dominates, and C wins.
You end up with A beats B, B beats C, and C beats A. A cycle. A logical contradiction in your training data.
Cycles are not rare edge cases. They happen regularly in any large-scale preference collection effort, especially when responses are close in quality or when they differ on multiple dimensions simultaneously.
When these contradictions enter training, the reward model has to reconcile them. It cannot satisfy all of them simultaneously. What typically happens is that the model learns noisy, unstable preferences that do not generalize well. The reward model becomes unreliable precisely in the cases that matter most: when responses are close in quality.
There are a few ways to address this. The most direct is to use absolute ratings on individual dimensions rather than holistic pairwise comparisons. Absolute ratings are less prone to context effects because each response is evaluated on its own merits rather than against a shifting comparison partner.
When pairwise comparisons are used, collect enough overlapping pairs to detect cycles in the data before training. Responses that are frequently caught in cycles are worth routing to additional reviewers or to senior annotation staff for resolution.
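Cycle detection itself is cheap. Here is a minimal sketch that treats the preference data as a directed "beats" graph and runs a depth-first search for back edges; the pair format is an assumption about how your comparisons are stored.

```python
# Flag preference cycles (A beats B, B beats C, C beats A) before the
# data reaches training. Plain DFS; no external libraries.
from collections import defaultdict

def has_cycle(preferences):
    """preferences: list of (winner, loser) pairs from comparisons."""
    beats = defaultdict(list)
    for winner, loser in preferences:
        beats[winner].append(loser)

    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / in progress / done
    color = defaultdict(int)

    def dfs(node):
        color[node] = GRAY
        for nxt in beats.get(node, ()):
            if color[nxt] == GRAY:        # back edge: a cycle exists
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in list(beats))

assert has_cycle([("A", "B"), ("B", "C"), ("C", "A")])       # contradiction
assert not has_cycle([("A", "B"), ("B", "C"), ("A", "C")])   # consistent
```

In practice you would want to report which responses participate in cycles, not just whether any exist, so the affected comparisons can be routed for re-review.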
Issue 4: Numbers Without Meaning
A 1-to-5 scale where the numbers are not defined is not a rubric. It is a space where five different people construct five different personal scales and you aggregate the results as if they were the same measurement.
This happens constantly.
Reviewer A is conservative about high scores. In their mental model, a 5 means perfection, and code is never perfect, so they cap out at 4 on everything good. Reviewer B gives 5s generously because "this is clearly excellent work." Reviewer C uses the middle of the scale for most things because extremes feel harsh.
All three are following the rubric. None of them are measuring the same thing.
This problem has a specific name: scale usage drift. Even within a single reviewer, the scale can drift over the course of a long annotation session. The 4 they give at hour one may not be the same as the 4 they give at hour six.
When you average scores across many reviewers without addressing this, you get data that looks like a clean distribution but is actually a mixture of five different personal scales layered on top of each other.
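Before aggregating, you can at least make the mixture visible. A quick per-reviewer summary (the reviewer IDs and scores below are made up to mirror the three reviewers described above) often reveals the problem at a glance:

```python
# Surface scale-usage differences by summarizing each reviewer's score
# distribution before averaging anything. Data is illustrative.
from statistics import mean, pstdev

scores_by_reviewer = {
    "reviewer_a": [3, 4, 4, 3, 4],   # conservative: never gives a 5
    "reviewer_b": [5, 5, 4, 5, 5],   # generous with 5s
    "reviewer_c": [3, 3, 3, 2, 3],   # hugs the middle of the scale
}

for reviewer, scores in scores_by_reviewer.items():
    print(f"{reviewer}: mean={mean(scores):.2f}, "
          f"spread={pstdev(scores):.2f}, max={max(scores)}")
```

Large gaps in the per-reviewer means and maxima for comparable work are a signal that reviewers are using different personal scales, which the fix below addresses.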
The fix is anchored examples, not better descriptions. Descriptions of score levels ("4: mostly correct with minor issues") do not anchor judgment reliably. Examples do.
For each score level, identify a real response that represents that score, one that you and several colleagues would agree deserves it. Write a brief explanation of why it receives that score. Build a library of these examples across your full range of task types.
When reviewers encounter a difficult rating decision, they can compare the response in front of them to the example responses in the library. The examples act as a shared calibration point across the entire reviewer team.
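The library does not need to be elaborate. A sketch of the structure, with invented response IDs and rationales standing in for real calibration data:

```python
# Minimal anchored-example library. Entries are invented placeholders;
# in practice each response_id points at a real stored response.
from dataclasses import dataclass

@dataclass(frozen=True)
class AnchorExample:
    dimension: str
    score: int
    response_id: str   # the real response reviewers compare against
    rationale: str     # why the team agreed on this score

ANCHORS = [
    AnchorExample("correctness", 4, "resp_0117",
                  "Passes all tests; one unhandled empty-input edge case."),
    AnchorExample("correctness", 2, "resp_0042",
                  "Core logic wrong for half the documented inputs."),
]

def anchors_for(dimension: str, score: int):
    """Examples a reviewer can consult when a rating decision is hard."""
    return [a for a in ANCHORS
            if a.dimension == dimension and a.score == score]
```

The rationale field matters as much as the example itself: it records the team's reasoning, so new reviewers inherit the calibration instead of reconstructing it.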
This is more work to set up. It is also the single most effective thing you can do to reduce score variance between reviewers.
Issue 5: The Invisible Biases That Inflate Certain Responses
Even well-trained reviewers with a solid rubric and anchored examples are subject to biases they are often not aware of. Two of these are particularly damaging for code evaluation.
Length bias is the tendency to score longer, more detailed responses higher, even when the extra length does not reflect extra quality. A response that includes thorough code comments, a detailed explanation, and several illustrative examples feels more complete and considered. It triggers a sense of thoroughness that shorter, tighter responses do not, even when the shorter response is technically better.
For code evaluation specifically, this bias teaches the AI that longer responses with more explanation are valued, regardless of whether the explanations are accurate or the code is correct. Models trained on this feedback learn to pad responses with content that feels substantive without necessarily being substantive.
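Length bias is also one of the easier biases to audit for, assuming you have per-response lengths and scores on hand. A hand-rolled correlation check keeps the sketch dependency-free; the lengths, scores, and warning threshold below are all illustrative:

```python
# Quick length-bias audit: does response length correlate with score?
def pearson(xs, ys):
    """Pearson correlation coefficient, standard library only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

lengths = [120, 450, 900, 300, 1500, 200]   # characters per response (made up)
scores  = [3,   4,   5,   3,   5,    2]     # reviewer scores (made up)

r = pearson(lengths, scores)
if r > 0.5:   # the threshold is a judgment call, not a standard
    print(f"warning: length and score correlate at r={r:.2f}")
```

Some correlation is expected, since longer responses are sometimes genuinely better; a strong correlation across the whole dataset is what warrants a closer look.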
Recency bias is the tendency to favor the last response seen in a comparison. In a side-by-side evaluation, the response on the right or shown second gets a slight systematic advantage. In sequential evaluation sessions, responses rated later in a batch receive slightly higher scores on average, because reviewers tend to become slightly more generous as they process more examples.
Neither bias is a character flaw. They are predictable features of how human attention and memory work. But they are systematic, which means they do not cancel out. They accumulate in the training data and produce consistent distortions.
For length bias: add a specific note to the rubric that explicitly says length is not a quality signal. More powerfully, build calibration examples that demonstrate short, high-quality responses scoring higher than long, lower-quality ones. Reviewers need to see this demonstrated, not just be told about it.
For recency bias: randomize presentation order. In pairwise comparisons, show response A first in some tasks and response B first in others. Never maintain a consistent ordering that gives one position a systematic advantage. Periodically shuffle the order in which reviewers see tasks within a session.
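The counterbalancing step is simple to build into task assignment. A sketch, assuming pairs arrive as `(response_a, response_b)` tuples (the task structure here is invented):

```python
# Randomize which response of each pair is shown first, so neither
# position gets a systematic recency advantage.
import random

def assign_presentation_order(pairs, seed=None):
    """For each (a, b) pair, flip a coin to decide display order."""
    rng = random.Random(seed)   # seed only for reproducible assignment
    tasks = []
    for a, b in pairs:
        first, second = (a, b) if rng.random() < 0.5 else (b, a)
        tasks.append({"shown_first": first, "shown_second": second})
    return tasks
```

Storing the assigned order alongside each judgment also lets you measure the recency effect directly afterward, by comparing win rates for the first and second positions.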
Why These Problems Stay Hidden
Each of these five issues has the same frustrating property: the data looks fine when you examine it. Scores fall in expected ranges. Reviewer agreement metrics look acceptable. Nothing in the raw numbers screams "this is broken."
The problems only become visible when you look at specific failure modes in the trained model and work backward, or when you run targeted analyses looking for exactly these patterns.
The practical implication is that catching these issues requires building specific quality checks into your annotation pipeline, not just reviewing the data for obvious errors.
That means testing your rubric with a pilot group before scaling, measuring agreement at the dimension level rather than just overall, auditing for double jeopardy through the defect mapping exercise described above, and specifically checking for length and recency effects in your data.
It also means treating the rubric as a system that needs maintenance, not a document you write once and archive.
The reward model you train is only as good as the judgment the rubric produced. Getting the rubric right is not the glamorous part of building AI. But it is the part that determines whether everything else was worth doing.
Writing about AI training, LLMs, and software engineering. Building AI products at Nomos Insights.