Evaluation Dataset Design: Targeting 40-70% Pass Rates for Meaningful Discrimination
A benchmark where every model scores above 90% cannot tell you which model is actually better. Neither can one where every model fails. The most informative benchmarks live in a specific difficulty range, and hitting that range takes more deliberate design than most people realize.

Imagine a school where every student scores between 95 and 98 on every exam.
At first glance, this sounds like a good problem to have. High scores all around. But if your job is to decide which students get into a competitive program with 50 spots, those exams are completely useless to you. You cannot tell anyone apart.
Now imagine the opposite: a school where the highest score anyone achieves is 8 out of 100. The exam is so impossibly hard that everyone is effectively guessing. The scores tell you nothing because they are dominated by noise.
The useful exam lives between these extremes. It is hard enough that not everyone aces it. Easy enough that strong students clearly outperform weak ones. The scores spread out in a way that reflects actual differences in ability.
AI benchmarks work exactly the same way. And yet most evaluation datasets are not deliberately designed with this in mind, which is why so many of them become useless within a year or two of release.
This post covers what benchmark difficulty calibration is, why it matters, and how to do it properly.
Why Pass Rate Affects What a Benchmark Can Tell You
Here is the core idea, and it helps to think about it without any equations.
Every test result is a signal. If a model passes a task, that is information. If it fails, that is also information. The question is: how much information does each result give you?
When a task has a very high pass rate, say 95% of models pass it, then passing tells you almost nothing. Of course this model passed it. Almost everything does. The only genuinely informative result is a failure, but failures are rare. Most tasks produce a "pass" that carries almost no signal.
When a task has a moderate pass rate, say 50%, then both passing and failing tell you something meaningful. A pass means the model handled a task that half of all models do not handle. A fail means the model missed something that half of models manage to get right. Both results are informative.
At scale, this plays out directly in how well you can rank models. With a benchmark where the average pass rate is 90%, your effective ranking information is coming almost entirely from the 10% of tasks where models sometimes fail. You are using one-tenth of your dataset to do all your discrimination work.
With a benchmark calibrated to 50% average pass rate, every task is contributing to your ability to distinguish models. Your discrimination power is much higher for the same dataset size.
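The intuition about informative results can be made precise with Shannon entropy: a single pass/fail outcome carries the most information when the pass rate is near 50%, and almost none near the extremes. A minimal sketch (the example pass rates are illustrative):

```python
import math

def result_entropy(pass_rate: float) -> float:
    """Shannon entropy (in bits) of a single pass/fail result."""
    p = pass_rate
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.05, 0.50, 0.95):
    print(f"pass rate {p:.0%}: {result_entropy(p):.2f} bits per task")
# pass rate 5%: 0.29 bits per task
# pass rate 50%: 1.00 bits per task
# pass rate 95%: 0.29 bits per task
```

A task at 50% pass rate yields more than three times the information per result of one at 95%, which is the formal version of "you are using one-tenth of your dataset to do all your discrimination work."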
The 40 to 70 percent range is the practical sweet spot. Below 40%, you risk the benchmark being dominated by tasks so hard that even the best models are mostly guessing, which produces unreliable rankings. Above 70%, you are approaching the ceiling problem. The 40-70 range keeps scores spread across a meaningful portion of the scale, which is where rankings are most reliable.
The Two Ways Benchmarks Fail
Before getting into how to build a well-calibrated benchmark, it helps to understand the two failure modes with concrete examples.
The Benchmark That Got Too Easy
SWE-bench, the benchmark for AI coding agents that we discussed in an earlier post, is a useful illustration of a benchmark that was correctly calibrated at launch.
When it was released in 2023, the best AI models solved somewhere in the range of 2 to 3 percent of tasks. The benchmark was extremely hard relative to the state of the technology at the time. That was appropriate: it gave the field a clear target to aim at and a way to measure genuine progress over time.
Since then, pass rates have increased substantially as models have improved. The benchmark has gone from revealing a wide capability gap to providing more compressed scores at the top of the distribution. The research community has responded by creating harder variants of SWE-bench specifically to restore discrimination power.
This is not a failure of the original benchmark. It is the natural lifecycle of a well-designed one. The problem comes when teams do not notice the saturation happening and keep using a benchmark as if it still has the same discriminative power it once did.
The Benchmark That Is Too Hard
The opposite failure is discussed less often but is equally real. A benchmark calibrated to a 5% pass rate mostly measures noise.
At that pass rate, the difference between a model that scores 4% and one that scores 6% might be entirely explained by which specific tasks happened to appear in the test set. The statistical confidence intervals for scores overlap almost everywhere. You cannot reliably rank models against each other because the measurement error is larger than the real differences you are trying to detect.
Excessively hard benchmarks also produce an unhelpful research signal. If nearly everything fails, you cannot identify which approaches are most promising or where models most need improvement. The failure mode is too uniform.
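The overlap claim is easy to check with the normal-approximation standard error for a binomial proportion. On a hypothetical 300-task benchmark, the 95% intervals around scores of 4% and 6% overlap substantially, so the 2-point gap is not a reliable ranking signal (the benchmark size and scores are illustrative):

```python
import math

def pass_rate_ci(pass_rate: float, n_tasks: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for an observed pass rate."""
    se = math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)
    return (pass_rate - z * se, pass_rate + z * se)

# Two models on a hypothetical 300-task benchmark
ci_a = pass_rate_ci(0.04, 300)  # model scoring 4%
ci_b = pass_rate_ci(0.06, 300)  # model scoring 6%
print(ci_a, ci_b)
print("intervals overlap:", ci_a[1] > ci_b[0])  # intervals overlap: True
```

The same arithmetic at a 50% pass rate would show well-separated intervals for much smaller score gaps, which is another way of seeing why mid-range pass rates give reliable rankings.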
Designing for the Right Difficulty Range
Good calibration does not happen by accident. It requires deliberate design choices and empirical validation.
Step 1: Define who the benchmark is for
The 40 to 70 percent pass rate target is relative to a reference population of models. You need to decide what that population is before you start.
If you are building a benchmark for frontier models, today's most capable systems, your calibration target is the current frontier. Tasks should be hard enough that frontier models do not trivially pass them, but varied enough that frontier models solve a meaningful fraction.
If you are building a benchmark for a specific tier of deployed systems, production coding assistants used by real developers, for example, calibrate relative to the typical capabilities of systems in that tier.
Write this down explicitly before building anything. The target population shapes every subsequent decision about what counts as appropriate difficulty.
Step 2: Run a pilot before scaling
Before building hundreds of tasks, build 50 to 100 tasks and run them across several models.
Look at the pass rate distribution. If every model is scoring above 80%, the tasks are too easy. If every model is scoring below 20%, the tasks are too hard. If the scores are spread in the 40 to 70 percent range, your difficulty calibration is approximately right.
The pilot is cheap relative to the cost of building a full dataset and discovering afterward that it cannot discriminate between models. Run it before scaling.
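The pilot triage described above can be a few lines of code: given each model's pass rate over the pilot tasks, flag whether the set is too easy, too hard, or roughly in range. The thresholds follow the text; the model names and rates are made up:

```python
def assess_pilot(model_pass_rates: dict[str, float]) -> str:
    """Rough triage of a pilot run: are the tasks calibrated for this population?"""
    rates = model_pass_rates.values()
    if min(rates) > 0.80:
        return "too easy: every model scores above 80%"
    if max(rates) < 0.20:
        return "too hard: every model scores below 20%"
    return "roughly calibrated: scores spread across a usable range"

# Hypothetical pilot results across several models
pilot = {"model-a": 0.62, "model-b": 0.48, "model-c": 0.41, "model-d": 0.55}
print(assess_pilot(pilot))  # roughly calibrated: scores spread across a usable range
```

A real pilot analysis would also look at per-task pass rates, not just per-model aggregates, but this is the first check worth automating.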
Step 3: Mix difficulty tiers intentionally
Not every task in a benchmark needs to target the same difficulty level. In fact, a mix of difficulty levels is usually better than uniform difficulty, because it lets your benchmark be useful across a wider capability range.
A practical split might look like this:
Easier tasks, where you expect 70 to 90 percent of your target models to pass: These anchor the low end of the capability distribution and help distinguish between models that are significantly below average. They also ensure that even weaker models have tasks where they succeed, which makes the score distribution more informative overall.
Medium tasks, where you expect 40 to 70 percent to pass: These are the core of the benchmark. They provide the most discrimination among capable models and are where most of the useful signal lives.
Hard tasks, where you expect only 5 to 25 percent to pass: These track performance at the current frontier and help distinguish leading models from each other. They are also what preserves the benchmark's value as the field improves, because the hard tasks take longer to saturate.
A rough even split across these three tiers tends to produce an aggregate pass rate in the 40 to 70 percent range for models near the middle of your target distribution.
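The even-split claim can be sanity-checked with a weighted average, taking the midpoint of each tier's expected pass range from the list above (the midpoints are my assumption, not a figure from the text):

```python
# Expected pass-rate midpoints per tier, from the ranges above:
# easier 70-90%, medium 40-70%, hard 5-25%
tier_midpoints = {"easy": 0.80, "medium": 0.55, "hard": 0.15}

def expected_aggregate(tier_weights: dict[str, float]) -> float:
    """Weighted average pass rate for a model near the middle of the population."""
    return sum(tier_weights[t] * tier_midpoints[t] for t in tier_weights)

even_split = {"easy": 1 / 3, "medium": 1 / 3, "hard": 1 / 3}
print(f"{expected_aggregate(even_split):.0%}")  # 50%, inside the 40-70% target
```

The same function lets you experiment with skewed mixes: weighting the hard tier at 80%, as in the audit example in the next step, pulls the expected aggregate well below 40%.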
Step 4: Tag difficulty during construction
As you build tasks, record your expected difficulty tier for each one based on your domain judgment or pilot data. This serves multiple purposes.
Before the benchmark is finalized, it lets you audit the distribution. If you look at your tagged dataset and find 80% of tasks in the hard tier, you will get an aggregate pass rate that is too low.
After evaluation is run, it lets you compute separate scores by difficulty tier. A model that excels on hard tasks but underperforms on easy ones has a different capability profile than one that is consistent across tiers. That information is genuinely useful and is lost if you only report an overall score.
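With tiers tagged on each task, the per-tier breakdown is a few lines. A sketch, assuming a hypothetical record shape with a `tier` tag and a `passed` outcome per task:

```python
from collections import Counter

# Hypothetical evaluation records: each task carries its tagged tier and outcome
results = [
    {"tier": "easy", "passed": True},
    {"tier": "easy", "passed": True},
    {"tier": "medium", "passed": True},
    {"tier": "medium", "passed": False},
    {"tier": "hard", "passed": False},
    {"tier": "hard", "passed": False},
]

def score_by_tier(results):
    """Per-tier pass rates, so capability profiles are not lost in one overall score."""
    passed, total = Counter(), Counter()
    for r in results:
        total[r["tier"]] += 1
        passed[r["tier"]] += r["passed"]  # True counts as 1, False as 0
    return {tier: passed[tier] / total[tier] for tier in total}

print(score_by_tier(results))  # {'easy': 1.0, 'medium': 0.5, 'hard': 0.0}
```

Reporting these three numbers alongside the overall score is what distinguishes a model that is consistent across tiers from one with an unusual capability profile.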
Handling Benchmark Saturation Over Time
Every benchmark has a lifespan. As models improve, pass rates rise, and discrimination power decreases.
The right way to think about this is not as a problem with the benchmark but as evidence that it served its purpose and needs updating.
Monitor scores across evaluation cycles. If model scores that used to spread from 30% to 70% are now compressed into the 55 to 75 percent range, your benchmark is starting to saturate. The compression is the warning sign, not the absolute level of the top score.
Plan for regular refreshes. A benchmark is not a fixed artifact. New tasks need to be added periodically to maintain difficulty calibration as the capability distribution shifts. Plan for this from the beginning rather than treating the initial dataset as final.
Keep a portion private. For benchmarks used in public leaderboards, keeping a test split private and rotating in new private tasks regularly prevents models from being specifically optimized for the published test cases. This is standard practice for well-run benchmarks.
Version explicitly. When you refresh a benchmark, release it as a new version with documentation of what changed and why. This allows comparison of results over time while acknowledging that the reference standard has evolved.
One Thing That Can Undermine All of This
All of the difficulty calibration work above assumes that your test results are reliable. If they are not, pass rates are measuring something other than model capability, and the calibration work is meaningless.
The most common reliability problem is test flakiness: tests that sometimes pass and sometimes fail on the same code, due to randomness, timing dependencies, or network calls.
Before using pilot pass rates for difficulty calibration, verify that your tests are deterministic. Run each task multiple times without any model changes and confirm identical results. Any task with inconsistent results needs to be fixed or removed before it can contaminate your calibration data.
A flaky test that fails 20% of the time does not reflect a task with 20% pass rate. It reflects a measurement that cannot be trusted.
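Determinism can be checked by running each task several times with no model changes and flagging anything inconsistent. A sketch, where `run_task` is a hypothetical harness callable and the flaky task is simulated with randomness:

```python
import random

def find_flaky_tasks(tasks, run_task, repeats: int = 5):
    """Run each task `repeats` times with no model changes; return tasks whose
    results are inconsistent and must be fixed or removed before calibration."""
    flaky = []
    for task in tasks:
        outcomes = {run_task(task) for _ in range(repeats)}
        if len(outcomes) > 1:  # same input, different results: not deterministic
            flaky.append(task)
    return flaky

# Hypothetical harness: task "t2" passes or fails depending on a random draw
def run_task(task):
    if task == "t2":
        return random.random() > 0.5  # flaky: result depends on randomness
    return True

random.seed(0)
print(find_flaky_tasks(["t1", "t2", "t3"], run_task))  # ['t2']
```

Five repeats is a floor, not a recommendation: a test that fails one time in twenty will usually slip past five runs, so the repeat count should scale with how much flakiness you can tolerate in the calibration data.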
Why This Matters Beyond the Technical Details
The reason difficulty calibration matters ultimately comes back to what benchmarks are for.
A benchmark is supposed to tell you whether the AI system you are considering is better or worse, and by how much, than the alternatives. This is a genuinely consequential question for teams deciding which models to deploy, which research directions to pursue, and whether a new training approach actually produced improvement.
A saturated benchmark cannot answer this question. Neither can a benchmark that is too hard. Both produce numbers that feel like measurements but do not carry the information you need.
The 40 to 70 percent pass rate target, the mix of difficulty tiers, the pilot testing before scaling, the monitoring for saturation over time: these are not bureaucratic requirements. They are what it takes to build an evaluation instrument that keeps working as models and the field evolve.
Building a good benchmark is not glamorous work. It is careful, empirical, and iterative. But it is the foundation that makes everything else in the evaluation process meaningful.
A score only means something when the measurement was designed to produce a score that means something.
Writing about AI training, LLMs, and software engineering. Building AI products at Nomos Insights.