Evaluation Dataset Design: Targeting 40-70% Pass Rates for Meaningful Discrimination
A benchmark where every model scores above 90% cannot tell you which model is actually better. Neither can one where every model fails. The most informative benchmarks live in a specific difficulty range, and hitting that range takes more deliberate design than most people realize.

Imagine a school where every student scores between 95 and 98 on every exam.
At first glance, this sounds like a good problem to have. High scores all around. But if your job is to decide which students get into a competitive program with 50 spots, those exams are completely useless to you. You cannot tell anyone apart.
Now imagine the opposite: a school where the highest score anyone achieves is 8 out of 100. The exam is so impossibly hard that everyone is effectively guessing. The scores tell you nothing because they are dominated by noise.
The useful exam lives between these extremes. It is hard enough that not everyone aces it. Easy enough that strong students clearly outperform weak ones. The scores spread out in a way that reflects actual differences in ability.
AI benchmarks work exactly the same way. And yet most evaluation datasets are not deliberately designed with this in mind, which is why so many of them become useless within a year or two of release.
This post covers what benchmark difficulty calibration is, why it matters, and how to do it properly.
Why Pass Rate Affects What a Benchmark Can Tell You
Here is the core idea, and it helps to think about it without any equations.
Every test result is a signal. If a model passes a task, that is information. If it fails, that is also information. The question is: how much information does each result give you?
When a task has a very high pass rate, say 95% of models pass it, then passing tells you almost nothing. Of course this model passed it. Almost everything does. The only genuinely informative result is a failure, but failures are rare. Most tasks produce a "pass" that carries almost no signal.
When a task has a moderate pass rate, say 50%, then both passing and failing tell you something meaningful. A pass means the model handled a task that half of all models do not handle. A fail means the model missed something that half of models manage to get right. Both results are informative.
At scale, this plays out directly in how well you can rank models. With a benchmark where the average pass rate is 90%, your effective ranking information is coming almost entirely from the 10% of tasks where models sometimes fail. You are using one-tenth of your dataset to do all your discrimination work.
With a benchmark calibrated to 50% average pass rate, every task is contributing to your ability to distinguish models. Your discrimination power is much higher for the same dataset size.
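The intuition about informative results can be made precise with Shannon entropy: a single pass/fail outcome carries the most information when the pass rate is near 50%, and almost none near the extremes. A minimal sketch (the example pass rates are illustrative):

```python
import math

def result_entropy(pass_rate: float) -> float:
    """Shannon entropy (in bits) of a single pass/fail result."""
    p = pass_rate
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.05, 0.50, 0.95):
    print(f"pass rate {p:.0%}: {result_entropy(p):.2f} bits per task")
# pass rate 5%: 0.29 bits per task
# pass rate 50%: 1.00 bits per task
# pass rate 95%: 0.29 bits per task
```

A task at 50% pass rate yields more than three times the information per result of one at 95%, which is the formal version of "you are using one-tenth of your dataset to do all your discrimination work."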
The 40 to 70 percent range is the practical sweet spot. Below 40%, you risk the benchmark being dominated by tasks so hard that even the best models are mostly guessing, which produces unreliable rankings. Above 70%, you are approaching the ceiling problem. The 40-70 range keeps scores spread across a meaningful portion of the scale, which is where rankings are most reliable.
The Two Ways Benchmarks Fail
Before getting into how to build a well-calibrated benchmark, it helps to understand the two failure modes with concrete examples.
The Benchmark That Got Too Easy
SWE-bench, the benchmark for AI coding agents that we discussed in an earlier post, is a useful illustration of a benchmark that was correctly calibrated at launch.
When it was released in 2023, the best AI models solved somewhere in the range of 2 to 3 percent of tasks. The benchmark was extremely hard relative to the state of the technology at the time. That was appropriate: it gave the field a clear target to aim at and a way to measure genuine progress over time.
Since then, pass rates have increased substantially as models have improved. The benchmark has gone from revealing a wide capability gap to providing more compressed scores at the top of the distribution. The research community has responded by creating harder variants of SWE-bench specifically to restore discrimination power.
This is not a failure of the original benchmark. It is the natural lifecycle of a well-designed one. The problem comes when teams do not notice the saturation happening and keep using a benchmark as if it still has the same discriminative power it once did.
The Benchmark That Is Too Hard
The opposite failure is discussed less often but is equally real. A benchmark calibrated to a 5% pass rate mostly measures noise.
At that pass rate, the difference between a model that scores 4% and one that scores 6% might be entirely explained by which specific tasks happened to appear in the test set. The statistical confidence intervals for scores overlap almost everywhere. You cannot reliably rank models against each other because the measurement error is larger than the real differences you are trying to detect.
Excessively hard benchmarks also produce an unhelpful research signal. If nearly everything fails, you cannot identify which approaches are most promising or where models most need improvement. The failure mode is too uniform.
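The overlap claim is easy to check with the normal-approximation standard error for a binomial proportion. On a hypothetical 300-task benchmark, the 95% intervals around scores of 4% and 6% overlap substantially, so the 2-point gap is not a reliable ranking signal (the benchmark size and scores are illustrative):

```python
import math

def pass_rate_ci(pass_rate: float, n_tasks: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for an observed pass rate."""
    se = math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)
    return (pass_rate - z * se, pass_rate + z * se)

# Two models on a hypothetical 300-task benchmark
ci_a = pass_rate_ci(0.04, 300)  # model scoring 4%
ci_b = pass_rate_ci(0.06, 300)  # model scoring 6%
print(ci_a, ci_b)
print("intervals overlap:", ci_a[1] > ci_b[0])  # intervals overlap: True
```

The same arithmetic at a 50% pass rate would show well-separated intervals for much smaller score gaps, which is another way of seeing why mid-range pass rates give reliable rankings.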
Designing for the Right Difficulty Range
Good calibration does not happen by accident. It requires deliberate design choices and empirical validation.
Step 1: Define who the benchmark is for
The 40 to 70 percent pass rate target is relative to a reference population of models. You need to decide what that population is before you start.
If you are building a benchmark for frontier models, today's most capable systems, your calibration target is the current frontier. Tasks should be hard enough that frontier models do not trivially pass them, but varied enough that frontier models solve a meaningful fraction.
If you are building a benchmark for a specific tier of deployed systems, production coding assistants used by real developers, for example, calibrate relative to the typical capabilities of systems in that tier.
Write this down explicitly before building anything. The target population shapes every subsequent decision about what counts as appropriate difficulty.
Step 2: Run a pilot before scaling
Before building hundreds of tasks, build 50 to 100 tasks and run them across several models.
Look at the pass rate distribution. If every model is scoring above 80%, the tasks are too easy. If every model is scoring below 20%, the tasks are too hard. If the scores are spread in the 40 to 70 percent range, your difficulty calibration is approximately right.
The pilot is cheap relative to the cost of building a full dataset and discovering afterward that it cannot discriminate between models. Run it before scaling.
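The pilot triage described above can be a few lines of code: given each model's pass rate over the pilot tasks, flag whether the set is too easy, too hard, or roughly in range. The thresholds follow the text; the model names and rates are made up:

```python
def assess_pilot(model_pass_rates: dict[str, float]) -> str:
    """Rough triage of a pilot run: are the tasks calibrated for this population?"""
    rates = model_pass_rates.values()
    if min(rates) > 0.80:
        return "too easy: every model scores above 80%"
    if max(rates) < 0.20:
        return "too hard: every model scores below 20%"
    return "roughly calibrated: scores spread across a usable range"

# Hypothetical pilot results across several models
pilot = {"model-a": 0.62, "model-b": 0.48, "model-c": 0.41, "model-d": 0.55}
print(assess_pilot(pilot))  # roughly calibrated: scores spread across a usable range
```

A real pilot analysis would also look at per-task pass rates, not just per-model aggregates, but this is the first check worth automating.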
Step 3: Mix difficulty tiers intentionally
Not every task in a benchmark needs to target the same difficulty level. In fact, a mix of difficulty levels is usually better than uniform difficulty, because it lets your benchmark be useful across a wider capability range.
A practical split might look like this:
Easier tasks, where you expect 70 to 90 percent of your target models to pass: These anchor the low end of the capability distribution and help distinguish between models that are significantly below average. They also ensure that even weaker models have tasks where they succeed, which makes the score distribution more informative overall.
Medium tasks, where you expect 40 to 70 percent to pass: These are the core of the benchmark. They provide the most discrimination among capable models and are where most of the useful signal lives.
Hard tasks, where you expect only 5 to 25 percent to pass: These track performance at the current frontier and help distinguish leading models from each other. They are also what preserves the benchmark's value as the field improves, because the hard tasks take longer to saturate.
A rough even split across these three tiers tends to produce an aggregate pass rate in the 40 to 70 percent range for models near the middle of your target distribution.
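The even-split claim can be sanity-checked with a weighted average, taking the midpoint of each tier's expected pass range from the list above (the midpoints are my assumption, not a figure from the text):

```python
# Expected pass-rate midpoints per tier, from the ranges above:
# easier 70-90%, medium 40-70%, hard 5-25%
tier_midpoints = {"easy": 0.80, "medium": 0.55, "hard": 0.15}

def expected_aggregate(tier_weights: dict[str, float]) -> float:
    """Weighted average pass rate for a model near the middle of the population."""
    return sum(tier_weights[t] * tier_midpoints[t] for t in tier_weights)

even_split = {"easy": 1 / 3, "medium": 1 / 3, "hard": 1 / 3}
print(f"{expected_aggregate(even_split):.0%}")  # 50%, inside the 40-70% target
```

The same function lets you experiment with skewed mixes: weighting the hard tier at 80%, as in the audit example in the next step, pulls the expected aggregate well below 40%.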
Step 4: Tag difficulty during construction
As you build tasks, record your expected difficulty tier for each one based on your domain judgment or pilot data. This serves multiple purposes.
Before the benchmark is finalized, it lets you audit the distribution. If you look at your tagged dataset and find 80% of tasks in the hard tier, you will get an aggregate pass rate that is too low.
After evaluation is run, it lets you compute separate scores by difficulty tier. A model that excels on hard tasks but underperforms on easy ones has a different capability profile than one that is consistent across tiers. That information is genuinely useful and is lost if you only report an overall score.
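With tiers tagged on each task, the per-tier breakdown is a few lines. A sketch, assuming a hypothetical record shape with a `tier` tag and a `passed` outcome per task:

```python
from collections import Counter

# Hypothetical evaluation records: each task carries its tagged tier and outcome
results = [
    {"tier": "easy", "passed": True},
    {"tier": "easy", "passed": True},
    {"tier": "medium", "passed": True},
    {"tier": "medium", "passed": False},
    {"tier": "hard", "passed": False},
    {"tier": "hard", "passed": False},
]

def score_by_tier(results):
    """Per-tier pass rates, so capability profiles are not lost in one overall score."""
    passed, total = Counter(), Counter()
    for r in results:
        total[r["tier"]] += 1
        passed[r["tier"]] += r["passed"]  # True counts as 1, False as 0
    return {tier: passed[tier] / total[tier] for tier in total}

print(score_by_tier(results))  # {'easy': 1.0, 'medium': 0.5, 'hard': 0.0}
```

Reporting these three numbers alongside the overall score is what distinguishes a model that is consistent across tiers from one with an unusual capability profile.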
Handling Benchmark Saturation Over Time
Every benchmark has a lifespan. As models improve, pass rates rise, and discrimination power decreases.
The right way to think about this is not as a problem with the benchmark but as evidence that it served its purpose and needs updating.
Monitor scores across evaluation cycles. If model scores that used to spread from 30% to 70% are now compressed into the 55 to 75 percent range, your benchmark is starting to saturate. The compression is the warning sign, not the absolute level of the top score.
Plan for regular refreshes. A benchmark is not a fixed artifact. New tasks need to be added periodically to maintain difficulty calibration as the capability distribution shifts. Plan for this from the beginning rather than treating the initial dataset as final.
Keep a portion private. For benchmarks used in public leaderboards, keeping a test split private and rotating in new private tasks regularly prevents models from being specifically optimized for the published test cases. This is standard practice for well-run benchmarks.
Version explicitly. When you refresh a benchmark, release it as a new version with documentation of what changed and why. This allows comparison of results over time while acknowledging that the reference standard has evolved.
One Thing That Can Undermine All of This
All of the difficulty calibration work above assumes that your test results are reliable. If they are not, pass rates are measuring something other than model capability, and the calibration work is meaningless.
The most common reliability problem is test flakiness: tests that sometimes pass and sometimes fail on the same code, due to randomness, timing dependencies, or network calls.
Before using pilot pass rates for difficulty calibration, verify that your tests are deterministic. Run each task multiple times without any model changes and confirm identical results. Any task with inconsistent results needs to be fixed or removed before it can contaminate your calibration data.
A flaky test that fails 20% of the time does not reflect a task with 20% pass rate. It reflects a measurement that cannot be trusted.
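Determinism can be checked by running each task several times with no model changes and flagging anything inconsistent. A sketch, where `run_task` is a hypothetical harness callable and the flaky task is simulated with randomness:

```python
import random

def find_flaky_tasks(tasks, run_task, repeats: int = 5):
    """Run each task `repeats` times with no model changes; return tasks whose
    results are inconsistent and must be fixed or removed before calibration."""
    flaky = []
    for task in tasks:
        outcomes = {run_task(task) for _ in range(repeats)}
        if len(outcomes) > 1:  # same input, different results: not deterministic
            flaky.append(task)
    return flaky

# Hypothetical harness: task "t2" passes or fails depending on a random draw
def run_task(task):
    if task == "t2":
        return random.random() > 0.5  # flaky: result depends on randomness
    return True

random.seed(0)
print(find_flaky_tasks(["t1", "t2", "t3"], run_task))  # ['t2']
```

Five repeats is a floor, not a recommendation: a test that fails one time in twenty will usually slip past five runs, so the repeat count should scale with how much flakiness you can tolerate in the calibration data.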
Why This Matters Beyond the Technical Details
The reason difficulty calibration matters ultimately comes back to what benchmarks are for.
A benchmark is supposed to tell you whether the AI system you are considering is better or worse, and by how much, than the alternatives. This is a genuinely consequential question for teams deciding which models to deploy, which research directions to pursue, and whether a new training approach actually produced improvement.
A saturated benchmark cannot answer this question. Neither can a benchmark that is too hard. Both produce numbers that feel like measurements but do not carry the information you need.
The 40 to 70 percent pass rate target, the mix of difficulty tiers, the pilot testing before scaling, the monitoring for saturation over time: these are not bureaucratic requirements. They are what it takes to build an evaluation instrument that keeps working as models and the field evolve.
Building a good benchmark is not glamorous work. It is careful, empirical, and iterative. But it is the foundation that makes everything else in the evaluation process meaningful.
A score only means something when the measurement was designed to produce a score that means something.
Writing about AI training, LLMs, and software engineering. Building AI products at Nomos Insights.