The Most Important Eval isn't on a Leaderboard

Public benchmarks have been critical to accelerating frontier model capabilities. SWE-bench, LegalBench, TAU-bench, and Cognition's Frontier Code Evals each pushed the field meaningfully forward by encoding hard tasks, rigorous verifiers, and reproducible scoring that the whole community could build on. As frontier labs hill-climb against these benchmarks, models increasingly saturate on the specific tasks, verifiers, and output distributions that those graders reward.

But for teams building agentic products, benchmark performance is only a proxy and the real question becomes:

Does the system work on the actual task distribution it sees in production?

That distribution is shaped by real users, real workflows, edge cases, and domain-specific correctness requirements that rarely exist in public datasets. The hardest failures don't show up on public leaderboards; they emerge in production, where agents interact with users at scale. That ground truth is private, and it lives in the production data. Sarah Guo called this the untrainable corner: the defensible ground where correctness exists only in private.

“The eval frameworks for each vertical are going to get crazy, crazy, crazy specific. The benchmarks that the AI labs release and how they do eval is getting completely useless for vertical companies.”

—Winston Weinberg, CEO, Harvey

Public benchmarks concentrate on the saturated head of the task distribution, where frontier models already perform well. Production traffic is far broader, shaped by real users and real workflows, and stretches into a long right tail. The hardest failures live in that tail, exactly where public leaderboards provide no coverage and where private production data is the only ground truth.

The production environment has the ground truth nobody else has

Production data is a record of how an organization works. Every interaction with an agent deployed in production generates a data point. Every correction, retry, escalation, review, and override encodes judgment: what success looks like, what tradeoffs matter, what standards are non-negotiable. Over time, that judgment becomes institutional knowledge.

Together, these signals encode the latent rubrics and domain-specific standards that define the boundary between success and failure, and they are generated continuously through real-world interaction in production. Judgment and knowledge that historically lived inside people, playbooks, and workflows can now be operationalized into evals. These evals become the mechanism through which an organization preserves, updates, and compounds its own judgment.
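As a rough illustration of what operationalizing these signals could look like, here is a minimal sketch that turns production traces carrying a human signal into candidate eval cases. The `Trace` and `EvalCase` schemas, the signal names, and `traces_to_eval_cases` are all hypothetical and chosen for illustration; a real pipeline would need its own schema and far more filtering.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical trace record; the field names are assumptions for illustration.
@dataclass
class Trace:
    task_input: str
    agent_output: str
    corrected_output: Optional[str] = None  # what a human changed the output to
    escalated: bool = False
    retries: int = 0


@dataclass
class EvalCase:
    task_input: str
    expected_output: Optional[str]  # None when no reference answer exists yet
    source_signal: str


def traces_to_eval_cases(traces: list[Trace]) -> list[EvalCase]:
    """Keep only traces that carry a judgment signal and turn them into cases."""
    cases = []
    for t in traces:
        if t.corrected_output is not None:
            # A correction records both the failure and a reference answer.
            cases.append(EvalCase(t.task_input, t.corrected_output, "correction"))
        elif t.escalated:
            # An escalation marks a failure but still needs expert labeling.
            cases.append(EvalCase(t.task_input, None, "escalation"))
        elif t.retries > 0:
            cases.append(EvalCase(t.task_input, None, "retry"))
    return cases


traces = [
    Trace("refund order 123", "Refund denied", corrected_output="Refund approved per policy"),
    Trace("cancel subscription", "Done", escalated=True),
    Trace("update address", "Updated", retries=2),
    Trace("check balance", "Balance is $40"),
]
cases = traces_to_eval_cases(traces)
for case in cases:
    print(case.source_signal, "|", case.task_input)
```

Cases without a reference answer are kept on purpose in this sketch: they flag where expert review is needed rather than pretending the ground truth is already known.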

“Companies need to turn their workflows, domain knowledge, and accumulated judgment into AI systems that improve with each use. Private evals should capture whether a model is actually improving against outcomes that matter to the business — not just external benchmarks.”

—Satya Nadella

Why building evals is still hard

If that data already exists, why is turning it into evals still so hard?

Because the raw data is hostile. Production data is noisy, incomplete, and high-volume. The signal separating a good run from a bad one is buried inside near-misses, retries, ambiguous outcomes, and partial failures.

Expert judgment does not scale. The people who can reliably evaluate correctness are often the most expensive and constrained domain experts in the organization.

And most current tooling is fragmented, or it stops at storing traces or running evals. It does not continuously mine production data to generate and maintain those evals as the capability boundary moves.

This creates predictable gaps: model upgrades ship without regression coverage, quality drifts silently, and teams lose visibility into how behavior changes as prompts, tools, retrieval, and orchestration evolve.

Manoj Soundararajan

Product @Decagon

In production, the real challenge is making agents reliable across the long tail of constraints and user behavior. NeoSigma is addressing this by catching regressions, debugging failures, and maintaining evaluations and reliability as systems evolve and user behaviors drift.

Today's frontier is tomorrow's baseline

An eval suite maps the performance boundary: what an agent reliably handles versus where it fails. But that boundary is never static. It shifts as:

Models improve. A new model version changes what the system can do, and with it what counts as trivial and what counts as hard.
The product evolves. New features, tools, and workflows introduce entirely new task distributions that existing evals were never designed to measure.
User expectations rise. As capabilities improve and agents absorb more work, the definition of a "successful run" changes with them — users expect more done, and done more reliably.

Frontier capabilities, product surfaces, and customer expectations all move together, and so must the evals.

An eval system has two types of targets, and tasks continuously move from one to the other.

Regression evals capture tasks the agent should already handle reliably. These are the workflows where a drop in performance is an immediate alarm. This should be a compact set of high-signal tests that runs on every model, prompt, or harness change to catch regressions before they reach users.
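A regression gate of this kind might be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: `regression_gate`, the `(input, checker)` case format, the baseline value, and the stubbed `candidate_agent` are all hypothetical.

```python
from typing import Callable

# Each case pairs a task input with a checker that returns True
# when the agent's output is acceptable (hypothetical format).
Case = tuple[str, Callable[[str], bool]]


def regression_gate(
    cases: list[Case],
    agent: Callable[[str], str],
    baseline_pass_rate: float,
    tolerance: float = 0.0,
) -> tuple[bool, float]:
    """Run the compact regression set and compare against the stored baseline."""
    if not cases:
        raise ValueError("regression set is empty")
    passed = sum(1 for task_input, check in cases if check(agent(task_input)))
    pass_rate = passed / len(cases)
    # The change is allowed only if it does not fall below the baseline.
    return pass_rate >= baseline_pass_rate - tolerance, pass_rate


cases = [
    ("reset password", lambda out: "reset link" in out),
    ("refund status", lambda out: "refund" in out),
]


def candidate_agent(task_input: str) -> str:
    # Stand-in for the agent under test after a model or prompt change.
    canned = {
        "reset password": "A reset link was sent.",
        "refund status": "I cannot help with that.",
    }
    return canned[task_input]


ok, pass_rate = regression_gate(cases, candidate_agent, baseline_pass_rate=1.0)
print("ship" if ok else "block", pass_rate)
```

Keeping the set small is what makes it practical to run on every change; the checkers here are simple substring tests only to keep the sketch short.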

Frontier Capability evals capture the frontier: hard tasks the agent still struggles with today. These represent the current ceiling of what the system is trying to improve toward.

As the system hill-climbs on the frontier, each frontier task it masters becomes part of the strict baseline of what the system must reliably handle. Because quality is measured against the actual production boundary, this pushes the system toward higher reliability and broader capability.
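The conversion from frontier target to baseline can be sketched as a promotion rule. The `EvalTask` type, `promote_stable_tasks`, and the thresholds (five consecutive recent runs, all passing) are assumptions made for illustration; the right criteria depend on the product and on how noisy the graders are.

```python
from dataclasses import dataclass, field


@dataclass
class EvalTask:
    name: str
    # Outcomes of the most recent runs, oldest first (True = pass).
    recent_results: list[bool] = field(default_factory=list)


def promote_stable_tasks(
    frontier: list[EvalTask],
    regression: list[EvalTask],
    min_runs: int = 5,
    min_pass_rate: float = 1.0,
) -> tuple[list[EvalTask], list[EvalTask]]:
    """Move frontier tasks that now pass reliably into the regression set."""
    still_frontier, promoted = [], []
    for task in frontier:
        window = task.recent_results[-min_runs:]
        stable = (
            len(window) >= min_runs
            and sum(window) / len(window) >= min_pass_rate
        )
        (promoted if stable else still_frontier).append(task)
    # Return new lists rather than mutating the inputs.
    return still_frontier, regression + promoted


frontier = [
    EvalTask("multi-step refund", [True] * 5),
    EvalTask("contract redline", [True, False, True, True, True]),
]
regression = [EvalTask("password reset", [True] * 5)]

frontier, regression = promote_stable_tasks(frontier, regression)
print([t.name for t in frontier])
print([t.name for t in regression])
```

Once promoted, a task is held to the regression standard: a later failure on it is treated as an alarm rather than as expected frontier noise.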

The compounding effect: the data flywheel

Every production interaction generates new traces, feedback, and outcomes. Those interactions expose failures, edge cases, and implicit user expectations that no benchmark could have predicted. Over time, the eval suite becomes a continuously growing, proprietary memory of everything the system has learned not to fail at.

Good evals lead to better systems.
Better quality unlocks more trust.
More trust unlocks more usage.
Expanded usage generates even richer data.
And the loop repeats.

This learning loop continuously encodes the organization's workflows, interaction data, and domain-specific judgment into systems that keep improving. It drives the outcomes a business cares about: higher task completion rates, fewer human escalations, lower correction rates, faster resolution times, and stronger user retention.
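To make a few of these outcomes concrete, here is a minimal sketch that computes three of them from per-run records. The `Run` fields and `outcome_metrics` are hypothetical, and how a run gets labeled as completed, escalated, or corrected is an assumption left to the product.

```python
from dataclasses import dataclass


# Hypothetical per-run record; the labeling rules are product-specific.
@dataclass
class Run:
    completed: bool   # the task reached a successful end state
    escalated: bool   # a human had to take over
    corrected: bool   # a human edited or overrode the agent's output


def outcome_metrics(runs: list[Run]) -> dict[str, float]:
    """Aggregate simple business-facing rates over a batch of runs."""
    n = len(runs)
    if n == 0:
        return {}
    return {
        "task_completion_rate": sum(r.completed for r in runs) / n,
        "escalation_rate": sum(r.escalated for r in runs) / n,
        "correction_rate": sum(r.corrected for r in runs) / n,
    }


runs = [
    Run(completed=True, escalated=False, corrected=False),
    Run(completed=True, escalated=False, corrected=True),
    Run(completed=True, escalated=False, corrected=True),
    Run(completed=False, escalated=True, corrected=False),
]
metrics = outcome_metrics(runs)
print(metrics)
```

Tracking these rates per release, next to eval pass rates, is one way to check whether eval gains actually show up in production outcomes.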

“In my view, our priority has to be building a frontier ecosystem, not just a frontier model, so value flows broadly across every company, every industry, and every country. One where every organization can own the learning loop that encodes its institutional knowledge, compounding its human and token capital.”

—Satya Nadella

The strongest agentic products will not just have the best models. They will have the fastest data flywheels. Read more about our self-improving agentic loops for turning production data into compounding intelligence.

We are already serving in production. Sign up for our product waitlist here!

Acknowledgements

Special thanks to Shyamal Anadkat (ex-OpenAI) and the NeoSigma team for reviewing and providing valuable feedback on this blog post.
