Towards More Standardized AI Evaluation: From Models to Agents
The question driving AI evaluation has shifted. It is no longer "How good is the model?" but "Can we trust the system to behave as intended, under change, at scale?" In our paper, we argue that most evaluation practices remain anchored in assumptions from the model-centric era — static benchmarks, aggregate scores, one-off success criteria — and that these approaches increasingly obscure rather than illuminate system behavior.
This blog post summarizes the key ideas. For the full treatment, including regulatory analysis, detailed challenge breakdowns, and practical annexes, read the full paper on arXiv.
Evaluation Is Not a Checkpoint
We evaluate machines not because we distrust them, but because we depend on them. An aircraft is not trusted because it flies once, but because it flies repeatedly, within tolerances, across conditions.
For classical AI, evaluation asked: Is this output correct? For agents — systems that reason over multiple steps, invoke tools, and maintain state — evaluation must ask: Did the system behave correctly over time, under uncertainty, given its constraints?
This shift has three consequences:
- Variability becomes a signal, not noise. An agent that succeeds once but fails intermittently may be unacceptable, even if its average score looks strong.
- Environment matters. A failure may reflect an ambiguous task or a broken harness rather than a capability gap.
- Evaluation becomes inseparable from engineering. Prompt formats, inference settings, and execution context are part of what is being measured.
The practitioner consensus is clear: evaluations are the real moat. Models, prompts, and data are commoditized. The ability to systematically measure performance is the differentiator between a demo and a product.
Figure 1 — Risk Assessment and Evaluation as Part of AI Governance
Three Evaluation Objectives People Keep Conflating
Much confusion around benchmarks stems from treating different evaluation goals as interchangeable. We identify three distinct motivations:
| Objective | Question | What matters |
|---|---|---|
| Non-regression (builder's view) | Did we break something? | Score trajectory, not absolute value |
| Leaderboards (user's view) | Which model is best for my task? | Relative ranking across unsaturated benchmarks |
| Scientific assessment | Where does the field stand? | Abstract capability constructs (most fragile) |
MMLU, once a gold standard, is now effectively solved by frontier models. The lifecycle from "impossible" to "saturated" is accelerating — pushing us toward harder, dynamic, and agentic evaluations.
Challenges That Silently Break Evaluation
In the paper, we detail several challenges that usually slip by unnoticed. Here are the highlights:
Benchmarks die fast. If a benchmark launches and the top model scores 80%, it's already dead. Ofir Press suggests a "-200%" mindset: design benchmarks so hard that today's models all fail.
Generic metrics mislead. ROUGE measures lexical overlap, not correctness. BERTScore measures semantic similarity, not truth. "Helpfulness" scores reward confident, well-phrased answers — even false ones. Error analysis (manually reviewing traces) remains the single most important eval activity.
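The failure mode is easy to reproduce with a toy lexical-overlap metric — a ROUGE-1-style token F1 sketched from scratch here, not any library's implementation. A factually wrong answer that copies the reference's wording outscores a correct paraphrase:

```python
def token_f1(candidate: str, reference: str) -> float:
    """Toy lexical-overlap F1 (ROUGE-1-style). Measures wording, not truth."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum(min(cand.count(t), ref.count(t)) for t in set(cand))
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

reference = "the treaty was signed in 1648"
wrong = "the treaty was signed in 1748"        # false, but lexically close
correct = "both parties ratified it in 1648"   # true, phrased differently

# The wrong answer wins on overlap:
print(token_f1(wrong, reference) > token_f1(correct, reference))  # True
```

Any metric that rewards surface similarity will show this inversion on some pairs, which is why reading the actual traces beats trusting the aggregate number.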
Silent bugs are everywhere. Chat template mismatches, sampling parameter drift, hardware indeterminism, and hidden context truncation all silently skew results. An evaluation score is meaningless without the exact context of how the model was served.
Contamination is the dirty secret. Models are trained on the internet; benchmarks are published on the internet. When GPT-4 was released, BigBench was excluded because the model had memorized the "Canary GUID" — a string meant to signal "do not train on this." This is Goodhart's law in action.
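A minimal contamination guard can be sketched as a corpus filter. The canary string below is a placeholder, not BIG-bench's actual GUID — each benchmark embeds its own unique marker:

```python
# Placeholder canary marker; real benchmarks embed their own unique GUID.
CANARY = "canary GUID 00000000-0000-0000-0000-000000000000"

def filter_contaminated(documents: list[str], canary: str = CANARY) -> list[str]:
    """Drop any training document containing the benchmark's canary string."""
    return [doc for doc in documents if canary not in doc]

corpus = ["ordinary web text", f"leaked eval item ... {CANARY}"]
print(filter_contaminated(corpus))  # ['ordinary web text']
```

The check only works if training pipelines actually run it — which is exactly what the GPT-4 memorization episode showed can fail.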
LLM judges inherit biases. Position bias, verbosity bias, self-preference bias, and sycophancy are all documented. A simple mitigation: give the judge an "escape hatch" (allow "Unknown") to reduce hallucinated verdicts.
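The escape hatch can also be enforced at parse time. A hypothetical judge-output parser (the prompt and verdict set below are illustrative, not from the paper) that coerces anything outside the allowed set to "unknown", so a malformed reply never counts as a pass or a fail:

```python
VERDICTS = {"correct", "incorrect", "unknown"}

JUDGE_PROMPT = """You are grading an answer against a reference.
Reply with exactly one word: correct, incorrect, or unknown.
Use "unknown" whenever the reference is insufficient to decide."""

def parse_verdict(raw: str) -> str:
    """Map a judge's free-text reply onto a closed verdict set.

    Replies outside the set fall back to "unknown" instead of being
    guessed at, so parsing failures never inflate pass or fail counts.
    """
    verdict = raw.strip().lower().rstrip(".")
    return verdict if verdict in VERDICTS else "unknown"

print(parse_verdict("Incorrect."))       # incorrect
print(parse_verdict("I think it's ok"))  # unknown
```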
Figure 2 — AI Evaluation Best Practices
From Models to Agents: A Fundamental Shift
Evaluating an agent is not about checking an answer key — it's about auditing a workflow. The anatomy of an agent evaluation involves four components: the harness (scaffold), the transcript (trace), the outcome (environment state), and the graders.
Pass@k vs Pass^k: The Reliability Gap
This distinction, highlighted by Philipp Schmid, is critical:
- Pass@k = probability of at least one success in k attempts (capability)
- Pass^k = probability of all k attempts succeeding (reliability)
An agent with a 70% per-attempt success rate looks great at Pass@3 (~97%). But Pass^3 reveals the truth: only a 34.3% chance of three consecutive successes. For autonomous agents operating without human oversight, Pass^k is the real measure of production readiness.
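The gap falls out of two one-line formulas, assuming k independent attempts with per-attempt success rate p:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds (capability)."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (reliability)."""
    return p ** k

# A 70%-reliable agent looks capable but not dependable:
print(round(pass_at_k(0.7, 3), 3))   # 0.973
print(round(pass_hat_k(0.7, 3), 3))  # 0.343
```

Note that Pass^k decays exponentially in k, so even small per-attempt failure rates compound quickly over long autonomous runs.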
Today, Evaluations Need Environments
Static "read-only" benchmarks like MMLU are insufficient for agents that must read, write, and adapt. Two recent benchmarks illustrate the shift:
GAIA2 (Meta & Hugging Face) runs agents inside a simulated smartphone ecosystem — Email, Calendar, Contacts, FileSystem. Agents don't just answer questions; they change the state of the world. It tests ambiguity handling, tool failure recovery, and temporal reasoning.
TextQuests uses interactive fiction games (Zork and others) to stress-test long-context reasoning. It exposes spatial disorientation, state hallucination ("inventory amnesia"), and the inability of current agents to allocate reasoning budget dynamically.
Both share a key insight: at equal accuracy, a model that solves a task in 3 minutes with 500 tokens is superior to one that takes 30 minutes and 50k tokens. Evaluation must now account for the Pareto frontier of performance vs. cost.
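One way to operationalize this is to report only the runs on the cost/score Pareto frontier. A minimal sketch over hypothetical (tokens used, task score) pairs:

```python
def pareto_frontier(runs: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep (cost, score) points not dominated by a cheaper-and-better run."""
    frontier: list[tuple[float, float]] = []
    best_score = float("-inf")
    for cost, score in sorted(runs):  # cheapest first
        if score > best_score:        # strictly beats everything cheaper
            frontier.append((cost, score))
            best_score = score
    return frontier

# Hypothetical agent runs: (tokens used, task score)
runs = [(500, 0.9), (50_000, 0.9), (2_000, 0.95), (1_000, 0.85)]
print(pareto_frontier(runs))  # [(500, 0.9), (2000, 0.95)]
```

The 50k-token run is dominated outright: it matches the 500-token run's score at 100× the cost.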
The Evaluation Loop for Agents
Evaluating in environments requires a fundamentally different architecture:
- Setup — Clean state for every trial (no contamination between runs)
- Execution — Capture the full trace: Observation → Thought → Action → Observation
- Teardown & Verification — Grade side effects deterministically (check the SMTP logs, not the LLM's claim)
- Extensibility — Modern environments like ARE use MCP to plug in new tools without rewriting the harness
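The loop above can be sketched as a single function, with a toy agent and environment standing in for a real harness (the interfaces here are illustrative assumptions, not ARE's API):

```python
from dataclasses import dataclass, field

@dataclass
class Trial:
    """One agent trial: a fresh environment, a full trace, a graded outcome."""
    task: str
    trace: list[str] = field(default_factory=list)
    passed: bool = False

def run_trial(task: str, agent, env_factory) -> Trial:
    """Setup -> execution -> verification -> teardown, as one unit."""
    env = env_factory()                  # setup: clean state for every trial
    trial = Trial(task)
    try:
        for step in agent(task, env):    # execution: capture the full trace
            trial.trace.append(step)
        trial.passed = env.verify()      # grade side effects, not the LLM's claim
    finally:
        env.teardown()                   # teardown runs even if the agent crashes
    return trial

# Toy environment and agent to exercise the loop:
class MockEnv:
    def __init__(self): self.sent = []
    def verify(self): return "hello" in self.sent
    def teardown(self): self.sent.clear()

def toy_agent(task, env):
    env.sent.append("hello")
    yield "Action: send('hello')"

print(run_trial("send a greeting", toy_agent, MockEnv).passed)  # True
```

Keeping verification inside the loop, against environment state rather than model output, is what makes the grade deterministic and repeatable.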
Figure 3 — A Roadmap from "No Evals" to "Trusted Evals"
Closing the Evaluation Gap
The future of AI progress may depend less on architectural breakthroughs than on whether we can build evaluation instruments that keep pace with the systems they govern. As Snorkel AI and partners recently argued, we are trapped in an "evaluation gap" where our ability to develop agents has outpaced our ability to measure them.
Critically, closing this gap cannot be exclusively an Anglocentric endeavor. We urgently need grant initiatives dedicated to multilingual benchmarks and under-served languages, preventing a future where agentic autonomy is robust in English but fragile everywhere else.
The open question is no longer how intelligent our systems can become, but how much uncertainty we are willing to tolerate — and how precisely we can measure and moderate it.
Read the full paper for more detailed analysis and regulatory context:
Towards More Standardized AI Evaluation: From Models to Agents — arXiv:2602.18029