2026 February "AI Evaluation" Digest
Quis custodiet ipsos custodes?
Some of the most interesting benchmarks are starting to look less like thermometers and more like courtrooms. Instead of being passively registered, task success must be argued over, weighed, and ultimately adjudicated. This month’s through-line is that evaluation is no longer just about better questions; it is about understanding the evaluator. A growing body of work emphasizes the need to rigorously assess the competence, bias, and reliability of the humans, AIs, or hybrids acting as arbiters of performance.
Attention to the judges is showing up in multiple ways. New methods estimate judge reliability and attach uncertainty to preference-based rankings. Psychometric-style audits treat judges as measurement instruments rather than neutral oracles. And there is growing evidence that apparent “benchmark plateaus” can reflect judge error or measurement limits rather than genuine capability ceilings. This shift is not a revolution. Measurement science has long insisted on reliability, construct validity, and calibrated inference when tests are used as instruments rather than anecdotes. What is new is that AI evaluation is reemphasizing these principles under frontier pressure.
If last year’s evaluation discourse was “benchmarks are broken,” this month’s update is sharper. Benchmarks break fastest when the judge can’t keep up. The fix is disciplined attention to calibration, auditing, statistical power and uncertainty, and transparency about who judged what under which conditions.
Takes
This preprint regards AI capabilities and propensities as “dispositional” properties, a concept borrowed from philosophy that describes how systems would behave under varying contextual conditions. The authors argue that current evaluation practices, including benchmarks, red-teaming, and latent-factor analysis, fail to measure dispositions because they don’t identify the causal contextual factors that drive behaviour. They advance a new measurement framework built on explicit causal hypotheses that specify which contextual properties matter, how they should be operationalised, and how those features systematically map onto behavioural probability.
In this new blog post, Miles Brundage advocates for a “triage” or “80/20” stance on AI policy: focusing 20% of the effort on mitigating 80% of the most catastrophic risks. He suggests prioritizing audits of the top firms and reinforcing societal defenses as key protections against the worst outcomes, such as AI-enabled biological weapons, rogue AI takeovers, or AI-enabled global totalitarianism.
The authors of this Nature perspective argue that evaluating LLM “moral performance” is insufficient; what matters is moral competence, that is, whether models respond appropriately for the right reasons. The authors identify three challenges (facsimile reasoning, multidimensional trade-offs, and moral pluralism) and propose adversarial, parametric, and pluralistic evaluations to probe them.
The authors of this Nature comment argue that AGI, by the Turing Test’s standards, has already been achieved, and that debates should focus on degree rather than on whether some fixed threshold has been crossed. They dismiss objections based on embodiment, autonomy, and hallucination as non-essential to general intelligence. Shane Legg, who coined the term AGI, highlighted the piece on X, noting that even a system that far exceeds human-level performance on some metrics should not necessarily be considered AGI if it still fails at seemingly trivial tasks. As the concept of AGI mutates to survive, we wonder if it is true that Einstein knew everything about the centrifugal force in his cup of tea but was often too absent-minded to heat the water.
News
Is it a bird, is it a plane? Is it a taxonomy, a collection, an ontology, a conceptual net? It’s AI Construct Lexis, a “nomological network” to map and collect AI constructs, measurement instruments, behaviors and tasks as an integrated effort for the (understanding of the) evaluation of AI systems. It takes inspiration from the Cognitive Atlas, a similar initiative focused on natural cognition.
Parlez-vous français mieux que l’anglais? The French government is developing “compar:IA”, an open-source service for collecting large-scale human preference data, mostly in French. Are user judgments in French different from those collected on predominantly English platforms such as LLM Arena? Comme ci, comme ça.
The International AI Safety Report was launched last year as an authoritative source reporting on the status of AI Safety. The new edition, released in February 2026, includes many facts and trends about safety evaluations, but also capabilities evaluations. For instance, it points to evidence that some models are beginning to sandbag, underperforming during safety testing.
Evals-consensus is a new initiative intended to build consensus on evaluation practices. It is functionally a pre-registered protocol based on consultation using the Delphi method with input from a wide range of stakeholders from governance, policy, academia, industry, and evaluation practice. Feel free to nominate yourself on the website.
Was 2025 the Year of the “AIgent”? According to the new edition of the AI Agent Index by MIT, it was, and progress and the concentration of actors and locations were featured among the key findings. Pending matter: safety!
ML Contests released their 2025 edition of “The State of Machine Learning Competitions”.
The International Programme on AI Evaluation starts its Open Seminars, live lectures by leading researchers. You can already watch the inaugural ones on Uncertainty Estimation by Tom Dietterich and Evaluating Multi-Agent / Social Systems by Joel Leibo!
Methodology
There are model cards, evaluation sheets and other templates related to reporting evaluation results or model risks. However, many organisations do not know how to start an evaluation project and would benefit from a guiding checklist whose execution would also serve as documentation and reporting of the project. The PrepEval Protocol fills that gap and emphasises the need for pre-registration before the bulk of the evaluation commences.
Extensive work is being conducted on capabilities (monotonic abilities that determine what a model *can* do), but this new preprint (whose authors include some of the contributors to this digest) claims that capabilities alone are insufficient indicators of model task success. In response, the team proposes a method for measuring propensities (non-monotonic traits of what models *tend* to do) that they show can incrementally improve the prediction of task outcomes.
The EvalEval Coalition is producing a series of outputs that combine the expertise of the community. In this one, “Every Eval Ever”, they respond to several attempts to integrate data from evaluations, which is usually costly and fragmented. To mitigate this they present a standardised format and a growing dataset to collect AI evaluation results. Join the initiative or contribute!
In a policy forum published in Science, several experts from the EU AI Office and other institutions analyse the tradeoff between the burden of AI evaluations and the effectiveness of such evaluations. Ensuring evaluations are proportionate depends on verifying the evaluation is suitable and necessary for the given evaluative task and then balancing between burden and effectiveness. The paper illustrates this balancing with an example.
This preprint collects or adapts traditional metrics of (self-)consistency, robustness, (self-)predictability and safety under the category of “reliability”. When applied to modern AI models, the authors conclude what everybody knew and has been shown repeatedly in the past years: performance increases but reliability, not that much (or has even diminished). We wonder why everybody these days is putting “Science” in the title of their papers. Find the other one in this newsletter!
This MSc thesis makes the case that reasoning evaluations should be statistical inference, not leaderboard roulette, proposing a Bayesian alternative to Pass@k that produces posterior estimates and credible intervals for faster, more stable rankings on small (and costly) benchmarks.
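For readers who want a feel for what "statistical inference, not leaderboard roulette" can look like, here is a minimal sketch. It is not the method from the thesis: it assumes a plain Beta-Binomial model with a uniform prior over a model's success rate, and the function name `beta_posterior_summary` and all of its parameters are our own illustrative choices. The credible interval is approximated by sampling from the posterior with the standard library.

```python
import random


def beta_posterior_summary(successes, trials, prior_a=1.0, prior_b=1.0,
                           level=0.95, draws=20000, seed=0):
    """Posterior mean and equal-tailed credible interval for a success rate.

    Assumption (ours, for illustration): attempts are independent Bernoulli
    trials and the prior is Beta(prior_a, prior_b), uniform by default.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    # Conjugate update: Beta prior + binomial data gives a Beta posterior.
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)
    # Approximate the interval from sorted posterior samples.
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lo = samples[int(tail * draws)]
    hi = samples[int((1.0 - tail) * draws) - 1]
    return mean, lo, hi


# Same observed pass rate (70%), very different amounts of evidence.
small = beta_posterior_summary(successes=7, trials=10)
large = beta_posterior_summary(successes=70, trials=100)
print("10 attempts: mean=%.3f, interval=(%.3f, %.3f)" % small)
print("100 attempts: mean=%.3f, interval=(%.3f, %.3f)" % large)
```

The point of the comparison is that both runs have the same raw pass rate, but the interval for the 10-attempt run is much wider, which is exactly the information a single Pass@k number hides.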
Rabooki and colleagues introduce an item-response theory (IRT)-based method for anomaly detection benchmarking.
This new preprint from Anthropic aims to decompose AI system errors into bias and variance. The authors suggest that AI system variance does not reliably decrease with reasoning length or system scaling. However, the paper has received some criticism for its definition of incoherence and for some cherry-picking in its conclusions.
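As a reminder of what a bias and variance split means, here is a minimal sketch of the textbook squared-error decomposition. It is not the preprint's definition of incoherence: it assumes repeated runs that each produce a numeric answer scored against a known target, and the function name `bias_variance` and the example numbers are hypothetical.

```python
from statistics import fmean


def bias_variance(predictions, target):
    """Split mean squared error over repeated runs into bias^2 and variance.

    Assumption (ours): each run yields one numeric prediction for the same
    question, and `target` is the known correct value.
    """
    mean_pred = fmean(predictions)
    # Systematic error: how far the average answer is from the truth.
    bias_sq = (mean_pred - target) ** 2
    # Run-to-run scatter around the average answer (population variance).
    variance = fmean([(p - mean_pred) ** 2 for p in predictions])
    mse = fmean([(p - target) ** 2 for p in predictions])
    return bias_sq, variance, mse


# Four made-up runs on a question whose true answer is 10.
bias_sq, variance, mse = bias_variance([8, 12, 10, 14], target=10)
print(bias_sq, variance, mse)             # 1.0 5.0 6.0
print("variance share:", variance / mse)  # fraction of error due to scatter
```

In this toy case most of the error comes from scatter between runs rather than from a consistent offset, which is the kind of question the decomposition is meant to answer.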
In this preprint, the authors propose DMLRANK, a statistical method for ranking LLMs using pairwise preference data (including data produced by LLM-as-a-judge pipelines). The key contribution is a way to attach meaningful uncertainty estimates (“error bars”) to leaderboard scores even when flexible machine-learning components are used, rather than relying on brittle parametric assumptions.
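To make "error bars on a leaderboard score" concrete, here is a deliberately simple sketch. It is not DMLRANK: it assumes a list of independent pairwise judgments between two models and uses a plain percentile bootstrap on the win rate. The function name `win_rate_with_interval` and the data are invented for illustration.

```python
import random


def win_rate_with_interval(outcomes, level=0.95, resamples=2000, seed=0):
    """Win rate of model A over model B with a percentile-bootstrap interval.

    `outcomes` holds 1 where the judge preferred A and 0 where it preferred B.
    Assumption (ours): judgments are independent and ties are excluded.
    """
    if not outcomes:
        raise ValueError("need at least one comparison")
    rng = random.Random(seed)
    n = len(outcomes)
    point = sum(outcomes) / n
    # Resample the comparisons with replacement and recompute the win rate.
    stats = sorted(sum(rng.choices(outcomes, k=n)) / n
                   for _ in range(resamples))
    tail = (1.0 - level) / 2.0
    lo = stats[int(tail * resamples)]
    hi = stats[int((1.0 - tail) * resamples) - 1]
    return point, lo, hi


# Made-up data: A preferred in 60 of 100 comparisons.
judgments = [1] * 60 + [0] * 40
point, lo, hi = win_rate_with_interval(judgments)
print("win rate %.2f, interval (%.2f, %.2f)" % (point, lo, hi))
```

If the interval includes 0.5, the data do not clearly separate the two models, however decisive the point estimate may look on a leaderboard.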
In this preprint, the authors point out that “confidence” can mean two different things: how sure a model is about “this” one answer, versus how likely it is to succeed on a task overall. They evaluate several ways of estimating that second notion and report that some simple signals can help decide when to trust a model or when to spend more attempts/compute.
Benchmarking at the Edge of Comprehension considers a regime where problems get so hard that humans can’t reliably write the tests, know the right answers, or fully check long solutions (again the recurrent theme with which we open this newsletter). They propose an evaluation setup where answers “pass” unless someone can point to a specific, checkable mistake, and they show initial results suggesting this can still produce usable rankings on difficult math tasks.
Item Response Theory (IRT) is a powerful tool for evaluation but, like any evaluation tool, it can also be used to improve models. In this paper, a variant of IRT using artificial crowds (IRT-AC) is used to improve model training in curriculum learning settings. As the authors note under future work, there is also room to explore different ways in which difficulty (an output of IRT analysis) can be used to choose the examples (e.g., zone of proximal development).
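For readers new to IRT, here is a minimal sketch of how a difficulty estimate could drive example selection. It is not IRT-AC: it assumes a standard two-parameter logistic (2PL) item model, and the "zone of proximal development" window of 0.4 to 0.7 predicted success, along with the names `p_correct` and `pick_zpd_items`, are our own illustrative assumptions.

```python
import math


def p_correct(ability, difficulty, discrimination=1.0):
    """2PL item response function: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def pick_zpd_items(ability, difficulties, low=0.4, high=0.7):
    """Indices of items that are neither too easy nor too hard.

    The [low, high] window on predicted success is a hypothetical stand-in
    for a "zone of proximal development"; tune it to taste.
    """
    return [i for i, b in enumerate(difficulties)
            if low <= p_correct(ability, b) <= high]


# Made-up item difficulties on the usual logit scale.
difficulties = [-2.0, -0.5, 0.0, 0.3, 2.0]
print(pick_zpd_items(ability=0.0, difficulties=difficulties))  # [1, 2, 3]
```

An item whose difficulty equals the learner's ability is answered correctly half the time under this model, so the window above keeps items slightly on the easy side of that point and drops the trivially easy and the hopeless ones.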
This preprint introduces a framework to replace vague “human-level” AI comparisons with rigorous, human-anchored psychometric scales by calibrating benchmark items against a projected world population. By employing Large Language Models to extrapolate task success rates from restricted demographic samples to a global distribution, the methodology establishes distinct logarithmic difficulty bases across 18 cognitive dimensions. Ultimately, this approach creates a commensurate, standardized ruler that precisely quantifies where AI capabilities sit relative to human population tails, actively correcting for the biases of traditional convenience sampling.
Findings and Results
Spiesberger and colleagues have released a preprint which reaffirms that data contamination is widespread, and that task contamination (i.e., training on the same task even without identical instances) can sometimes reach the level of near-exact duplication. Together, these findings would suggest that model generalization is shallow and does not extend to true out-of-distribution settings.
A systematic analysis of saturation across 60 LLM benchmarks has found that half of them are saturated. The key finding is that saturation is primarily driven by two structural factors: benchmark age (reflecting cumulative exposure and optimization pressure) and test set size (which determines measurement resolution). Notably, commonly assumed safeguards such as private test sets, open-ended formats, or template diversity showed no robust protective effect against saturation.
The UK’s AI Security Institute (AISI), working alongside the government’s new Future of Work Unit, conducted a randomised controlled trial with 500 participants to measure how access to a state-of-the-art model affects productivity on common workplace tasks derived from the O*NET occupational taxonomy. The results showed that AI use led to an average 25% improvement in task quality and 61% higher productivity, though impacts varied significantly across tasks. Structured analytical tasks, such as monitoring processes, benefited most. More subjective, open-ended tasks, such as planning and prioritisation, showed no significant gains. This pilot study is being expanded to cover more work activities and test agentic systems, with a view to building an evidence-based picture of the impact of AI on the UK labour market.
This preprint presents a comprehensive survey that systematizes reasoning failures in Large Language Models (LLMs) by introducing a novel categorization framework distinguishing between embodied and non-embodied (informal vs. formal) reasoning. The authors classify failure modes into three distinct types: fundamental (intrinsic to architecture), application-specific (domain limitations), and robustness-related (inconsistent performance). By analyzing root causes and reviewing mitigation strategies, the work aims to unify fragmented research efforts to guide the development of more reliable and robust reasoning systems. While classifications are always great, we missed the fundamental question of WHY failures occurred, and a discussion on potential architectural reasons, and not just prompting iterations.
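A quick back-of-the-envelope sketch shows why test set size limits measurement resolution. It is not the analysis from the paper: it assumes independent items and a simple binomial model for accuracy, and the function name `accuracy_standard_error` is our own.

```python
import math


def accuracy_standard_error(accuracy, n_items):
    """Binomial standard error of an accuracy estimate on n_items questions.

    Assumption (ours): items are independent and equally informative.
    """
    if n_items <= 0 or not 0.0 <= accuracy <= 1.0:
        raise ValueError("need n_items > 0 and accuracy in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


# How finely can a benchmark resolve a model scoring 90%?
for n in (100, 1000, 10000):
    se = accuracy_standard_error(0.9, n)
    # Roughly 95% of repeated measurements fall within about 2 SE.
    print("n=%5d  SE=%.4f  approx. 95%% half-width=%.4f" % (n, se, 1.96 * se))
```

On a 100-item test, a model at 90% carries a standard error of about 3 percentage points, so most of the remaining headroom sits inside the noise; it takes a hundredfold increase in items to shrink that error tenfold.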
Benchmarks
CL-Bench is a new benchmark that features tasks whose solution depends almost exclusively on the information that has been presented in the context, avoiding contamination or extrapolation. The tasks force the test takers to think about the newly given information and reason inductively and deductively from it. Guess what? Models perform poorly (GPT-5.1 solves only 23.7%). For how long?
The new BabyReasoningBench reframes AI evaluation for “baby” language models trained on child-directed input, introducing 19 developmentally grounded reasoning task families (theory of mind, analogy, causal learning, and more) with systematically varied MCQ items. On two BabyLM GPT-2 baselines, accuracy is low but highly non-uniform: scaling data mainly helps causal/physical inference, while explicit false-belief and pragmatics-sensitive tasks remain near floor. This is exactly the kind of dissociation you want from a diagnostic evaluation.
This preprint proposes new benchmarks for evaluating 15 “individualized” student safety risks (during pedagogical uses of AI), based on 14 student attributes. The authors evaluate a number of LLMs, such as Gemini 1.5 Pro, GPT-4o, and Claude-3.5-Sonnet, and find that those models score less than 2.3 out of 5. Those results, however, mix propensities with capabilities, so it will be interesting to see how newer models trend.
Named after the Ethereum Virtual Machine (EVM), EVMbench is a new benchmark for testing whether AI agents can handle smart contract security. It asks agents to find bugs, fix them, and even exploit them in realistic blockchain setups, with results checked automatically.
OpenAI reports that SWE-bench Verified is no longer a reliable measure of frontier coding ability because many items have evaluation flaws and the benchmark is increasingly likely to be contaminated. In their post, they recommend moving to SWE-bench Pro and developing newer, cleaner coding evaluations.
In this AAMAS paper, MoralityGym is introduced as a benchmark of trolley-problem-style RL environments designed to test whether agents follow hierarchically ordered moral norms rather than optimizing a single reward. The benchmark formalizes moral priorities as “Morality Chains” and evaluates policies with a metric that strongly prioritizes higher-ranked constraints, separating task success from moral compliance.
This preprint defines the class of all conceivable human games, capturing the capabilities that matter for humans and having desirable evaluative properties. AI Gamestore is a first exploration of this space with 100 video games, with the extraordinary feature of being generated by AI models (and then validated by humans)! The results show that there is no need to look for superadvanced academic questions (as in Humanity’s Last Exam) or weird transformations of colourful square configurations (like ARC) to see how current AI systems fail. Unlike other approaches, the preprint also includes a characterisation of capabilities for each video game, showing where the current limitations lie. Spoiler: it’s “memory” where VLMs fail catastrophically! Do you remember?
News to share? Feel free to reach out to ai.evaluation.newsletter@gmail.com.
Getting the Digest: Once a month if you join at aievaluation.substack.com
Editor-in-Chief: Zack Tidler
Contributors: Behzad Mehrbakhsh, Fernando Martinez Plumed, Pablo Antonio Moreno Casares, José H. Orallo, Peter Romero, Wout Schellaert, Konstantinos Voudouris, Daniel Romero-Alvarado



The "Don't Pass@k" paper (in the text: master's dissertation) was accepted at ICLR 2026: https://openreview.net/forum?id=PTXi3Ef4sT
The point about benchmarks breaking when the judge can't keep up hits something I noticed practically. Built on Mistral during the EU Hackathon last weekend and what struck me wasn't any single failure - it was the cumulative management overhead.
Constant small corrections that add up. It's not captured in any benchmark I know of, but it changes what you're willing to build. 'Developer experience under realistic pressure' is probably its own evaluation dimension. Wrote about it here if anyone wants a non-benchmark data point: https://thoughts.jock.pl/p/mistral-ai-honest-review-eu-hackathon-2026