Welcome to this month's AI Evaluation Digest — a space dedicated to challenging the status quo and thinking broadly about how we evaluate AI. In this issue, we ask: Is it time to change the paradigm of AI evaluation?
Recent interdisciplinary research is challenging the long-standing reliance on traditional benchmarks, sparking debate about whether a few indicators from these benchmarks truly capture the complexity and risks of real-world applications. Researchers at the European Commission ask “Can we trust AI benchmarks?” and indicate growing concerns about data quality, contextual relevance, and unanticipated risks in a paradigm that relies solely on aggregate measures. Issues such as biased datasets and overly rigid scoring protocols can oversimplify multifaceted phenomena, reducing benchmark scores to little more than AI marketing signals rather than reliable indicators.
But is this the one and only paradigm? This paper shows that there are in fact several paradigms currently shaping AI evaluation (Benchmarking, Evals, Construct-Oriented, Exploratory, Real-World Impact and TEVV), each with its own goals, methodologies and cultural influences. The synthesis calls for greater cross-fertilisation between approaches: many evaluation methodologies have developed in isolation, and integrating their perspectives is essential to building more accountable, holistic frameworks.
Thomas Kuhn argued that true scientific revolutions arise from deep-seated dissatisfaction and a willingness to challenge fundamental assumptions; that sentiment now seems to resonate throughout our field. This month's digest invites you to rethink existing evaluation frameworks and to conceive of innovative methodologies that better reflect the inherent complexity of AI.
Funding Opportunities
One call speaks directly to this need to rethink AI evaluation: Improving Capability Evaluations, from Open Philanthropy. The funding opportunity invites proposals addressing key challenges in AI evaluation, from oversaturated benchmarks and underdeveloped measurement science to limited third-party oversight. Grants range from $0.2 million to $5 million over 6 months to 2 years, supporting projects that refine how we measure and interpret frontier AI performance for robust governance and risk mitigation.
Evaluation Methodology
This paper applies psychometric techniques from human testing to refine LLM leaderboards, going beyond simple averages to capture the latent cognitive strengths of models. Using data from the Hugging Face Leaderboard as a case study, the study demonstrates that a psychometrically informed ranking provides a more robust, noise-filtered assessment of LLM performance.
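For readers who want a concrete feel for the idea, here is a minimal sketch of the simplest item-response-theory model (a Rasch model fitted by gradient ascent). The paper itself may well use a more sophisticated psychometric model, so treat the names and details below as our own illustration rather than its method.

```python
import numpy as np

def fit_rasch(responses, n_iters=2000, lr=0.1):
    """Fit a simple Rasch model to a models-by-items 0/1 score matrix.

    responses: array of shape (n_models, n_items), where 1 = correct.
    Returns a latent ability per model and a latent difficulty per item.
    """
    responses = np.asarray(responses, dtype=float)
    n_models, n_items = responses.shape
    ability = np.zeros(n_models)
    difficulty = np.zeros(n_items)
    for _ in range(n_iters):
        # P(correct) under the Rasch model: sigmoid(ability - difficulty)
        p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
        resid = responses - p              # gradient signal of the log-likelihood
        ability += lr * resid.mean(axis=1)
        difficulty -= lr * resid.mean(axis=0)
        difficulty -= difficulty.mean()    # pin the scale's origin
    return ability, difficulty

# Rank models by estimated ability rather than by responses.mean(axis=1);
# the two orderings can differ when items vary widely in difficulty.
```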
This paper argues that as LLMs evolve into general-purpose AI (GPAI), the challenges of adversarial ML — from defining attack success (e.g., jailbreaks) to evaluating defences — have become significantly more complex and harder to evaluate rigorously. The authors warn that without a rethink of our evaluation frameworks, another decade of research may yield little meaningful progress against these increasingly ill-defined and adaptive threats.
Automated Capability Discovery (ACD) uses a foundation model as a scientist to generate and self-evaluate open-ended tasks that reveal hidden capabilities and failure modes in subject models. ACD automatically produces interactive visualisations and structured capability reports from different scientist/subject pairings. “Capabilities” are still performance averages over kinds of tasks, but they are split into three levels of difficulty (moderate, hard and very hard).
This paper presents an automated method for generating first-order logic problems of controllable complexity, based on Zermelo-Fraenkel set theory, to evaluate the logical reasoning abilities of LLMs.
What should an AI assessor optimise for? analyses whether an assessor (a model that predicts performance metrics, i.e. loss values, for other AI systems) should be trained directly on the target metric or on an alternative proxy metric that is then mapped back. The surprising finding: proxy losses may be more effective.
CAPA (Chance-Adjusted Probabilistic Agreement) is a new metric that measures how similarly LLMs answer a set of questions. It extends Cohen’s kappa (which adjusts observed agreement for the agreement expected by chance) by taking into account the actual wrong answers produced and the probabilities assigned to the different options, and by normalising for the fact that models with higher average probability agree more often by “chance”. Using this similarity metric, the authors find that 1) LLMs-as-judges favour more similar models; 2) weak-to-strong supervision works better when similarity is lower; and 3) model mistakes are becoming more similar as capabilities increase, which, combined with the first two points, may reduce the usefulness of AI oversight.
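To make the “chance-adjusted” idea concrete, here is a minimal kappa-style sketch computed from two models' answer probabilities. It illustrates the general recipe (observed probabilistic agreement minus chance agreement, renormalised) rather than the exact CAPA definition, so take the details as our own assumptions.

```python
import numpy as np

def kappa_style_agreement(p_model_a, p_model_b):
    """Kappa-style agreement between two models from their answer probabilities.

    p_model_a, p_model_b: arrays of shape (n_questions, n_options), each row
    a probability distribution over the answer options for one question.
    """
    p_model_a = np.asarray(p_model_a, dtype=float)
    p_model_b = np.asarray(p_model_b, dtype=float)
    # Observed probabilistic agreement: chance that both models sample the
    # same option, averaged over questions.
    observed = np.mean(np.sum(p_model_a * p_model_b, axis=1))
    # "Chance" agreement: the same quantity if each model always answered
    # from its average option distribution, ignoring the question.
    chance = np.sum(p_model_a.mean(axis=0) * p_model_b.mean(axis=0))
    # Kappa-style normalisation: 0 = no better than chance, 1 = identical.
    return (observed - chance) / (1.0 - chance)
```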
A recent position paper strongly criticises AGI as the goal for AI, arguing that aiming for AGI creates six traps that make it harder to distinguish hype from reality. From the perspective of AI evaluation, everything falls apart if the target metrics are wrong: this is a problem of “external validity—the question of whether a measurement corresponds to the real-world phenomenon it’s supposed to capture”. But what are the “good AI profiles” we should aim for, and how can we characterise and measure how close we get to them?
Am I wrong? Can you tell when it is wrong? Will it be wrong? Was it?
Uncertainty estimation is usually performed by the model itself, and corresponds to the question “Am I wrong?”. LLMs often struggle to provide absolute confidence values, and we see a variety of methods aimed at better calibrating model outputs to mitigate hallucinations: asking models to compare pairs of questions and translating these preferences into scores using rank aggregation (e.g. Elo), or estimating token-level uncertainty in real time using frameworks such as LogU. Benchmarks such as ConfidenceBench assess calibration by asking models to self-report their confidence on multiple-choice questions, although fine-tuning on shortcut-laden data can lead to overconfident predictions that obscure true performance.
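As a reference point for what “assessing calibration” means in practice, here is a generic expected-calibration-error sketch; this is our own illustration of the standard metric, not the protocol of any benchmark mentioned above.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Bin self-reported confidences and compare average confidence with
    observed accuracy in each bin; return the weighted average gap."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(confidence[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Example: a model that reports 0.9 confidence on answers that are right
# only 60% of the time will show a large gap in the top bin.
```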
PredictaBoard considers anticipative uncertainty estimation from outside (“Will it be wrong?”) and separates the question of how good the model is (“How wrong is it?”) from how well one can predict when the model will be correct (“Can you predict when it is wrong?”). These two closely related questions depend on both the model and a family of predictors of the model’s responses, including the model itself. PredictaBoard evaluates pairs of an LLM and an assessor that predicts the LLM’s scores on individual questions. By measuring metrics such as the Accuracy-Rejection Curve and the Predictably Valid Region, the framework jointly measures the model's performance and the assessor's ability to anticipate errors, making predictability a key criterion for assessing model reliability in high-stakes settings.
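Here is a minimal sketch of how an accuracy-rejection curve can be computed from an assessor's confidence scores and the LLM's actual correctness; variable names are ours, and PredictaBoard's own implementation may differ.

```python
import numpy as np

def accuracy_rejection_curve(confidence, correct, rejection_rates):
    """Accuracy after rejecting the fraction of instances the assessor is
    least confident about.

    confidence: assessor's predicted probability that the LLM answers correctly.
    correct:    1/0 actual correctness of the LLM on each instance.
    """
    order = np.argsort(confidence)                 # least confident first
    correct_sorted = np.asarray(correct, dtype=float)[order]
    n = len(correct_sorted)
    curve = []
    for r in rejection_rates:
        kept = correct_sorted[int(r * n):]         # drop the bottom r fraction
        curve.append(kept.mean() if kept.size else float("nan"))
    return curve

# e.g. accuracy at 0%, 20% and 50% rejection:
# accuracy_rejection_curve(conf, corr, [0.0, 0.2, 0.5])
```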
QueRE is a post-hoc assessor (“Was it wrong?”) that uses follow-up queries to extract low-dimensional, black-box representations for predicting instance-level performance or detecting other behavioural attributes of LLMs (e.g., whether they were influenced by an adversarial prompt, or whether the correct model version is being served). The motivation is to extend methods that detect false statements from model internals to closed-source models.
Contamination
This survey of data contamination for LLMs examines how unintentional overlap between training and test data can artificially boost performance, compromising the true generalisation capabilities of LLMs. It categorises contamination into phase-based and benchmark-based types, reviews current contamination-free evaluation strategies and detection methods (white, grey and black box), and proposes future directions to promote more rigorous and reliable benchmarking.
Benchmarking
Continuing with contamination, AdEval is a dynamic scoring method that uses alignment-based data to mitigate contamination effects, thereby improving the reliability of LLM scores.
MATH-Perturb: By modifying level 5 problems from the MATH dataset, this benchmark questions whether LLMs truly generalise mathematical reasoning or simply memorise problem-solving approaches when faced with small perturbations. (Note: Remember in last month’s digest we mentioned that MATH may have intellectual property issues – do perturbed versions of the questions infringe the IP too?).
EnigmaEval comprises 1,184 puzzles from competitive events and tests LLMs on advanced, multi-step deductive and lateral reasoning skills, revealing significant performance gaps compared to human solvers.
Towards a Biological Knowledge Benchmark is a comprehensive assessment that evaluates 29 LLMs against eight curated benchmarks — six for biology and chemistry and two for security rejection rates — and includes custom models with security guardrails removed to compare biosecurity potential against expert baselines.
RV‑Bench evaluates LLM mathematical reasoning by generating "unseen" questions (i.e., instantiating classical mathematical problems with randomised variable combinations). Results reveal significant drops in accuracy compared to fixed-variable benchmarks. This is yet another indication of benchmark contamination.
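The randomised-variable idea is easy to illustrate. The toy generator below is our own hypothetical example of the general recipe (a fixed problem template instantiated with fresh numbers), not an item from RV-Bench itself.

```python
import random

def make_variant(seed=None):
    """A fixed problem template instantiated with fresh numbers, so a model
    cannot rely on a memorised answer to the original benchmark item."""
    rng = random.Random(seed)
    speed = rng.randint(40, 90)     # km/h
    hours = rng.randint(2, 9)
    question = (f"A train travels at {speed} km/h for {hours} hours. "
                f"How far does it travel, in km?")
    answer = speed * hours
    return question, answer

# make_variant(0) and make_variant(1) test the same reasoning skill with
# different surface numbers, so a drop in accuracy signals memorisation.
```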
Agents
This paper studies the ability of agents to autonomously deploy computer science research repositories, evaluating LLM performance with respect to accuracy, efficiency and the quality of the generated deployment scripts. The aim is to boost developer productivity by streamlining deployment workflows and improving the management of complex development processes.
Science-GYM is a Python library that provides a benchmark environment for testing how well AI agents understand basic physics through tasks involving data collection, experimental design and equation discovery. Each experiment is defined by a state space and an action space, with rewards directing the discovery process. The complexity of each task is determined by the type of observations provided to the agent and by the nature of the reward.
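To give a flavour of what such an environment looks like, here is a hypothetical toy example in the same spirit; the class and method names are ours and do not reflect Science-GYM's actual API.

```python
import random

class FallingBallEnv:
    """Toy 'equation discovery' task: actions are drop heights, observations
    are (noisy) fall times, and the reward measures how close the agent's
    recovered constant is to the true gravitational acceleration."""

    def __init__(self, g=9.81, noise=0.01, seed=None):
        self.g = g
        self.noise = noise
        self.rng = random.Random(seed)

    def step(self, height):
        # True physics: t = sqrt(2h / g), observed with measurement noise.
        fall_time = (2 * height / self.g) ** 0.5
        return fall_time + self.rng.gauss(0, self.noise)

    def reward(self, estimated_g):
        # The closer the agent's discovered constant is to g, the better.
        return -abs(estimated_g - self.g)

# An agent would choose heights (actions), collect fall times (observations),
# fit t = sqrt(2h / g_hat), and be scored by reward(g_hat).
```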
The Best Paper Title Award
After many digests highlighting very original or pathetic benchmark names and paper titles, we are proud to announce the first AI Evaluation Digest Best Paper Title Award, which is presented to Nishant Balepur, Rachel Rudinger and Jordan Boyd-Graber from the University of Maryland, for their paper:
Which of These Best Describes Multiple Choice Evaluation with LLMs?
A) Forced B) Flawed C) Fixable D) All of the Above
Congratulations! On top of the achievement in title composition, the paper demonstrates how the multiple-choice evaluation format for LLMs leads to unreliable performance metrics, and proposes solutions from educational best practices.
Contributors to this month’s digest: Lorenzo Pacchiardi, Nando Martínez-Plumed, Jose H. Orallo, Irene Testini, Lexin Zhou, Peter Romero, Joseph Castellano.
News to share? Feel free to reach out to ai.evaluation.newsletter@gmail.com
Getting the digest: Once a month if you join.