2025 May "AI Evaluation" Digest
Ethical standards in AI evaluation
Researchers face many pressures that may conflict with basic ethical standards. What are these pressures in AI evaluation, and how can we ensure that researchers do not succumb to them, willingly or not? One example is the difficulty of gaining access to real human-AI interaction behaviour, which is needed to meet the ambition of a broader view of AI evaluation as it is used in the real world. This pressure to present real data seems to be exactly what led to red lines being crossed in a couple of recent cases.
In the first one, covered by Science, LLM-generated text was used to impersonate human identities on Reddit, with the goal of showing that AI content is significantly more persuasive than human-generated content. But Reddit users weren’t informed they were part of an “experiment”. This is not only unethical but also fraudulent according to Reddit’s rules. The study was never published. Lesson learned: conducting experiments “in the wild” is good, but doing them wildly is not.
A second case, published a few months ago but retracted this month, was covered by Nature and the Wall Street Journal, among other major venues (though we at the prestigious AI Evaluation Substack didn’t cover it). The paper, evaluating the impact of an AI tool on materials discovery in a company, “begins with a preprint, as usual”, with results that are too good to be true and come from a rising star at a prestigious institution. Once the fraud is spotted, the institution, MIT, is forced to release a statement declaring it has “no confidence in the veracity of the research contained in the paper”. Yes, but other kinds of confidence have been eroded as a result. Lesson learned: we all contribute to echoing shocking results before papers are properly peer-reviewed, and we should be more careful with preprints. Extraordinary claims may require good old ordinary peer review.
Popcorn rows are not fun
Unethical and other bad practices have always existed, but they are now amplified by social media rows, increasing the perception of mobocracy in AI evaluation. What’s worse, healthy scientific debates too often become personal, which is neither fun nor useful. A recent paper, The Leaderboard Illusion, argues that Chatbot Arena can be easily gamed by testing multiple variants secretly before a final model is released publicly; this favours private providers that can afford to do so. The authors also argue that proprietary, closed models are sampled more frequently than open-source ones. Instead of working to avoid these issues, Chatbot Arena contends that the paper includes “factual errors and misleading statements”, such as the figure for open-source model usage in the arena. This escalated into pushbacks on X (1, 2, 3). And, as social media is all about extremes, other criticisms sprouted up, putting everything into question, “because they try to quantify the unquantifiable”. What is unquantifiable in AI evaluation?
News
The European Commission’s AI Office has announced a forthcoming call for tenders, spotlighting major attention on building new evaluations and improving existing ones. More details will follow, but you can subscribe to updates at the link above.
The EvalEval Coalition — a researcher community evaluating evaluations — has been soft-launched.
UK AISI has published a research agenda that includes challenges in the Science of Evaluation.
RAND has released a compendium of best practices in general-purpose AI evaluation, which should help achieve internal validity, external validity, and reproducibility of GPAI evaluations.
Benchmarks
PERSONAMEM is a benchmark of over 180 simulated multi-session user-LLM conversations across 15 real-world tasks, designed to test the ability of models to internalise and track users' evolving characteristics and preferences. Evaluations show that top LLMs achieve only around 50% overall accuracy.
A new benchmark of symbolically encoded room-mapping puzzles forces LLMs to apply real semantic and deductive logic to unseen problems (i.e., logical reasoning), thereby minimising the chance of mere memorisation. Most models (GPT-3, GPT-4, Llama-3.1, Gemini-1.5, Claude-3.5) performed similarly across all difficulty levels, suggesting algorithmic rather than human-like reasoning. DeepSeek-V3 stood out for its more human-like problem solving behaviour.
An automated chemistry-centric benchmark for multi-hop reasoning in LLMs builds a knowledge graph from chemical literature via named-entity recognition and external data sources to generate challenging multi-step question-answer pairs. The authors’ evaluation of leading LLMs (both with and without perfect context) reveals persistent reasoning errors even under ideal retrieval.
MOTBench evaluates LVLMs using OCR and bilingual translation of real Chinese and English menus, extracting the name, price and unit of each dish. Its fine-grained, item-by-item comparison shows that the automated metrics (BLEU, COMET) closely match human judgements, revealing the strengths and weaknesses of each model.
GoEmotions.v2 and CancerEmo.v2 are two human-annotated stress test sets created by iteratively rephrasing sentences to remove explicit emotion words until a fine-tuned BERT model fails, thus testing whether LMs truly understand emotions or merely exploit surface cues. Their experiments show that all models suffer a sharp drop in performance as lexical signals disappear, although instruction-tuned LLMs such as OPT-IML and ChatGPT outperform smaller, pre-trained models.
A skating benchmark (is this the most specific benchmark ever? ;) combines 3D motion, skeletal and video data with fine-grained technical and artistic annotations. Initial evaluations show that today's AI models struggle to capture the sport's complex mix of jumps, turns, and expressive performance.
CAPTURE evaluates spatial reasoning on VLMs (i.e., to complete and count objects hidden behind occluders) in both real and synthetic patterned images. Experiments on GPT-4o, Intern-VL2, Molmo, and Qwen2VL show that they struggle much more than humans with occluded counting and only improve when given oracle or inpainted cues.
DeepMath-103K provides decontaminated, high-level (level 5-10) maths problems, each paired with a verifiable final answer and three different solution paths (supporting both supervised finetuning and rule-based RL). Models trained on this dataset show significant accuracy gains on challenging benchmarks such as MATH500, AMC23 and AIME.
TimeCausality is a benchmark designed to test the ability of vision-language models to understand irreversible, real-world transformations over time (e.g. ageing, decay, and corrosion). There are still major gaps in the temporal causal reasoning capabilities of open-source VLLMs.
HARDMath2 comprises 211 problems in graduate-level applied mathematics, covering boundary-layer theory, WKB methods, asymptotic integrals, and nonlinear PDEs. It has been crafted and peer-validated by students via an LLM-interactive pipeline. Gemini 2.5 and o3 achieve less than 60% accuracy.
R-Bench is an Olympiad-level, graduate-level reasoning benchmark with 1,094 text-only and 665 multimodal questions across over 100 disciplines in English and Chinese. Evaluation of leading LLMs/MLLMs (e.g. OpenAI o1, GPT-4o) tops out at just ~53% on multimodal items.
XBench is a live benchmark suite aligned to professions that measures the real-world productivity of AI agents in recruitment and influencer marketing using authentic tasks defined by experts. It is updated continuously via IRT-based indexing to chart agent growth and predict tech-market fit.
ARC AGI was meant to saturate only when we achieved AGI. Now it’s saturated, but we don’t have AGI. No worries, we have the brand-new ARC-AGI-2.0 :-) Let’s give the ARC AGI series a warm welcome to the ‘challenge-solve-and-replace’ benchmark club!
OSUNIVERSE is a benchmark introduced to measure the performance of multimodal AI agents on GUI navigation tasks. Agents receive screenshots from the environment and can take actions by controlling the mouse and keyboard. Unlike humans, current AI agents find these tasks challenging, with OpenAI's computer-use model achieving the highest score of 47.8%.
IRT & Capabilities
This survey argues that evaluation is transitioning from task-specific to “capability-based” evaluation (does that sound familiar?); merely stating that a benchmark tests a capability does not guarantee it has the necessary specificity and sensitivity. Interestingly, the authors identify the unbounded nature (in breadth and strength) of LLM capabilities as the core challenge of evaluation, and advocate for adaptive datasets, advanced metrics and close-up study of reasoning traces to address it.
In this preprint, a model's use of its capabilities is detected with mechanistic interpretability techniques, via the Model Utilization Index (MUI): an evaluation metric that measures the proportion of a model's neurons or features activated during inference, offering insight into how efficiently the model uses its capacity.
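To make the idea concrete, here is a minimal sketch (not the paper's exact definition) that treats MUI as the fraction of neurons firing at least once while processing an input; the shapes, activations and threshold below are purely illustrative assumptions:

```python
import torch

def model_utilization_index(activations: torch.Tensor) -> float:
    """Toy MUI: fraction of neurons that fire (non-zero) at least once
    while processing the input.

    `activations` is assumed to have shape (num_tokens, num_neurons), e.g.
    post-ReLU MLP activations of one layer collected over a prompt. The MUI
    in the preprint is defined more carefully (over the neurons/features
    contributing to the task), so treat this as an illustrative proxy only.
    """
    fired = (activations > 0).any(dim=0)   # did each neuron fire on any token?
    return fired.float().mean().item()     # proportion in [0, 1]

# Sparse fake activations standing in for a real forward pass (~90% zeros)
acts = torch.relu(torch.randn(16, 4096) - 1.3)   # 16 tokens, 4096 hidden units
print(f"MUI ≈ {model_utilization_index(acts):.2f}")
```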
Everitt et al. introduce a capability-normalised metric called "goal-directedness" to quantify how LLMs apply their known abilities to a given goal. It basically captures a narrow facet of “task diligence” or goal persistence. Testing on block-stacking tasks that combine noisy measurement, reasoning, and plan execution, they show that even the latest models fall short of full goal-directedness, and that scores are consistent across tasks but only modestly improved by motivational prompts.
PSN-IRT, a 4PL logistic layer in an end-to-end pseudo-Siamese architecture, provides precise estimates (compared to classical MLE/MCMC/VI or simpler neural IRT models) of LLM latent abilities and item parameters (relaxing restrictive normality assumptions), revealing widespread ceiling effects, weak separability and data contamination in popular benchmarks. Item selection via PSN-IRT’s Fisher information produces small, discriminative test suites whose model rankings align closely with human preference.
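For readers who don't have the 4PL model at their fingertips: it gives the probability that a model with ability θ answers item i correctly, with discrimination a_i, difficulty b_i, guessing floor c_i and ceiling d_i (PSN-IRT learns these parameters with its pseudo-Siamese network rather than via MLE/MCMC/VI):

```latex
P(X_i = 1 \mid \theta) \;=\; c_i + \frac{d_i - c_i}{1 + e^{-a_i(\theta - b_i)}}
```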
AGI-Elo treats test cases and AI agents (or humans) as Elo competitors and uses an IRT-style logistic model to infer both item difficulty and agent ability simultaneously. Across six vision, language and action benchmarks, AGI-Elo uncovers long-tailed difficulty distributions and quantifies the number of Elo points that current models must gain to achieve 50%, 90% or 99% mastery confidence.
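In the familiar base-10 Elo parametrisation (the paper's exact scaling may differ), the shared-scale idea boils down to a logistic model of the rating gap between agent and item:

```latex
P(\text{agent } j \text{ solves item } i) \;=\; \frac{1}{1 + 10^{(R_i - R_j)/400}}
```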
IRT analysis of GSM8K, MATH, and MathOdyssey shows that GSM8K can no longer discriminate among leading LLMs, MATH (while currently most informative) risks rapid obsolescence, and model rankings based on these benchmarks prove unstable.
Also in the realm of mathematics, this paper automatically generates hierarchies of capabilities and assigns benchmarks to each of them.
A two-level IRT framework is proposed to jointly estimate test-speech difficulty and ASR system ability, visualising performance through Recognizer Characteristic Curves and “ASR fingerprints” that decompose difficulty into sentence and speaker dimensions.
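A rough sketch of what such a two-level decomposition could look like, under our own assumptions about the parametrisation (a Rasch-style logistic with difficulty split into sentence and speaker terms; the paper's exact formulation may differ):

```latex
P(\text{system } s \text{ recognises utterance } i \text{ by speaker } k)
\;=\; \sigma\!\big(\theta_s - \beta_i - \gamma_k\big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```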
This paper introduces a novel evaluation framework for human-AI collaboration with a complementary suite of metrics (e.g., “information frontier”, which reflects the alignment between AI outputs and users’ working knowledge). The authors demonstrate their methodology in a financial valuation task that mirrors real-world complexity. By decomposing tasks into subtasks and tracking user-LLM interaction, they assess both AI output and user strategies. Their finding: while integrating LLM content generally improves performance, proactive prompting aimed at novelty can backfire by distracting users from relevant subtasks—echoing real-world findings from competency modelling and organizational behaviour.
Evaluation Methods
The JudgeBench suite is a three-stage meta-judging pipeline that uses a human- and GPT-4-refined rubric plus multiple LLM agents to score and filter judgments, automatically weeding out low-quality evaluations. This multi-agent approach boosts judgment accuracy by over 15 percent versus raw LLM outputs and by 8 percent over a single-agent baseline.
This paper compares a wide range of instance-level complexity measures for classification (from costly methods such as PyHard and IRT difficulty to simple cues such as sentence length) and shows that a model's own training loss closely reflects these more elaborate measures.
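A minimal sketch of what checking that claim looks like in practice, with made-up numbers standing in for real per-instance losses and IRT difficulty estimates:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-instance values (illustrative, not from the paper): the
# model's own training loss on each instance vs. an expensive hardness score
# such as an IRT difficulty estimate for the same instances.
training_loss  = np.array([0.12, 1.35, 0.08, 2.10, 0.44, 0.91])
irt_difficulty = np.array([-1.2, 0.9, -1.6, 1.8, -0.3, 0.4])

# The paper's claim, in correlation form: the cheap measure (loss) should
# rank instances similarly to the elaborate one (IRT difficulty).
rho, p = spearmanr(training_loss, irt_difficulty)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```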
“Can a Crow Hatch a Falcon?” explicitly incorporates model lineage relationships (which model descended from which) to predict instance- and benchmark-level performance. It cannot make predictions for unseen examples, but it does much better than simplistic PCA approaches.
An “LLM-as-a-Judge” framework uses LLMs to automatically score AI-generated multi-document patient summaries against the PDSQI-9 rubric, matching human expert reliability (ICC 0.818) in just 22 seconds.
AutoEval is a new system that autonomously runs real-world robot manipulation benchmarks, using learned success detectors and reset policies to run trials around the clock with over 99% less human supervision (and closely matching human evaluations). The authors even provide public AutoEval cells so that researchers can benchmark their general policies in a standardised, reproducible way.
LLM-KG-Bench was used to evaluate 26 open LLMs on RDF and SPARQL engineering tasks, revealing that performance generally scales with model size but often plateaus, allowing smaller models to match larger ones cost-effectively.
This paper proposes modelling classification datasets as proximity graphs, using network centrality metrics (e.g. degree and closeness) as novel measures of instance hardness that capture local density and sparsity. Synthetic tests show these metrics pinpoint hard instances and complement traditional overlap-focused measures with minimal correlation.
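A minimal sketch of the idea, assuming a k-nearest-neighbour proximity graph over instance feature vectors and centrality as an (inverse) hardness proxy; k and the toy data are our own choices, not the paper's:

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

# Toy dataset: two slightly overlapping Gaussian blobs (one class each)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.5, 1.0, (50, 2))])

# Proximity graph: connect each instance to its 5 nearest neighbours
A = kneighbors_graph(X, n_neighbors=5, mode="connectivity")
G = nx.from_scipy_sparse_array(A)

# Centrality as an (inverse) hardness proxy: instances in sparse or boundary
# regions tend to have lower degree/closeness centrality.
degree = nx.degree_centrality(G)
closeness = nx.closeness_centrality(G)

hardest = sorted(G.nodes, key=lambda n: closeness[n])[:5]
print("Lowest-closeness (candidate hard) instances:", hardest)
print("Their degree centrality:", [round(degree[n], 3) for n in hardest])
```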
This preprint re-introduces the concept of 'LLM psychometrics', which involves adapting the methods of psychological measurement (from personality inventories to IRT) to evaluate, validate, and enhance LLMs across constructs such as personality, values, cognitive biases, and theory of mind. It provides a unified framework for test design, prompting, scoring and psychometric validation (yet unfortunately focuses on LLMs only). This is something this open-source book project also tries to accomplish (though from a psychologists’ perspective).
A taxonomy of six functional LLM usage modes (rather than capabilities, as stated in the title) was inferred from O*NET task data and millions of real-world prompts: summarisation, technical assistance, reviewing work, data structuring, generation and information retrieval. Benchmarking against criteria such as coherence, accuracy, clarity, relevance, and efficiency reveals that Gemini is the best performer.
Resources
Raji and Recht's Spring 2025 Machine Learning Evaluation course unpacks how we know machine learning 'works', from topics like cross-validation and calibration to challenges like adaptivity, robustness and dynamic LLM benchmarks.
Contributors to this month’s digest: Nando Martínez-Plumed, Jose H. Orallo, Lexin Zhou, Lorenzo Pacchiardi, Wout Schellaert, Peter Romero, Behzad Mehrbakhsh, Joseph Castellano.
News to share? Feel free to reach out to ai.evaluation.newsletter@gmail.com
Getting the digest: Once a month if you join.


