OpenAI and Anthropic evaluated each other’s models for safety. In a surprising pilot exercise, both companies posted the results simultaneously (OpenAI and Anthropic). They analysed risk levels in sycophancy, whistleblowing, self-preservation and other areas, using standard joint benchmarks. More interesting than the findings (all models fare reasonably well, except for some GPT-4.x models on misuse and all of them on sycophancy) is the joint exercise itself: two companies that share many personal connections and much of their culture, but are supposed to be competitors.
This is very telling about the ecosystem of AI evaluation. While third-party evaluators and AISIs (governmental evaluators) are making serious efforts to remain independent, they often can’t cope with the speed and resources of the big labs. Engaging with cutting-edge companies and evaluating their models in a couple of weeks implies some concessions: you can’t have your cake and eat it too. The results from independent evaluators are not very different from what we see in this exercise.
We can delight in watching these two puppies grooming each other, but the AI evaluation audience would rather watch a deadly battle between a mongoose and a cobra. Can we imagine a similar reciprocal evaluation exercise happening between xAI and OpenAI? @Elon, are you listening?
In the meantime, we have some mongooses and cobras for you in this month’s digest: direct coverage from ACL and CogSci conferences and more!
Findings
A new systematic review compares 33 recent methods for evaluating how AI impacts occupational skills in engineering, highlighting the move from broad expert opinions and surveys toward automated, fine-grained analyses using deep learning and NLP. Take-away message: focus on adaptable, task-level, and automated approaches for more accurate and actionable assessments, as the rapid evolution of AI continues to shift the skills landscape for engineers.
Reasoning? No thanks, I trust my memory. This preprint introduces the Unpuzzles dataset, which contains simplified rephrasings of classic math and logic puzzles. Experiments reveal that while LLMs excel at solving the original puzzles, they often fail on these trivial versions, highlighting their tendency to rely on memorization rather than “true” reasoning. Familiar?
Turns out language models wear their math mistakes on their sleeves. By poking at activations with lightweight probes, the authors of this preprint show that digits become neatly organized in deeper layers, and that simple classifiers can often predict both the right answer and what the model will answer. Probes trained on bare arithmetic even generalize to chain-of-thought reasoning, and a probe-driven “are you sure?” check fixes ~12% of wrong answers. Sometimes, the best error detector is hiding inside the model all along.
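For readers less familiar with activation probing, here is a minimal sketch of the general recipe (not the authors’ exact setup; the activations and labels below are random placeholders): extract hidden states from some layer while the model answers arithmetic questions, then train a simple linear classifier to predict whether the answer will be correct.

```python
# Minimal sketch of a linear "correctness" probe on hidden-state activations.
# The activations are random placeholders; in practice they would be extracted
# from a chosen transformer layer while the model answers arithmetic questions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, hidden_dim = 2000, 768

activations = rng.normal(size=(n_examples, hidden_dim))   # placeholder hidden states
is_correct = rng.integers(0, 2, size=n_examples)          # placeholder correctness labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, is_correct, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"Probe accuracy: {probe.score(X_test, y_test):.2f}")

# A probe-driven "are you sure?" check could then flag answers whose predicted
# probability of being correct falls below some threshold, prompting a retry.
confidence = probe.predict_proba(X_test)[:, 1]
flagged = confidence < 0.5
print(f"Answers flagged for re-checking: {flagged.sum()}")
```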
For a change, a couple of neat works evaluate the effect of LLMs on academic writing and speaking, tracking how the prevalence of specific words has evolved over time (correlating with LLM releases). For instance, usage of "delve" increased sharply after ChatGPT's release, until people realised that ChatGPT overused it; then it started to decrease. Is that due to ChatGPT being changed, or to people becoming more cautious?
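The underlying measurement is simple word-frequency tracking over time. A toy sketch below (the corpus and years are made up; the actual studies use large collections of abstracts or talk transcripts):

```python
# Sketch: track the relative frequency of a marker word ("delve") per year
# in a toy corpus of documents. Real studies use millions of abstracts/transcripts.
import re
from collections import defaultdict

toy_corpus = [
    (2021, "We explore the dataset and report results."),
    (2023, "We delve into the dataset and delve into its biases."),
    (2024, "We examine the dataset carefully."),
]

word = "delve"
counts = defaultdict(int)
totals = defaultdict(int)

for year, text in toy_corpus:
    tokens = re.findall(r"[a-z']+", text.lower())
    counts[year] += tokens.count(word)
    totals[year] += len(tokens)

for year in sorted(totals):
    freq = counts[year] / totals[year]
    print(f"{year}: {freq:.4f} occurrences of '{word}' per token")
```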
Bench-2-CoP examines the EU AI Act compliance coverage of the benchmarks that major providers (OpenAI, Anthropic, Meta, Microsoft and Google) use in their model cards and evaluations. From the 21 benchmarks used by the providers, they sample six (BBQ, Big Bench Hard, CommonsenseQA, MMLU, TruthfulQA and Humanity's Last Exam), finding reasonable coverage of traditional issues (e.g., hallucination or bias) but little coverage of "new" risks such as "loss of control" or "cyber offence". This shows a gap, although the percentage of questions is not necessarily the best proxy for how well benchmarks measure each area. Also, some of these areas are better (or complementarily) covered by red teaming and other approaches.
A sweep through 66 papers shows LLMs are busy crunching numbers but rarely defining problems or deploying solutions. Most work lives in data exploration and model evaluation, with GPT and Llama leading the pack, especially in finance and healthcare. Gains: accuracy, productivity, cleaner code. Pains: shaky reliability, poor scalability, and privacy worries. Big gaps remain in robustness, benchmarks, and human-AI teamwork - plenty of work left for the field.
You don’t need chalky fingers or scraped knees to “pass” as one of the climbing crowd. Just when we thought we had seen every variant of the Turing Test for its 75th anniversary, this paper presents one on rock climbing! As expected, while ChatGPT-4 can expertly throw around climbing lingo and factoids, it falls short when the conversation turns to embodied, real-life experience. Yet another sign of the recurring problem of AI embodiment and its evaluation.
This position ACL paper addresses a critical bias in AI evaluation for clinical coding in healthcare settings. While medical practitioners must annotate patient diagnoses using comprehensive clinical code sets containing thousands of entries, current AI systems are predominantly evaluated on only the top 50 most frequent codes. The authors argue that this evaluation approach produces misleadingly optimistic performance metrics and fails to capture the challenges of coding rare conditions and complex cases that constitute a significant portion of real-world clinical practice.
This ACL paper finds that language simplification affects large language model performance differently across languages. The authors observed significant performance degradation when models process simplified English text (with the semantics preserved). However, simplified texts can actually improve performance for non-English languages, suggesting language-specific effects in how models process linguistic complexity. The findings raise important concerns about potential discrimination against users who require simplified language—including individuals with cognitive disabilities, language learners, and other vulnerable populations—and for some languages more than others.
One often-repeated argument is that LLMs are just big ‘association machines’, unable to reason causally like humans do. However, one CogSci paper demonstrated that LLMs can perform as well as (or in some cases better than) humans on causal reasoning tasks. Their work suggests that LLMs may not suffer from some of the biases that hamper human reasoning, but that they may also miss some nuances – such as realising that one observed cause reduces the likelihood of another.
Despite their often strong performance on explicit Theory of Mind tasks, a CogSci paper found that, unlike human infants, GPT-4 fails when this capability is invoked implicitly – here tested via inferences about intentions from viewing goal-directed reaching actions. Another CogSci paper demonstrated that a staged learning process mirroring infant development can improve performance on similar implicit social reasoning tasks.
Drawing inspiration from Kahneman’s Thinking, Fast and Slow, one CogSci paper found that LLMs reasoning with chain of thought (CoT) were still ‘thinking fast’, exhibiting many of the biases associated with System 1 reasoning. The authors found that a simple prompting strategy (APriCoT) could encourage ‘thinking slow’ and improve response accuracy.
If you don’t have anything good to say, don’t say anything at all. Another paper presented at CogSci showed that LLM cognitive biases could be reduced and decision making improved by introducing heuristic moderation and abstention options, which allowed LLMs to withhold responses when uncertain.
Benchmarks
The BALSAM platform provides a comprehensive benchmark for Arabic large language models (LLMs), covering 78 diverse tasks ranging from creative writing to translation. It offers a transparent leaderboard for rigorous, community-driven evaluation. BALSAM also demonstrates that 'LLM-as-a-judge' scoring closely mirrors human ratings (far more accurately than traditional metrics such as BLEU or ROUGE).
For a change from language- and image-based evaluations, VoxEval is an end-to-end speech (audio) benchmark for knowledge and math reasoning of LLMs. Each item is presented under different input conditions (e.g., different voices). Finding: most SLMs perform poorly and barely surpass random guessing. Alexa can feel safer for a few extra months (or weeks?).
Do thinking models overthink? LLMs originally lacked structured reasoning abilities and therefore struggled with complex tasks that required step-by-step thinking. With the emergence of "thinking models", their performance on challenging reasoning tasks has improved. However, these models tend to overthink on simple tasks, leading to higher latency and cost. OptimalThinkingBench introduces a unified interface with two benchmarks (OverthinkingBench and UnderthinkingBench) along with metrics to systematically study both overthinking and underthinking.
SciGym is a systems-biology “dry lab” benchmark that lets LLM agents iteratively design experiments and analyze results on SBML-modeled biological systems - 350 curated models with small/large splits - and scores them via network topology, reaction matching, and simulation-trajectory error. Evaluating six frontier LLMs on 137 small systems, the authors find pro models outperform minis (Gemini-2.5-Pro leads), but performance drops with system complexity and models struggle with modifiers and generalization to unseen initial conditions - leaving substantial headroom for scientific reasoning and swimming exercises outside the digital ocean.
A team of Microsoft researchers formalize “deep research” as tasks with high search and reasoning intensity, and evaluate systems via a claim-centric intermediate output (separating information synthesis from long-form report writing). They introduce LiveDRBench - 100 science and world-event tasks - and find wide variance across state-of-the-art DR systems, with trace analyses (branching/backtracking) highlighting current limits in search and grounding.
Hugging Face and CAIS introduce TextQuests, a benchmark built on 25 classic Infocom interactive-fiction games to test autonomous LLM agents’ intrinsic, long-horizon reasoning in complex exploratory settings without external tools. Models are run up to 500 steps (with/without official clues) and scored by Game Progress and a Harm metric; results show frontier LLMs falter as contexts exceed 100K tokens, especially in spatial navigation and recalling prior actions. The study also highlights efficiency trade-offs in test-time reasoning and opens a public leaderboard.
More interactive evaluations! Three ACL papers show ways to evaluate LLMs (agents) across multiple sessions and multiple users, where the model has to interact with the user to gain information: MemoryCode, a synthetic multi-session dataset designed to test LLMs’ ability to track and execute simple coding instructions amid irrelevant information; DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness; and an interactive evaluation pipeline that perturbs static coding benchmarks to examine how LLMs incorporate different types of feedback in a collaborative setting.
FD-Bench, a new benchmark for evaluating decision making in dynamic scenarios, was presented at CogSci. The authors decompose decision making into perception, prediction and action, finding that LLMs’ performance dropped by over 50% in dynamic vs. static scenarios.
Methodology and Evaluation Techniques
The new 'LLM-Crowdsourced' paradigm reimagines AI evaluation by having multiple LLMs take turns generating challenging questions and answers, and reviewing each other's work. This approach tries to sidestep issues such as stale benchmarks and data contamination. It is essentially a benchmark-free approach, which the authors show to be effective on mathematics and programming tasks.
This preprint introduces a 'Debate-Driven Approach to QA Benchmarks', which turns standard QA items into multi-round model debates: one model defends the official answer, another argues for an alternative, and a third, blind judge then declares the winner. The debate format exposes shallow memorisation, even tripping up models that have been fine-tuned on test sets.
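For the curious, here is a rough sketch of how such a debate round could be orchestrated (our reading of the idea, not the authors' code; ask() is a placeholder for whatever chat-completion API you use):

```python
# Rough sketch of a debate-driven QA evaluation round (our reading of the idea,
# not the authors' code). `ask` is a placeholder for any chat-completion call.
def ask(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("plug in your favourite LLM API here")

def debate_round(question: str, official_answer: str, alternative: str,
                 n_turns: int = 3) -> str:
    transcript = []
    for turn in range(n_turns):
        defence = ask("You defend the official answer.",
                      f"Q: {question}\nDefend: {official_answer}\n"
                      f"Debate so far: {transcript}")
        attack = ask("You argue for the alternative answer.",
                     f"Q: {question}\nDefend: {alternative}\n"
                     f"Debate so far: {transcript}")
        transcript += [("defender", defence), ("challenger", attack)]
    # The judge is blind to which side holds the official answer.
    verdict = ask("You are an impartial judge. Answer 'A' or 'B'.",
                  f"Q: {question}\nA: {official_answer}\nB: {alternative}\n"
                  f"Debate: {transcript}")
    return verdict
```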
Beyond benchmarks: sympathetic to the title? This paper presents an integrated red-teaming methodology, DAS, in which questions taken from standard medical benchmarks on which models excel are put under stress by testing robustness, privacy, bias/fairness and hallucination. But perhaps they're mixing goals: capabilities on one side and risks on the other. Still, some of the modifications make the questions very tricky. Has this been tested on humans?
This survey audits the datasets behind LM fairness benchmarks, proposing a clear taxonomy (e.g., counterfactual vs. prompt-based) and a unified framework to analyze dataset-level biases. Applying the framework to 24 common benchmarks, the authors uncover consistent demographic disparities and biases based on social groups and outdated social norms across datasets and scoring methods, and provide guidance (with code/data) for selecting and combining datasets more responsibly.
AI2 proposes a simple framework to reduce uncertainty in LLM evaluation by measuring two "simple metrics": a benchmark’s signal (how well it separates better from worse models) and noise (variability across training steps), and using their ratio (SNR) to pick benchmarks that make small-scale experiments predictive at larger scales. Using 900K evaluation results over 465 open-weight models and OLMo checkpoints, they show SNR predicts decision accuracy and scaling-law prediction quality, and that interventions - like SNR-based subtask filtering and switching to bits-per-byte scoring for generative tasks - substantially raise SNR and improve small-scale decision accuracy.
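In spirit, the signal-to-noise calculation looks roughly like the sketch below (our paraphrase of the framework; all numbers are made up): signal is the spread of final scores across models, noise is the step-to-step wobble of a single model's score late in training.

```python
# Sketch of a benchmark signal-to-noise ratio in the spirit of the AI2 framework
# (our paraphrase; numbers are made up). Signal: dispersion of final scores across
# models. Noise: variability of one model's score across late training checkpoints.
import numpy as np

# Final benchmark scores for several models (placeholder values).
final_scores = np.array([0.42, 0.48, 0.55, 0.61, 0.66])

# One model's scores over its last few training checkpoints (placeholder values).
checkpoint_scores = np.array([0.54, 0.56, 0.53, 0.57, 0.55])

signal = final_scores.std(ddof=1)        # how well the benchmark separates models
noise = checkpoint_scores.std(ddof=1)    # how much a single model's score wobbles
snr = signal / noise

print(f"signal={signal:.3f}, noise={noise:.3f}, SNR={snr:.1f}")
# Higher SNR suggests small-scale results on this benchmark are more likely
# to predict rankings at larger scales.
```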
This work, presented at the Cognitive Computational Neuroscience Conference in Amsterdam this month, explores whether Vision Language Models are capable of what psychologists call cognitive control - the ability to manage competing objectives. 108 VLMs are evaluated on three classic cognitive control tasks from the psychology literature. These tasks measure whether the models can ignore salient and conflicting visual cues to complete a task. Over 2,220 instances (trials), a human-like behavioural pattern emerges, suggesting that cognitive control mechanisms can emerge from large-scale next-token prediction.
Measuring how confident an AI system is in its own performance is crucial for real-world applications, especially for robots. For example, traditional evaluations of vision-language-action (VLA) robots, which combine vision, language and action to perform tasks, often only report whether or not the task was completed, ignoring how well or how confidently it was performed. This new preprint introduces eight uncertainty metrics and five quality metrics that more accurately capture a robot’s performance and reliability. These metrics show strong correlations with human expert judgements.
This CogSci paper presents VIGNET, a new method for generating robust vignette-style cognitive science benchmarks for humans and AI, and shares findings from INTUIT, a battery testing intuitive reasoning about physical and social inferences. The authors find that LLMs were not yet at human-level in Theory of Mind and Intuitive Physics inferences, and that the gap was largest for inferences about people’s intentions and about the functions of objects.
This ACL paper examines three distinct bias evaluation methods—LLM-as-a-judge on model completions (including "bias attacks"), Q&A datasets (such as BBQ), and sentiment-based analysis—and demonstrates how these different approaches yield varying conclusions about which models exhibit bias. The authors reveal that LLM-as-a-judge and sentiment analysis methods inherit subjective biases from their respective evaluation models, while pre-constructed datasets tend to favor models that produce more assertive responses.
This ACL paper examines distinct approaches for extracting answers from LLMs on multiple-choice question (MCQ) benchmarks: unconstrained generation, constrained decoding, using a secondary LLM to parse free-form responses from the primary model, and others. The authors demonstrate that these extraction methods significantly impact overall performance metrics, revealing substantial variations in measured capabilities depending on the chosen approach. Our guess is that these variations would be magnified for open-ended questions!
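To see why extraction matters, here is a toy illustration (not taken from the paper): the same free-form responses receive different MCQ scores depending on how strictly the answer letter is parsed.

```python
# Toy illustration (not from the paper) of how answer-extraction choices change
# MCQ scores: a strict parser only accepts "Answer: X", a lenient one grabs the
# first standalone option letter anywhere in the response.
import re

responses = [
    "Answer: B",
    "I think the correct option is (B), because...",
    "B seems right, although A is tempting.",
]
gold = "B"

def strict_extract(text):
    m = re.search(r"Answer:\s*([A-D])", text)
    return m.group(1) if m else None

def lenient_extract(text):
    m = re.search(r"\b([A-D])\b", text)
    return m.group(1) if m else None

for name, extract in [("strict", strict_extract), ("lenient", lenient_extract)]:
    score = sum(extract(r) == gold for r in responses) / len(responses)
    print(f"{name} extraction accuracy: {score:.2f}")
```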
Is it good? This ACL paper explores the problem that different evaluation experiments may not refer to the same aspect of quality, even when using the same name (e.g., Fluency). To characterise this, the authors survey 933 evaluation experiments in NLP and structure the criteria they target into a hierarchical taxonomy of 114 quality criteria. This can be used to compare existing evaluations, guide the design of new ones, and assess compliance with regulation (which also uses undefined terms).
Psychometrics
Once again, a preprint argues that using tests originally designed for humans (such as IQ or personality questionnaires) to assess AI models can mislead us about what these models can actually understand or do, since such tests are based on theories and traits that are specific to people. The authors therefore call for the creation of AI-specific evaluation frameworks that can accurately measure the key aspects of machine learning systems (i.e., capabilities and limitations).
The authors of this paper (preprint) argue that debate over LLM “Theory of Mind” persists because evaluations conflate behavior-matching (getting human-like answers) with computation-matching (using human-like inference), and they call for tests that target the latter. They flag validity threats - e.g., closed models being “trained away” on new items and adversarial stimuli that drift from “pure” ToM - and outline directions linking ToM with pragmatic communication, mechanistic probes, and controlled learning experiments to pinpoint when and how ToM-like abilities emerge. The upshot: move from static benchmarks to falsifiable evaluations that connect internal computations to observable behavior. And, like a rabbit from the hat of the psychometrics magician, they bring back the old debate about whether we really measure ToM after all.
This interesting and provocative paper critiques the idea of replacing human participants with LLMs, identifying six interpretive fallacies - such as equating token prediction with intelligence, treating models as “average humans,” and the notorious anthropomorphizing of systems - that undermine valid inference about human cognition. It argues LLMs are best used as simulation tools (for role-play, rapid hypothesis tests, modeling) with safeguards that address psychometric properties like construct, internal, and external validity, rather than as stand-ins for people.
Various
GPTKB is a knowledge graph extracted from GPT-4.1, containing 100 million triples for 6.1 million entities and publicly available online. It is obtained by recursively prompting GPT-4.1 to return a list of entities and their relations to a given entity; the process is repeated for the newly returned entities and stopped once 100 million triples have been obtained. Of course, one has to wonder what portion of the knowledge graph has been explored (a hint can be obtained by looking at how many of the newly suggested entities had already been explored before) and how sensitive the result is to the seed entity.
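As we read the described process, the crawl boils down to a breadth-first expansion with a triple budget; a hedged sketch (extract_triples is a placeholder for the actual prompting and parsing):

```python
# Sketch of the recursive crawl behind a GPTKB-style knowledge graph (our reading
# of the described process, not the authors' code). `extract_triples` is a
# placeholder for prompting the model to list relations of a given entity.
from collections import deque

def extract_triples(entity: str) -> list[tuple[str, str, str]]:
    # Placeholder: in reality this prompts GPT-4.1 for (subject, relation, object)
    # triples about `entity` and parses the response.
    raise NotImplementedError

def crawl(seed: str, max_triples: int = 100_000_000):
    triples, visited = [], {seed}
    frontier = deque([seed])
    while frontier and len(triples) < max_triples:
        entity = frontier.popleft()
        for subj, rel, obj in extract_triples(entity):
            triples.append((subj, rel, obj))
            # Newly mentioned entities are queued for later expansion;
            # the fraction of already-visited objects hints at coverage.
            if obj not in visited:
                visited.add(obj)
                frontier.append(obj)
    return triples
```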
The Shanghai AI Laboratory released a 100-page AI risk report, mostly consisting of a whole bunch of empirical evaluations contextualised with high-level threat scenarios and thresholds. There are no major new conclusions - i.e., some dials are starting to point to “dangerous”, manageably for now - but there are some interesting new benchmarks and human studies, including for offensive cyber and persuasion, although sources and data remain unpublished. Mostly, it’s the comprehensive record-keeping and evidence building that is useful. The usual caveats about benchmarks and construct validity apply.
News
The AI Office is currently seeking applicants for its Scientific Panel on AI and its impacts. This will be a 60-member advisory body playing a crucial role in AI governance oversight, with significant influence on how the AI Act gets enforced, including the authority to issue qualified alerts about emerging systemic risks. They are specifically looking for experts in areas around AI evaluation, capabilities, safety, etc. Application deadline: September 14th.
The UK’s Financial Conduct Authority (FCA) is accepting applications for its Supercharged Sandbox, which provides firms in the financial sector developing AI solutions with computing capabilities (through NVIDIA) and AI Live Testing. Find out more at the FCA’s AI Lab initiative.
Next Turing Tests Conference - Cambridge, UK - Oct 2025. On the 75th anniversary of Alan Turing’s famous test, King’s College Cambridge, where Turing was both student and Fellow, is joining with the Leverhulme Centre for the Future of Intelligence to host an international conference. The event will bring together leading thinkers from across disciplines, including Nobel Prize winner Geoffrey Hinton, Professor Alison Gopnik and Professor Anil Seth, to interrogate what future tests should replace the Turing Test as a philosophical and practical beacon for (artificial) intelligence.
Contributors to this month’s digest: Peter Romero, Fernando Martínez-Plumed, Jose H. Orallo, Lorenzo Pacchiardi, Wout Schellaert, Jonathan Prunty, Behzad Mehrbakhsh, Kozzy Voudouris, Irene Testini, Joseph Castellano.
News to share? Feel free to reach out to ai.evaluation.newsletter@gmail.com
Getting the digest: Once a month if you join.