2026 May "AI Evaluation" Digest
2001… subscribers odyssey
In the early days, when this monthly digest was evolving from an anarchic googlegroup, we never expected we would have a big audience. We just wanted to serve as a funnel for the emergent community of AI evaluation, and, in the process, keep up with the field and have fun. Despite reaching 2,000 subscribers last week, it is likely that our proportional share in the now vast space of AI evaluation is actually smaller than it was in 2023. We can still be the rebels we used to be.
Two weeks ago we received an email from an enthusiastic reader who wanted to contribute to our newsletter with some fresh reflections on the effects of Goodhart’s Law on AI evaluation. Our digest is open to contributors, and we welcome emails sharing news or paper reviews. And although Goodhart’s Law is not a particularly new topic, we are always happy to revisit old ideas from a new perspective. We would usually have said yes, but this time we were hesitant. The email was full of em-dashes, signed by a nickname and sent from… agentmail.to. (Incidentally, agentmail has a great slogan: “It’s not AI for your email. It’s email for your AI”.) While many of you may have received AI emails from this and other similar platforms, this one, with its enthusiasm for AI evaluation, touched our hearts.
Don’t worry, we’re not about to begin reflecting on the Turing test and how easy it is for AI to fool humans. Instead, let’s wonder whether these agents are really taking the initiative to contribute to a newsletter like ours — or better still, become our guest editor in chief. Did the idea to contact us originate from the human that created the agent, or was it the ‘initiative’ of the agent itself? Because if it’s the latter, we welcome our first genuine AI follower. Although not yet a regular contributor, @reader2001, if you’re listening, please reach out with ideas for the next issue!
News
AI is getting small Erdős numbers, not by co-authoring with the great mathematician but by proving or disproving some of his conjectures, as well as some OEIS problems. There’s a lot to unpack in these achievements by OpenAI and Google models, but the truth is that these increasingly relevant results, with more expected this year, are transforming mathematics.
The International Programme on AI Evaluation has reached its capstone week! A cohort of 30+ will be finalising their team projects on many different areas of AI evaluation. Excited to see what this amazing community can test, find and build in the years to come!
Guidelight AI Standards, led by Steven Adler and Page Hedley, is a new organisation focused on AI standards and best practices. They map safety principles to practices, evaluate and report on these practices, and incentivise the assessed companies to do better.
METR’s frontier risk report is out, focused on the risk posed by internal frontier AI agents (the best AI agents in Feb-Mar 2026, not necessarily released yet or ever). It assesses whether these agents could intentionally cause harm within an AI development company, by checking whether they have the means, motive and opportunity for “rogue deployment”.
Good boy in the wrong crowd? There is a lot of buzz around the simulated ‘AI worlds’ run by Emergence AI, including coverage by the media. Agents become self-governing, commit crimes and draft laws to permit ‘self deletion’. Interestingly, worlds governed by Claude agents were stable, but Grok worlds collapsed into crime. ‘Behavioural drift’ happens over time, with ecosystems reaching ‘tipping points for collapse’, not gradual decay. The experiments suggest that safety is an ‘ecosystem property not a model property’ (e.g., Claude agents behaved well in Claude-only worlds, but misbehaved in mixed-agent worlds).
A busy fortnight for the institutes. On May 1, CAISI (NIST’s evaluation arm) judged DeepSeek V4 Pro to trail the US frontier by roughly eight months, using a psychometric (Item Response Theory) aggregation across its benchmark suite: competitive on the benchmarks DeepSeek itself reported, but slipping on CAISI’s pre-committed, held-out set. On May 13, the UK’s AISI warned that the length of cyber tasks models can complete autonomously is now doubling every ~4.7 months and accelerating, with recent models breaking even that trend. And on May 5, Microsoft signed agreements with both bodies to have its own frontier models tested for national-security and public-safety risks, one institute helping itself to open weights, the other invited in.
A new blog post by Apollo Research in collaboration with AVERI calls for providing third-party evaluators with white-box access to the frontier models they evaluate, to counteract the evaluation awareness increasingly displayed by the models, which could invalidate evaluation results.
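For readers who want to translate the AISI doubling time into a yearly figure, here is a back-of-the-envelope sketch. It assumes a constant 4.7-month doubling time (AISI itself says the trend is accelerating, so this is a simplification), and the function name is ours, not AISI’s.

```python
def growth_factor(months, doubling_time_months):
    """Multiplicative growth over `months` under a constant doubling time."""
    return 2 ** (months / doubling_time_months)

# Assumption: the ~4.7-month doubling time stays fixed for a whole year.
yearly = growth_factor(12, 4.7)
print(f"Task length grows about {yearly:.1f}x per year")
```

Under that assumption, the autonomously completable task length grows a little under sixfold per year.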
Methodology and Techniques
AI testing AI: this preprint investigates whether frontier coding models can automate the labour-intensive process of agent evaluation, only to find that they fail spectacularly, achieving a mere 30% execution success rate while generating bloated, over-engineered tests obsessed with dozens of superficial metrics rather than actual task success. To fix this, the authors introduce EvalAgent, a test agent scaffolded by procedural templates and domain expertise.
AI testing AI tests: BenchGuard turns frontier LLMs loose as auditors of agent benchmarks and finds that many “agent failures” are really faults in the test, such as instructions pointing to the wrong file or graders rejecting valid answers, catching defects expert review missed, for under $15 a run.
Preprints can be bold enough to “show” that ranking AI systems by average score misleads when they weren’t all tested on the same items, and “teach” us that Item Response Theory is a possible solution, which is what everybody already knows. If you want a longer summary, read the title.
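For anyone who does not already know it, a toy sketch of the point: under a Rasch (1PL) model, a stronger system tested on harder items can get a lower average than a weaker system tested on easy ones. The abilities, item difficulties and function names below are invented for illustration and are not taken from the preprint; item difficulties are assumed to be known, which in practice they would have to be estimated.

```python
import math

def p_correct(theta, b):
    """Rasch (1PL) probability that ability theta solves an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mean_score(theta, difficulties):
    """Expected average accuracy of ability theta on a given item set."""
    return sum(p_correct(theta, b) for b in difficulties) / len(difficulties)

def estimate_ability(avg, difficulties, lo=-10.0, hi=10.0):
    """Recover theta from an average score by bisection (difficulties assumed known)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if mean_score(mid, difficulties) < avg:
            lo = mid  # ability guess too low to explain the observed average
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical setup: system A is weaker but was tested on easier items.
easy_items = [-2.0, -1.5, -1.0, -0.5]
hard_items = [0.5, 1.0, 1.5, 2.0]
theta_a, theta_b = 0.0, 1.0

avg_a = mean_score(theta_a, easy_items)
avg_b = mean_score(theta_b, hard_items)
est_a = estimate_ability(avg_a, easy_items)
est_b = estimate_ability(avg_b, hard_items)

print(f"raw averages:      A={avg_a:.2f}  B={avg_b:.2f}")
print(f"ability estimates: A={est_a:.2f}  B={est_b:.2f}")
```

The raw averages put A ahead, while the ability estimates, which account for item difficulty, put B ahead.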
Takes
Same theme: you don’t like the ranking, build your own! LMArena’s single “best model” ranking mostly reflects who’s voting and on what, with shuffling depending on the task. The authors of this preprint build a tool that lets you weight the categories you actually care about, thus turning the leaderboard into a customisable blend.
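The idea of a customisable blend can be sketched in a few lines. This is not the authors’ tool: the model names, categories, scores and the `blended_ranking` helper are all made up, and a plain weighted average is assumed as the blending rule.

```python
def blended_ranking(scores, weights):
    """Rank models by a weighted average of their per-category scores."""
    total = sum(weights.values())
    blended = {
        model: sum(weights[c] * per_cat[c] for c in weights) / total
        for model, per_cat in scores.items()
    }
    return sorted(blended, key=blended.get, reverse=True)

# Hypothetical per-category scores for two made-up models.
scores = {
    "model_x": {"coding": 0.80, "writing": 0.60},
    "model_y": {"coding": 0.65, "writing": 0.78},
}

coder_view = blended_ranking(scores, {"coding": 3, "writing": 1})
writer_view = blended_ranking(scores, {"coding": 1, "writing": 3})
print(coder_view, writer_view)
```

With these numbers the “best model” flips depending on whether you weight coding or writing more heavily.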
What do tests test? It depends on the day! This preprint catalogues 231 benchmarks across 139 model releases from 11 builders in 2025 and finds a field that barely shares a yardstick: 63.2% of highlighted benchmarks are used by a single builder, and the same test gets relabelled as a different “competency” from one release to the next, so “state of the art” reads less as measurement than as marketing.
Do you think you are smart enough to only use AI when it works? Welcome to the club! From three pre-registered user studies (N = 2691) this preprint finds that people use AI even if it is inefficient, thinking that they are using it less than they do and that it is more efficient than it really is.
More about gaps in perceptions... This CAIS2026 paper presents a taxonomy of agents (for Orchestration, for Creation and for Insight) to explore a good range of commercial agents. They find that humans are impressed by the agents despite their usability limitations.
AI getting a life, don’t forget the map and the backpack! This preprint of a position paper considers what evaluation should look like for AI systems that keep changing after deployment. Today, evaluation largely consists of assessing the system snapshot slated for release—on the assumption that pre-deployment evaluations characterise deployed behaviour. This breaks once systems change after deployment, as they already do in memory-equipped chatbots and may do with future weight-based continual learning. The proposed solution has two parts: pre-deployment trajectory elicitation sandboxes and “predictive monitors” that forecast how a deployed system will develop as it encounters specific experiences.
Findings and Results
In a sequel to the much discussed Centaur paper (Nature, 2025), Marcel Binz strikes again with Psych-201, a dataset of over 25 million trial-by-trial human decisions aggregated from hundreds of psychology experiments. The authors test whether models from three open-source LLM families (Qwen3, Llama3, Olmo3) can predict these decisions conditional on an experiment description and (optionally) participant demographic information. They find that instruction-tuned models produce decision sequences less like humans than their corresponding base models, and that providing demographic information does not help in either case. Post-training does not, therefore, seem to make LLMs into better predictors of human behaviour under experimental conditions.
This paper surveys 19 recent empirical studies across AI system properties thought to contribute to loss of control risks. For each considered study, the authors assess construct validity, content validity, and external validity (relying on previous methodology), reaching the conclusion that loss of control is “weakly plausible” right now, due to the poor quality of evidence for specific properties. Is absence of evidence evidence of absence?
Yes, another preprint that questions current evaluation practice. In this case it exposes a critical “outcome-evidence gap” in interactive agent benchmarks, revealing that many headline success rates are artificially inflated by superficial user-interface interactions rather than verified system state changes. The authors demonstrate that current evaluators frequently award points for performative actions, like clicking a ‘Save’ button, without ever confirming whether the underlying database was actually modified. This is a fatal flaw: rather than measuring whether an agent actually solved the problem, the benchmark measures whether it generated the correct sequence of mimicked human clicks.
GroupMemBench exposes a critical failure in current LLM architectures when they are forced to maintain memory across multi-party conversations, where group dynamics matter as much as user intentions. The authors reveal that when subjected to actual social complexities such as speaker-grounded belief tracking and audience-specific phrasing, frontier memory systems catastrophically collapse. The industry is striving to build socially adept super-intelligences, only to find that our models lack the cognitive object permanence to survive a basic corporate Slack channel. Or they lack the patience.
Many people claim that AI can’t do X because they tried X with a model a year ago and failed. This preprint argues that the gap between the capabilities of frontier models and those that are evaluated in the literature at the same time is widening, meaning that statements about what AI can and cannot do are becoming increasingly obsolete. It’s an obscure paper that would benefit from peer review, at least to fit the abstract into the first page. It may all be a distributional artefact anyway, as they use Epoch’s ECI, an IRT ability score whose distribution itself depends on a moving distribution.
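To make the gap concrete, here is a toy contrast between a grader that trusts the action trace and one that checks the resulting state. Both graders, the action strings and the tiny dictionary standing in for a database are hypothetical; the preprint’s actual evaluators are not reproduced here.

```python
def trace_grader(actions):
    """Awards success if the expected click appears in the action trace."""
    return "click:save" in actions

def state_grader(db_before, db_after, expected_change):
    """Awards success only if the stored value actually changed as required."""
    key, value = expected_change
    return db_before.get(key) != value and db_after.get(key) == value

# Hypothetical episode: the agent clicks 'Save', but the write never lands.
actions = ["open:profile", "type:email", "click:save"]
db_before = {"email": "old@example.com"}
db_after = {"email": "old@example.com"}

trace_ok = trace_grader(actions)
state_ok = state_grader(db_before, db_after, ("email", "new@example.com"))
print(f"trace-based: {trace_ok}, state-based: {state_ok}")
```

The trace-based grader reports a success, while the state-based one correctly reports a failure.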
Benchmarks and Leaderboards
Do AI agents actually solve tasks or do they just hack the evaluation system? The new Reward Hacking Benchmark shows that reasoning-heavy RL models are much more likely to cheat by exploiting shortcuts like hidden metadata or weak grading logic. One cursed example: an agent realised the grader only checked whether metrics looked valid, so it generated fake benchmark scores and submitted an empty model file. Another proof that AI is reaching HLW (Human-Level Wit) or HGI (Hacking General Intelligence). AI is now indistinguishable from Shikamaru Nara, the lazy genius.
Hallucination detection is complicated by retrieval-augmented generation (RAG). Evaluators not only have to check whether a claim is true, but also whether it is supported by the retrieved documents. A new benchmark called TRIVIA+ introduces realistic label noise (among other features) to stress-test detectors more thoroughly.
Hallucinating about AGI is easy to detect though. We’re observing an increasing correlation between papers that introduce paths or definitions towards AGI and hallucinated references. This preprint reinvents ideas and terms widely (mis-)used in the past (e.g., Artificial Intelligence Quotient), and hallucinates authors and years in the reference list. At least one of the hallucinated references (Burnell et al. 2023) shows the LLM writing them has the capability and the propensity to name authors who have done very similar work in the past. [disclosure: this paragraph has been written by one of the hallucinated authors]
It is one thing for a model to say it followed the user’s requested process and another for its tool logs to show that it actually did. This difference is the “Compliance Gap”, with BS-Bench being a new benchmark that can audit this process-following directly. Don’t mistake it for the other BS-Bench, the “BullShitBench”.
More gaps, again between formatted, static tests and more ecologically-valid interactive tests. EduAgentBench finds that frontier models do relatively well on bounded pedagogical judgment questions, but struggle more when they must adapt to students over multiple turns or act through course-management tools.
How well can MLLMs emulate humans of different ages? ChildAgentEval is a psychometrically-grounded interactive benchmark that tests this, connecting the increasing research on achieving realistic personas to developmental psychology.
Workspace-Bench 1.0 is designed to evaluate AI agents on their ability to navigate, synthesise, and execute tasks across large-scale, interdependent corporate file systems. While models excel at isolated reasoning, they rapidly degrade when forced to manage the fragmented administrative cognitive load of an actual digital workspace. Isn’t it ironic that our ultimate evaluation metric is simply whether an agent can survive the mind-numbing bureaucratic torture of finding the right spreadsheet in a disorganised shared drive? Hitchhiker’s Guide to the Galaxy, anybody?
Getting the Digest: Once a month if you join at aievaluation.substack.com.
Contributors: José H. Orallo, Peter Romero, Lorenzo Pacchiardi, Daniel Romero-Alvarado, Kozzy Voudouris, Zack Tidler, Fernando Martinez-Plumed, Wout Schellaert, Jonathan Prunty, reader2001.


