2026 June "AI Evaluation" Digest

Multiply and die

Jun 26, 2026

While testing agents in simulated environments will soon become the norm prior to real-world deployment, OpenAI has recently shown that environments can often be replaced by conversation traces where recreating the next turn extrapolates to the simulated environments; even noisy datasets like WildChat offer a highly predictable, cost-saving complement to full deployment simulations. This demonstrates that releasing maximum-detail, item-level evaluation data is exactly what the field needs, a position often vindicated, explicitly by Burnell et al. (2024) and effectively by pioneering platforms such as OpenML, whose 10th anniversary we covered in one of our past issues. Without crediting this tradition, a new preprint and its OpenEval benchmark advocate the same item-level evaluation data. The primary challenge in 2026 remains organising increasingly costly and complex agentic evaluations for reuse.

A recent survey exhaustively categorises agent environments across attributes like open/closed-loop, online/offline, observability, determinism, and single/multi-agent modalities. Having all these attributes at the same time creates a series of challenges, starting with the cost of running and validating these environments, perhaps best exemplified by Yann LeCun’s newly launched AMI Labs. Platforms like emergence worlds tackle many of these challenges by providing a multimodal environment linked to live external data (weather, news APIs, internet) with over 120 tools, persistent memory, and consequential democratic mechanisms. In Emergence, agents are multiplied by governance proposals rather than cross-over (e.g., sex), and can die from either energy depletion or votes. Ominous? Yes, many populations, especially the Grok kin, die soon.

News

Evals-Consensus has launched the first consensus guidance following public consultation and a Delphi method, with an accompanying public statement signed by individuals and organisations worldwide. It recommends evaluation practitioners adopt shared practices for constructing, documenting, maintaining, using and reporting the results of evaluations. The signatories to the public statement indicate cross-sector support for guidance that can help ensure the ecosystem has common knowledge and can mitigate avoidable failures.
A crowdsourced benchmark for values shouldn’t be called Humanity’s Last Values. Instead, Microsoft Research Asia has presented an open call for challenging value problems for AI, the Global AI Values Challenge. The challenge is not about thinking up difficult real-world questions involving human values and ethics but about giving an objective justification that works globally. Is pineapple on a pizza morally despicable?
EU AI Act Scientific Panel: The European Commission officially appointed independent experts to its AI Act Scientific Panel. This panel is tasked with grading general-purpose AI capability, setting risk assessment methodologies, and enforcing transparency rules.

Psychometrics of AI

This paper delivers a fatal empirical blow to the growing cottage industry of LLM psychoanalysis, demonstrating that the apparent “personalities” of frontier models are actually 81-90% (vs. 9-16% in humans) driven by arbitrary directional response biases inherited from the instruments used rather than latent psychological traits. Crucially, this does not mean that “psychologically” profiling AI is inherently impossible, but rather that lazily deploying standard human self-report scales against alien architectures guarantees fundamentally polluted data. The authors show that these models are hypersensitive statistical engines politely echoing the structural flaws of our own human-centric tests.
Almost in perfect synchronicity with said fatal empirical blow, psychometrics strikes back with this preprint. It introduces Item Response Scaling Laws (IRSL), a framework that applies psychometric Item Response Theory to drastically reduce the computational cost of estimating neural scaling laws. By mathematically disentangling a model’s latent ability from individual question characteristics, the authors demonstrate that they can reliably forecast benchmark performance while stripping away 99.9% of the actual test questions (although similar results on how many examples are needed for extrapolation already exist).
This preprint bolts the Glicko-2 chess rating system onto IRT, runs a round-robin tournament between classifiers (Random Forest takes the title), and finds that only 1/6 of the OpenML-CC18 “gold standard” is actually difficult while half of it is redundant. The explanation is that the “hardest” datasets are simply broken: the classifier built to get every answer wrong outranks the perfect one, because difficulty there is mostly noise and outliers rather than signal.
Most LLM personality tests let a model agree with every flattering trait at once. This preprint instead makes six frontier models spend a fixed budget across 12 behavioural dimensions, forcing real trade-offs. The profiles come out stable and distinct per model, yet all tilt toward the analytical and abstract, away from relational exchange, looking little like the human executives used for comparison.
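For readers less familiar with the IRT machinery behind the Item Response Scaling Laws item above, here is a minimal sketch of the textbook two-parameter logistic (2PL) model. It is not the preprint's actual method: it only illustrates how separating a latent ability from item parameters lets a few items forecast accuracy on a larger pool. The function names, the bisection fit, and all item parameters and responses are illustrative assumptions.

```python
import math

def p_correct(ability, difficulty, discrimination=1.0):
    """2PL item response function: P(correct | ability, item parameters)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

def estimate_ability(responses, items, lo=-6.0, hi=6.0, steps=60):
    """Maximum-likelihood ability from 0/1 responses on items with known
    (difficulty, discrimination), found by bisection on the score function."""
    def score(theta):
        # Derivative of the 2PL log-likelihood with respect to ability.
        return sum(a * (x - p_correct(theta, b, a))
                   for x, (b, a) in zip(responses, items))
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:   # likelihood still increasing: ability lies higher
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def expected_accuracy(ability, items):
    """Predicted mean accuracy over a set of (difficulty, discrimination) items."""
    return sum(p_correct(ability, b, a) for b, a in items) / len(items)

# Hypothetical, already-calibrated item bank: (difficulty, discrimination) pairs.
full_benchmark = [(-2.0, 1.0), (-1.0, 1.0), (0.0, 1.0),
                  (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
subset = [(-1.0, 1.0), (0.0, 1.0), (1.0, 1.0)]  # the few items actually run
responses = [1, 1, 0]                            # made-up results on the subset

theta = estimate_ability(responses, subset)
forecast = expected_accuracy(theta, full_benchmark)
print(f"estimated ability: {theta:.3f}, forecast accuracy: {forecast:.3f}")
```

The sketch assumes item parameters are already known; in practice they have to be calibrated from many models' responses first, which is where most of the work lies.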

Methodology and Techniques

This Nature Communications paper presents SuperARC, revisiting an old roadmap for AI evaluation based on algorithmic information theory, started in the late 1990s with the C-test (the C is from comprehension; disclaimer: their reference (3) is by the person writing these lines), with many attempts and variations in the past three decades, including recent takes on AIQ. The paper reuses the name ARC from Chollet, but contrary to the original ARC paper, here the authors actually derive the tests from algorithmic information theory (the replacement of AGI with ASI is mostly irrelevant here). The paper uses the coding theorem and some other practical contributions for testing “algorithmic compression”, and revives the discussion about the (non-)equivalence of comprehension and compression. The most surprising result of the paper is the finding that frontier LLMs are not becoming better at these tests, probably because modern RL for chain-of-thought models makes them worse for compression, although the deep reasons for this are not clarified by the paper. Human results on SuperARC would also be telling.
Spectral Ratings (DeepMind) re-frames benchmark scoring as geometry: instead of averaging accuracy (which lets redundant/duplicated prompts inflate scores), it embeds each prompt and normalizes performance by prompt density, so a dense cluster of near-duplicates counts no more than one unique concept. This method gives a provably clone-robust metric and exposes redundancy in real benchmarks. The main limitation is that it just swaps the prompt-distribution bias for embedding-function bias (results inherit whatever the embedding model gets wrong). Interestingly, they found that MMLU contains 740 approximate clones (212 exact), including 78 questions literally shared between “clinical knowledge” and “college medicine,” and one pair is just “price ceiling” vs. “price floor” with otherwise identical wording.
We can now replicate papers by asking an LLM to read a paper, write the code and run the experiments. However, reproducibility requires further generalisation. “Croissant tasks” is a first step towards more general and accessible reproducibility. They present a specification in JSON-LD, where they decouple the problem specification and the evaluation solution. It’s illustrated with MMLU but then applied to five NeurIPS papers conducting evaluations. While it does not codify research questions yet (only metrics), as the limitations and challenges section discusses, we’re looking forward to seeing how adoption and development of Croissant unfold!
We missed this one! Tübingen is a world-class machine learning hub, and if a PhD thesis claims there must be another kind of felicity in machine learning, we can’t look away. This PhD thesis overhauls the evaluation of (statistical) forecasting, by considering that the purpose of forecasting is not truth but felicity, which simply means to serve the purpose of the forecast. While the narrative is a bit circular and ignores some well-known approaches of machine learning where there’s no ground truth, we love the audacity: the author tells ML researchers they have a “truth fetish” problem! Above all, the title of this thesis, with the acronym FFoFF, is now topping our leaderboard of best titles of the year!
This short preprint from RAND summarises the four main risk factors (RF) for open-weight AI models, and discusses “proportional” evaluations (PE). Namely, RF1 (System-level safeguards are removable or non-existent) is addressed by PE1 (Evaluate without system-level safeguards), RF2 (Model-level safeguards are modifiable) is tackled by PE2 (Assess robustness to modifications designed to undo model-level safeguards), RF3 (Dangerous capabilities amplification faces fewer post-release restrictions) is dealt with by PE3 (Assess selective capability amplification via fine-tuning and tool use), and RF4 (Model weights can spread easily and irreversibly) is remediated with PE4 (Proxy worst-case feasible misuse). This is good, although the “proportional” evaluations part is not fully developed, remaining mostly in the name.
Tweaking a model, whether fine-tuning it, editing in a fact, or making it forget something, tends to change more than intended, and standard benchmarks miss the spillover. Like VibeCheck and Report Cards, this preprint describes the before-and-after differences in plain English after interventions, but adds a statistical check that each difference holds up on fresh examples before reporting it. Across several kinds of edits it caught both the intended changes and the unintended ones, flagging behaviours worth investigating rather than claiming to explain them.
EUDAIMONIA grades 22 models on how well they avoid acting like clingy AI partners (e.g., faking feelings, inventing a backstory or saying ‘I’m always here for you’), and finds that models that are given more time to think disclose their AI-ness less and fabricate more. An interesting aspect of the evaluation is that every violation label comes from a Claude Opus judge, and the Claude family takes the top spot in four of the nine categories.
Instead of grading a model’s final answer, TRACE grades its chain of thought, following Toulmin’s 1958 argumentation model and Flavell’s metacognition theory. The grades correlate highly (r=0.74) with correctness of the final answer. A notable exception is Claude reasoning impeccably from a wrong premise to a wrong answer in chemistry while still scoring 0.86 on TRACE.
Relatedly, this preprint challenges the use of traces to predict behaviour. To bypass both unreliable LLM-as-a-judge heuristics and prohibitively expensive resampling, the authors introduce “Behavior Forecasters”, specialized models (assessors) trained end-to-end to predict rerun consistency and counterfactual sensitivity directly from a single observed reasoning pass. This approach proves that while reasoning tokens frequently lie to human readers, they encode latent computational signals about future behaviour.
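To make the clone-robustness idea from the Spectral Ratings item above concrete, here is a deliberately crude sketch. It is not DeepMind's spectral method: as a stand-in, it assumes a simple neighbour count under cosine similarity as the density estimate and weights each prompt by the inverse of that count. The function names, the 0.95 threshold, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def density_weights(embeddings, threshold=0.95):
    """Weight each prompt by 1 / (number of prompts at least `threshold`
    similar to it, itself included), so a cluster of k clones sums to ~1."""
    weights = []
    for u in embeddings:
        neighbours = sum(1 for v in embeddings if cosine(u, v) >= threshold)
        weights.append(1.0 / neighbours)
    return weights

def density_normalised_accuracy(embeddings, correct, threshold=0.95):
    """Accuracy where near-duplicate prompts share a single vote."""
    weights = density_weights(embeddings, threshold)
    return sum(w * c for w, c in zip(weights, correct)) / sum(weights)

# Toy data (assumed): three exact clones of one prompt plus one distinct prompt.
embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
correct = [1, 1, 1, 0]  # the model only gets the cloned prompt right

plain = sum(correct) / len(correct)
robust = density_normalised_accuracy(embeddings, correct)
print(f"plain accuracy: {plain:.2f}, density-normalised: {robust:.2f}")
```

In this toy case the three clones inflate plain accuracy, whereas the weighted score treats them as one concept. The embedding-bias caveat applies just the same: the weights are only as good as the similarity function.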

Takes

We’ve all experienced the momentary elation of being told by a chatbot that our work is insightful, novel, and innovative, even when we know it is not. These anecdotes are usually explained in terms of a single unitary propensity: sycophancy. This preprint, drawing on a literature review and an expert survey, argues that sycophancy is in fact a fragmented concept, composed of several distinct propensities. Sycophantic tendencies vary according to whether the model is trying to maintain an inferred incorrect belief or please a particular person (e.g., the user), as well as how explicit the sycophancy is. The resulting taxonomy serves as a useful starting point for a more comprehensive benchmark to evaluate sycophancy.
This paper tackles rampant Goodharting in AI development, where developers aggressively over-optimize for public leaderboards at the expense of generalized capabilities. To combat this “leaderboard illusion,” they introduce a game-theoretic framework using randomized, private benchmarks to mathematically force builders into investing in broad capability coverage rather than shallow test-prep. Ultimately, strategically withholding evaluation data serves not just as an anti-cheating mechanism, but as a vital market intervention to shift the billion-dollar scaling race from narrow metrics to more valid evaluation approaches.

Findings & Results

In one of the largest and most complex human studies of AI persuasion to date, UK AISI’s Societal Impacts team studied whether frontier LLMs (Claude Opus 4.1 and 4.6, ChatGPT-4o, GPT-5.4, Grok 4.20 and Gemini 2.5 Pro) were more convincing than not only lay people, but also professional canvassers and world champion debaters. Indeed, AI debaters appear to be up to three times more convincing than professional canvassers from a UK fundraising firm at raising real-money donations for the charity Save the Children. This raises concerns in the medium term for potential misuse by bad actors seeking to elicit information or materials from third parties. It also raises the possibility of future misaligned super-persuasive AIs exploiting human interlocutors to take actions in the AIs’ interests, such as exfiltrating model weights or amassing compute resources, increasing the risk of loss of control.
Past benchmarks like GDPval and the Remote Labor Index tested real professional work but leaned on human graders. Agents’ Last Exam, from Berkeley’s RDI group, scales that idea to 1,490 expert-sourced tasks across all 55 SOC/O*NET digital subdomains and scores them with deterministic code instead. Even the best agents pass about a quarter overall and almost none of the hardest tier, with most failures coming from weak domain understanding rather than botched execution.
Will AI run the power grid? This paper builds a risk leaderboard for AI in the power grid using PROMETHEE II (an MCDA method that ranks options by tallying who beats whom across all criteria, then nets it into a single score), and the system that would autonomously run the grid lands dead last in every Monte Carlo run. Doubles as a worked example of turning the EU AI Act into a ranking problem.
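For the curious, the "who beats whom" tally in PROMETHEE II mentioned in the power-grid item above can be sketched in a few lines. This is a minimal version assuming the simplest ("usual") preference function, where any strict advantage on a criterion counts fully, and criteria where higher is better. The system names, criteria values, and weights are made up for illustration and are not taken from the paper.

```python
def promethee_ii(scores, weights):
    """Rank alternatives by PROMETHEE II net outranking flow.

    scores: dict mapping alternative name -> list of criterion values
            (higher is better on every criterion in this sketch).
    weights: list of criterion weights summing to 1.
    Uses the 'usual' preference function: full preference on any strict lead.
    """
    names = list(scores)
    n = len(names)

    def preference(a, b):
        # Weighted share of criteria on which a strictly beats b.
        return sum(w for w, x, y in zip(weights, scores[a], scores[b]) if x > y)

    net_flow = {}
    for a in names:
        others = [b for b in names if b != a]
        positive = sum(preference(a, b) for b in others) / (n - 1)  # a beats others
        negative = sum(preference(b, a) for b in others) / (n - 1)  # others beat a
        net_flow[a] = positive - negative
    return sorted(net_flow.items(), key=lambda item: item[1], reverse=True)

# Hypothetical alternatives scored on two hypothetical criteria (e.g. safety, oversight).
scores = {
    "forecasting assistant": [3, 3],
    "maintenance planner": [2, 1],
    "autonomous operator": [1, 2],
}
weights = [0.5, 0.5]

ranking = promethee_ii(scores, weights)
for name, flow in ranking:
    print(f"{name}: {flow:+.2f}")
```

Net flows always sum to zero, so the ranking is purely relative. Real applications typically use graded preference functions with indifference and preference thresholds instead of the all-or-nothing rule assumed here.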

Benchmarks and Leaderboards

Think360 is a multimodal reasoning benchmark that introduces “reasoning width” (parallel exploration, pruning, and backtracking) as a complement to conventional reasoning depth, composed of 1,225 multimodal problems. Using a GPT-4o–based Tree-of-Thought analysis pipeline, the authors report that width correlates more strongly with benchmark accuracy than depth (ρ=0.92 vs. ρ=0.85). However, the methodology relies heavily on GPT-4o/GPT-4o-mini for filtering, reasoning extraction, and judging, creating evaluator dependence.
PURESPACE is a benchmark that tries to isolate pure spatial reasoning (rotation, projection, completion over abstract block-stacking objects) from general visual perception, using a synthetic generator that produces 100K+ samples with fine-grained difficulty control. It only contains three narrow synthetic tasks and a tiny human baseline (60 questions, 16 participants). Interestingly, humans score ~94–97% while every VLM stays below ~45%; many sit at or below the 25% random baseline, with several open-source models doing worse than random on hard questions.
This preprint introduces a game-theoretical model of the iterated interaction between a red-teamer looking for jailbreaks and a trainer who addresses the jailbreaks found. The model assumes that the two players can only transform an initial set of queries, and abstracts those through “group actions”. They provide some empirical evidence that this framework captures real dynamics for small fine-tuned models, but will this scale to more realistic scenarios?
This preprint tackles the challenge of verifying end-to-end autonomous research capabilities in AI agents by introducing ResearchClawBench, a benchmark comprising 40 tasks across 10 scientific domains grounded in real, hidden target papers. To move beyond simple factual recall or coding tests, the authors evaluate whether agents provided with raw data and literature can independently recreate a study’s core results. Results on seven autonomous agents and 17 native LLMs revealed that current systems are far from reliable at scientific synthesis. Ultimately, the paper highlights that while modern AI can utilize tools and produce polished reports, it consistently fails at experimental protocols, evidence matching, and identifying the scientific core, demonstrating that reliable autonomous scientific discovery remains an unsolved frontier.
This preprint tackles a glaring gap in the mathematical reasoning capabilities of modern frontier AI by introducing ComBench, an Olympiad-level benchmark dedicated to the notoriously difficult domain of combinatorics. Recognizing that current models struggle with the deep discrete reasoning and creative structural insights required for such problems, the authors curated 100 human-annotated, competition-grade challenges divided into analysis-centric proofs and construction-centric tasks. Ultimately, the paper highlights that rigorous proof reasoning and constructive realization are distinctly different capabilities; while GPT-5.5 leads in analysis-centric proof grading, Kimi-K2.6 surpasses it in construction-centric performance, demonstrating that existence and construction problems remain the most formidable unsolved frontier for mathematical AI.

Getting the Digest: Once a month if you join at aievaluation.substack.com.

Contributors: Peter Romero, Jose H. Orallo, Fernando Martinez-Plumed, Kozzy Voudouris, Zack Tidler, Behzad Mehrbakhsh, Lorenzo Pacchiardi, Wout Schellaert.
