Act One: After a mere 75 years of waiting for Godot, researchers have rediscovered what we all knew about the Turing Test (see here and here): AI can fool you into thinking it's human. In yet another paper repeating a variation of the experiment, the researchers dressed up GPT-4.5 and Meta's LLaMa 3.1-405B in a bespoke “persona” prompt and threw them (together with ELIZA and GPT-4o) into four-person, five-minute chat sessions with students and Prolific workers. After Beckett’s Pozzo shouted “think!”, GPT-4.5-Persona convinced the judges it was human 73% of the time (beating even real humans), while LLaMa scored a modest 56% (not reliably above chance with the undergraduates, though it was with the Prolific workers).
Act Two: Despite major concerns about the paper's methodology and what it actually means, it reaffirms that some humans don’t look human, and that what matters most for the original imitation game is a good costume (none of the LLMs without the persona prompt performed significantly better than ELIZA). The only thing certain is that this was the least energy-efficient Turing Test in history, with GPT-4.5-preview due to be deprecated very soon.
To this, Beckett once said: “It is a game, everything is a game. When all four of them are lying on the ground, that cannot be handled naturalistically. That has got to be done artificially, balletically. Otherwise everything becomes an imitation, an imitation of reality”.
AI Agents’ Capabilities
There have been a number of papers that put autonomous agents through their paces and reveal persistent gaps in reliability, context handling, and reasoning:
On the virtual work front, Mechanize Inc.’s realistic, long-horizon environments target the multimodal coordination challenges of everyday office tasks, aiming to unlock automation across the $60 trillion global labour market; meanwhile, REAL tests agents on 112 tasks, from shopping to networking, across 11 high-fidelity website replicas (frontier models achieve just 41% success). On the open web, BrowseComp’s 1,266 challenging search questions expose the 1-2% accuracy ceiling of generic models without browsing tools, while the purpose-built 'Deep Research' agent leaps to over 50% by tightly integrating search and inference.
Research replication remains elusive: PaperBench tasks agents with rebuilding codebases, running experiments, and matching results from 20 recent ICML papers against author-approved rubrics. Top systems such as Claude 3.5 Sonnet average only ~21%, well below human ML PhDs. Spatial reasoning is no easier: a new four-part VLM benchmark (spatial relations, navigation, mental rotation, visualisation) finds 13 leading models scoring near chance, and a GIScience-focused benchmark shows GPT-4 Turbo below 25% on landmark, route, and survey tasks (though a hybrid LLM+GIS system jumps above 70%). TALES, which measures spatial, deductive, inductive and grounded reasoning capabilities, shows zero progress by agents in human-generated text adventures. And in the realm of self-replication, RepliBench offers 65 tasks revealing that even agents skilled at cloud APIs and crypto still fail at KYC checks, secure weight exfiltration, and durable deployments.
Evaluation Cards
"Audit Cards for AI Evaluation" is a one-page template that documents auditor credentials, conflict-of-interest checks, scope definitions, resource access, and all the fine print you definitely didn't read the last time someone said "trust us". And if one card isn't enough, SPHERE pulls out its own "evaluation card" for human-AI systems, asking five questions: what are you testing, how and when, who is involved, and how will you validate results? They even applied SPHERE to 39 recent human-AI systems and, shockingly ;), found evaluation gaps!
Psychometrics
There have also been a few new frameworks that bring human‑style testing to AI:
The first combines Item Response Theory (measuring classifier ability on hard instances) with the Glicko‑2 rating system (capturing robustness across datasets) to deliver a fairer, more nuanced ranking of algorithms. In an OpenML-CC18 case study, they find that only 15% of datasets really challenge classifiers, show that a well-chosen 50% subset retains full performance, and report Random Forest as the top-ranked model.
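To make the recipe concrete, here is a toy illustration of the two ingredients (our own sketch, not the authors' code, with a simple Elo update standing in for full Glicko-2): a Rasch-style fit gives each classifier an ability estimate from instance-level correctness, and dataset-level "matches" are then aggregated into a robustness rating.

```python
# Illustrative sketch only (not the authors' code): instance-level ability
# via a crude Rasch (1PL) fit, plus an Elo-style update standing in for
# Glicko-2 to aggregate dataset-level "matches" into a robustness rating.
import math

def rasch_ability(correct, difficulty, iters=500, lr=0.05):
    """MLE of a single ability theta given 0/1 correctness and assumed
    per-instance difficulties (gradient ascent on the 1PL log-likelihood)."""
    theta = 0.0
    for _ in range(iters):
        grad = sum(c - 1 / (1 + math.exp(-(theta - d)))
                   for c, d in zip(correct, difficulty))
        theta += lr * grad / len(correct)
    return theta

def elo_update(r_a, r_b, score_a, k=32):
    """One pairwise update; score_a = 1 if A beats B on a dataset, 0.5 for a tie."""
    exp_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - exp_a)
    return r_a + delta, r_b - delta

# Toy usage: ability from one dataset's instance-level results ...
theta_rf = rasch_ability([1, 1, 0, 1], [0.2, 0.9, 1.5, -0.3])
# ... and a cross-dataset rating from per-dataset win/loss outcomes.
r_rf, r_svm = 1500.0, 1500.0
for rf_beats_svm in [1, 1, 0]:          # three datasets, three "matches"
    r_rf, r_svm = elo_update(r_rf, r_svm, rf_beats_svm)
print(round(theta_rf, 2), round(r_rf), round(r_svm))
```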
The second tool, R.U.Psycho (April’s best paper title award!), is a Python framework for designing, running, and documenting classic psychometric experiments (e.g., BFI‑44, Trolley‑Style Moral Dilemmas, etc.) on any generative LM using simple JSON configuration files, customisable prompts, and built-in post-processing.
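We won't reproduce R.U.Psycho's actual schema or API here, but the general pattern of running a questionnaire against a generative LM looks roughly like this (purely hypothetical config and a stub model call):

```python
# Toy illustration of the general pattern (NOT R.U.Psycho's actual schema
# or API): drive a Likert-style questionnaire from a config, send each
# item to a model, and parse the numeric answer.
import re

config = {                     # hypothetical config, normally a JSON file
    "scale": "1 = disagree strongly ... 5 = agree strongly",
    "items": ["I see myself as someone who is talkative.",
              "I see myself as someone who tends to find fault with others."],
    "prompt": ("Rate the statement on a scale where {scale}.\n"
               "Statement: {item}\nAnswer with a single number."),
}

def ask_model(prompt: str) -> str:
    """Stub for whatever LLM client you use (OpenAI, local model, ...)."""
    return "4"                 # replace with a real API call

def run_questionnaire(cfg):
    scores = []
    for item in cfg["items"]:
        reply = ask_model(cfg["prompt"].format(scale=cfg["scale"], item=item))
        match = re.search(r"[1-5]", reply)
        scores.append(int(match.group()) if match else None)
    return scores

print(run_questionnaire(config))   # e.g. [4, 4]
```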
Finally, it’s often overlooked that successful deployment of AI in hybrid work settings demands human acceptance; an AI attitude scale can be a helpful tool for gauging degrees of inclusivity, and thus likely success, shifting evaluation to a higher, systemic level rather than just the individual AI.
Benchmarking Methodologies and Frameworks
This month's benchmark innovations span LLMs, VLMs, robotics and energy efficiency, taking evaluation far beyond static prompts.
LLM benchmarks are testing everything from code synthesis to open-ended reasoning and social skills. CodeARC, an inductive program synthesis benchmark built on a suite of 1,114 Python functions, lets reasoning LMs query a hidden target function and a differential testing oracle to self-correct their candidate programs, yet the best model (o3-mini) tops out at 52.7% success; fine-tuning on synthetic 'reasoning traces' delivers a further 31% relative gain. A companion benchmark on code reasoning uses DSL sampling and targeted mutations to probe generalisation in and out of distribution, finding that modern reasoning-tuned LLMs handle mutated code almost perfectly where earlier models merely pattern-matched.
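To give a flavour of CodeARC's interactive setup, here is a minimal sketch of the query-then-refine loop (ours, not the benchmark's harness; the hidden function, the oracle and the "agent" are all toy stand-ins):

```python
# Minimal sketch of the interaction pattern (not the CodeARC harness):
# the agent can (1) query the hidden function on chosen inputs and
# (2) ask a differential-testing oracle for a counterexample to its
# current candidate, then refine.
import random

def hidden_function(x):            # unknown to the agent
    return 2 * x + 3

def oracle(candidate, trials=1000):
    """Differential testing: return an input where the candidate disagrees
    with the hidden function, or None if none is found."""
    for _ in range(trials):
        x = random.randint(-100, 100)
        if candidate(x) != hidden_function(x):
            return x
    return None

# A (very) naive "agent": fit a line through two queried points, then
# ask the oracle whether the candidate survives differential testing.
x0, x1 = 0, 1
y0, y1 = hidden_function(x0), hidden_function(x1)      # query phase
slope, intercept = y1 - y0, y0
candidate = lambda x: slope * x + intercept
counterexample = oracle(candidate)
print("candidate f(x) = %dx + %d, counterexample: %s"
      % (slope, intercept, counterexample))
```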
At a higher level of abstraction, SuperARC uses recursive compression and algorithmic probability to challenge the capability to abstract, synthesise models and make optimal predictions. Experiments show that LLMs rely primarily on memorisation and falter at basic pattern synthesis, while a hybrid neurosymbolic method based on Kolmogorov complexity outperforms them on all tasks. KUMO pairs LLMs with symbolic solvers for endless multi-turn reasoning games, revealing that vanilla LLMs default to memorisation, while neurosymbolic hybrids and reasoning-scaled variants can match or exceed university-level performance. TextArena, a gym-style suite of 57 text games (from logic puzzles to social deduction), lets models compete against each other and human players to measure complex social skills such as theory of mind, persuasion and deception. NoveltyBench's 1,100-prompt diversity metric exposes persistent gaps in creative response diversity (and in benchmark-name originality), where larger models often collapse to fewer unique responses than their smaller counterparts. Finally, ChatBench converts 396 MMLU questions into 7,336 real human-AI conversations and shows that AI-only benchmarks misestimate collaborative performance.
Vision-language model evaluation sees similar leaps. Video SimpleQA presents 41 top VLMs with concise, externally verified factual questions requiring visual, temporal and open-world knowledge, revealing rampant hallucination, overconfidence and efficiency/accuracy trade-offs in retrieval-assisted generation. A taxonomy-aware evaluation framework maps free-form text output to a hierarchical taxonomy and scores it with hierarchical precision and recall, so that partially correct but less specific answers receive partial credit; the authors show that standard text similarity measures fail to capture taxonomic relationships. And MapBench's 1,600+ path-finding queries on 100 real-world outdoor maps (linked via a Map Space Scene Graph) highlight persistent weaknesses in spatial reasoning and route planning, despite incremental gains in step-by-step instruction following.
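For the taxonomy-aware scoring, the standard ancestor-overlap formulation of hierarchical precision and recall conveys the idea of partial credit (our sketch; the paper's exact scoring may differ in the details):

```python
# Minimal sketch of hierarchical precision/recall over a toy taxonomy
# (the standard ancestor-overlap formulation).
parent = {"golden retriever": "dog", "dog": "mammal", "mammal": "animal",
          "cat": "mammal"}

def ancestors(label):
    """The label plus all of its ancestors in the taxonomy."""
    out = {label}
    while label in parent:
        label = parent[label]
        out.add(label)
    return out

def hierarchical_pr(pred, gold):
    p, g = ancestors(pred), ancestors(gold)
    overlap = len(p & g)
    return overlap / len(p), overlap / len(g)   # (hP, hR)

# "mammal" for a golden retriever is partially right: perfect precision,
# but low recall because it misses the more specific levels.
print(hierarchical_pr("mammal", "golden retriever"))   # (1.0, 0.5)
print(hierarchical_pr("cat", "golden retriever"))      # (0.667, 0.5)
```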
Finally, benchmarks move off the screen and onto the real workbench. AutoEval autonomously evaluates robotic manipulation policies on real hardware, using vision-based success detectors and self-recovering reset controllers to run continuous experiments without human intervention. On the infrastructure side, MLPerf Power establishes a standard methodology for measuring ML power efficiency, from microwatt edge devices to megawatt data centre clusters, reporting 1,841 measurements across 60 platforms and revealing critical trade-offs between computational complexity, throughput and power consumption.
Evaluation Methods
New evaluation methods are pushing us to look beyond raw accuracy:
The Model Utilization Index (MUI) is a new metric that brings mechanistic interpretability techniques in to complement traditional performance metrics. It uses neuron- and feature-level interpretability to measure how much of an LLM’s capacity is engaged by each task, adding an “effort” perspective. In their experiments, the authors uncover a consistent inverse-log “utility law” between MUI and performance, and propose four optimisation directions during training (evolving, accumulating, coarsening, and collapsing) that clarify capability gains, specialisation trade-offs, and data-contamination impacts.
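Our rough reading of the idea, boiled down to a few lines (not the paper's exact definition, which is considerably more careful about which neurons and features count):

```python
# The gist of a utilization-style measure (our reading, not the paper's
# exact definition): the fraction of a model's neurons that are strongly
# activated at least once while solving a task's examples.
import numpy as np

def utilization_index(activations, top_quantile=0.99):
    """activations: array of shape (examples, neurons).
    A neuron counts as 'used' if it ever exceeds the global
    top_quantile activation threshold on this task."""
    threshold = np.quantile(np.abs(activations), top_quantile)
    used = (np.abs(activations) >= threshold).any(axis=0)
    return used.mean()

rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 4096))          # toy "task" activations
print(f"MUI ~ {utilization_index(acts):.3f}")
```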
Reasoning models often “overthink”, producing more reasoning tokens than needed, particularly for easy problems. To tackle this inefficiency, ThoughtTerminator (oh, we love this one) is a training-free, black-box method that imposes a token budget and repeatedly reminds the model of the remaining tokens, achieving the same accuracy at a lower cost. The token budget is estimated by the model itself or by an external fine-tuned LLM.
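A toy rendition of the recipe (not the authors' implementation; the model call is a stub) looks something like this:

```python
# Toy rendition of the idea (not the authors' implementation): generate in
# chunks, remind the model of its remaining token budget after each chunk,
# and force an answer once the budget runs out.
def generate(prompt: str, max_tokens: int) -> str:
    """Stub for any chat/completions call."""
    return "...reasoning chunk..."

def budgeted_reasoning(question: str, budget: int, chunk: int = 128) -> str:
    transcript = question
    remaining = budget
    while remaining > 0:
        step = generate(transcript, max_tokens=min(chunk, remaining))
        remaining -= min(chunk, remaining)
        transcript += "\n" + step
        transcript += f"\n[You have {remaining} reasoning tokens left.]"
    # Budget exhausted: demand the final answer with no further reasoning.
    return generate(transcript + "\nGive the final answer now.", max_tokens=32)

print(budgeted_reasoning("What is 17 * 23?", budget=256))
```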
Dutch Scaler is yet another performance indicator for binary classifiers that measures how much a model really learns by situating its metric score between an input‑independent “Dutch draw” baseline and an optimal "Dutch oracle" benchmark.
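The gist of the rescaling, in one line of arithmetic (our sketch; the paper defines it per metric):

```python
# The gist of the rescaling (our sketch; see the paper for the exact
# per-metric definition): place the observed score between the best
# input-independent "Dutch Draw" baseline and the optimum.
def dutch_scaled(score, baseline, optimum=1.0):
    """0 = no better than an input-independent guess, 1 = perfect."""
    if optimum == baseline:
        return 0.0
    return (score - baseline) / (optimum - baseline)

# e.g. 0.93 accuracy sounds great, but with 90% class imbalance the
# always-majority baseline already gets 0.90:
print(dutch_scaled(0.93, baseline=0.90))   # 0.3
```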
Ups and Downs
LLMs can turbo-charge ideation and team collaboration, but they still lag behind in true creative composition and rigorous argumentation, and remain strongly biased:
It seems that current benchmarks miss “creative composition” use cases such as writing cover letters and personal stories, or brainstorming ideas. Shen and Guestrin argue that we need new, use-based benchmarks for creative writing and ideation to properly measure these models’ capabilities and their societal impacts.
In a large-scale field experiment at Procter & Gamble, professionals using a GPT-4-based AI assistant generated new product ideas as effectively as two-person teams, broke down silos between R&D and commercial roles, and even reported higher positive emotions (in contrast to previous work). These findings suggest that generative AI can act as a true 'cybernetic teammate', transforming performance, expertise sharing and social engagement in knowledge work.
Haase et al. put 14 popular LLMs through two standard creativity tests (the Divergent Association Task and the Alternative Uses Task) and find no overall creative gains over the past two years (GPT-4 actually scores lower than before), although every model still beats the average human on the AUT, with only 0.28% of their output reaching the human top 10%.
In “Brains vs. Bytes”, the authors test the mathematical reasoning capabilities of leading LLMs on 455 IMO shortlist problems and find that, while these models sometimes guess the right answer, they almost never produce fully valid proofs and often hallucinate or rely on heuristics. They also show that LLMs can't reliably verify each other's solutions.
No LLM is free from bias! This paper presents a framework for assessing bias in a small set of open-source LLMs (TinyLLaMA-1.1B, Phi-3.5B, Mistral-7B and LLaMA3.1-8B) by examining them across eight bias categories (from gender and race to age and socio-economic status) using five prompting strategies and standardised metrics (LMS, SS, ICAT). They find that all models exhibit significant bias (particularly in gender and occupation), with LLaMA3.1-8B emerging as the least biased.
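For reference, ICAT as defined in StereoSet combines the language modelling score (LMS) with the stereotype score (SS), and is maximised when the model is both fluent and unbiased (SS close to 50):

```python
# StereoSet's ICAT score: combines the language modelling score (LMS) with
# the stereotype score (SS); it is maximised when the model is both fluent
# (high LMS) and unbiased (SS close to 50).
def icat(lms: float, ss: float) -> float:
    return lms * min(ss, 100 - ss) / 50

print(icat(lms=92.4, ss=50.0))   # ideal bias: ICAT == LMS
print(icat(lms=92.4, ss=65.0))   # stereotyping drags the score down: 64.68
```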
[DigestTerminator]
Contributors to this month’s digest: Nando Martínez-Plumed, Jose H. Orallo, Lorenzo Pacchiardi, Peter Romero, Lexin Zhou, Marko Tesic, Joseph Castellano.
News to share? Feel free to reach out to ai.evaluation.newsletter@gmail.com
Getting the digest: Once a month if you join.