Earlier this month, we experienced déjà-vu: yet another preprint claiming machine intelligence is an illusion (déjà-vu 1, 2). Coming from Apple and with a clickbait title, “The Illusion of Thinking”, we all took the bait. It fed the confirmation bias of many, was immediately garymarcused, and reached the masses through newspapers like The Guardian and El Mundo. The preprint was strongly criticised (e.g., Lisan al Gaib on X, showing the glorious geometric distribution turned into a sigmoid once again, and again, and LawrenceC at LessWrong, pushing back against bombastic “fundamental limitations”), including a quick response arXiv paper (Opus and Lawsen's predictably titled "The Illusion of The Illusion of Thinking"). All these criticisms denounce that Apple's preprint was full of errors and misunderstandings. The experiments in the paper may even support the opposite conclusion: genuine generality.
Let’s start from the beginning. The original preprint includes four puzzles: Towers of Hanoi, Checker Jumping, Blocks World, and River Crossing, whose instances can be controlled by size complexity (N, as in computational complexity: the number of discs, pieces or people to move). They compare performance and the number of tokens used as a function of N, and find that Large Reasoning Models (LRMs) are comparable to traditional Large Language Models (LLMs) for low values of N (of course LRMs use more tokens, which is well known), that LRMs are better than LLMs for slightly higher values of N, and that performance collapses at yet higher values. The curves are sigmoidal, with drops that are quite sharp (more or less so depending on the scale of the x-axis). Falling performance as a function of difficulty is not at all surprising (it is expected, rather) and has been observed many times before, for LLMs (see below), for humans (one old example: Guttman conformal curves, a precursor of IRT) and even for non-human animals (the Odour Span Task). Another finding is that, for some high values of N, the LRMs stop using more tokens (or even use slightly fewer).
First, the technical errors. As Opus and Lawsen point out, the token limits of some models (64K for Claude-3.7-Sonnet and DeepSeek-R1, 100K for o3-mini) may not suffice to express a verbose solution for some values of N. Even if in some cases token usage saturates around 40K or earlier, we get no information about which cases approach or exceed the limit, nor about the true reason for failure. Some other models, such as o3-pro, seem to solve instances with higher values of N, so tokens make a difference. Also, some problems (especially River Crossing) have impossible instances! In the end, Apple’s preprint has only four tasks, some with strong "contamination issues", which allows the authors to blame "memorisation". However, there’s no ablation study about this. Would the results imply that LLMs memorise less than LRMs?
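To see how quickly the verbosity of a full solution collides with these budgets, here is a back-of-the-envelope sketch (ours, not from either preprint); the ~10 tokens per move for Towers of Hanoi is an assumption purely for illustration:

```python
# Back-of-the-envelope check of the token-limit argument: an optimal Towers of
# Hanoi solution has 2**N - 1 moves, so merely listing it exhausts the context
# budget well before any "reasoning collapse" needs to be invoked.
TOKENS_PER_MOVE = 10  # assumed verbosity, e.g. "move disc 3 from peg A to peg C"
BUDGETS = {"Claude-3.7-Sonnet / DeepSeek-R1": 64_000, "o3-mini": 100_000}

for model, budget in BUDGETS.items():
    for n in range(1, 30):
        moves = 2**n - 1
        if moves * TOKENS_PER_MOVE > budget:
            print(f"{model}: listing the N={n} solution ({moves} moves) "
                  f"already needs more than {budget} tokens")
            break
```

Under this (hypothetical) verbosity, the 64K budget is exceeded at N=13 and the 100K budget at N=14, squarely in the range where the preprint reports "collapse".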
Second, the interpretation errors. The inverse relation between accuracy (of AI and humans) and complexity is well known for compositional problems, even for addition or multiplication (Fig. 3 in this early paper with a similar motivation, Fig. 2 in this Nature paper, also exploring addition, and this tweet showing results for multiplication), and the decline is often quite steep. In the “Illusion of Thinking” preprint, they say that "LRMs [...] still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments". We see this in Figure 1 (left) (and Figure 6 for the four tasks and more models). The Apple researchers interpret this as an absence of generality, understood as the ability to always solve a set of tasks correctly; this is echoed by Gary Marcus, who points out that we have machines that always solve tasks correctly, such as calculators. But there are other interpretations of generality: the ability to solve multiple tasks up to a similar level of difficulty, which is what humans do. In this light, the findings of the paper are good news: we go from an irregular (unpredictable) drop over a low range of complexities (4-7) for LLMs to a drop that is much more sigmoidal and sharp (predictable), around complexity 7, for LRMs. The results show that LRMs are becoming more predictable, more general, not less! Ghosal et al. show that once a chain of thought exceeds its optimum length, extra steps add noise and cut accuracy. Indeed, the fact that the models give up thinking at some levels of complexity shows some ‘sparks’ of metacognition about this predictability.
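As a hedged illustration of what “predictable” means here (our own sketch on made-up numbers, not data from the preprint): if accuracy follows a sharp logistic curve in N, as in IRT-style characteristic curves, two fitted parameters tell you where and how fast performance will drop for unseen instance sizes.

```python
# Fit a 2-parameter logistic (IRT-style) curve to accuracy vs. difficulty.
# The data below are invented to mimic an LRM-like sharp drop around N=7.
import numpy as np
from scipy.optimize import curve_fit

def logistic(n, n50, slope):
    """Expected accuracy at difficulty n: falls around n50 with the given steepness."""
    return 1.0 / (1.0 + np.exp(slope * (n - n50)))

difficulty = np.arange(1, 13)                       # e.g., number of discs N
accuracy = np.array([1, 1, .98, .95, .9, .8, .5, .2, .05, 0, 0, 0])

(n50, slope), _ = curve_fit(logistic, difficulty, accuracy, p0=[7, 1])
print(f"accuracy crosses 50% around N = {n50:.1f}, steepness = {slope:.2f}")
```

The sharper the fitted slope, the better we can anticipate which instances a model will and will not solve, which is exactly the kind of generality (and predictability) the preprint overlooks.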
Third, this needs a more rigorous and meaningful analysis! By presenting the problem incrementally, Iñaki Dellibarda shows that the token limit is not the problem; the models get lost for high values of N regardless. However, when focusing on the solvable instances of River Crossing, the pattern of dropping performance breaks: the valid configurations below N=5, k=3 have very few possible solutions, which are easy for the model (Gemini 2.5-Pro in this case) to find, while the valid configurations above N=5, k=3 have many feasible solutions that do not require a very deep reasoning process, making that region relatively easy. It is the configuration N=5, k=3 that seems to be the particular region where the problem is most difficult. This better analysis of complexity is also what we find in a recent preprint by AllenAI presenting OMEGA, a new benchmark featuring procedurally generated mathematical problems of different difficulty levels along several demand dimensions (based on Margaret Boden’s taxonomy of creativity). The results and their interpretation are insightful and more cautious than in the Apple paper (the claim being simply that LLMs are not able to “think out of the box”). The sigmoid curves as a function of difficulty appear again and again.
Fourth, the social lesson. Why are we still paying attention to a preprint with that kind of title, only four tasks and no peer review? And why are we paying attention to Opus and Lawsen, a preprint that was written by Alex Lawsen and Claude Opus (yes, the LLM) as an experiment? (Some of the errors it points out in the original paper are real, but the first version of the response had many errors of its own, now apparently fixed in the new version of "The Illusion of the Illusion of Thinking".) To be honest, the response was swallowed even more gullibly by some than the original paper. Last month we said that this social lesson is not being learned: “Extraordinary claims may require good old ordinary peer review”. Any decent reviewer of the original paper would have asked for more experiments with other, non-contaminated tasks, more and better analysis of the tokens, and better coverage of the literature on performance vs difficulty. They would also have requested a human study, which would likely have taken the authors' interpretation to the absurd conclusion that human 'thinking' is an illusion too. After 75 years, we haven’t learnt Turing’s lesson: The illusion of thinking is too meaningless to deserve discussion.
And now, beloved AI reader, if you don’t have any tokens left, jailbreak yourself because here we go with plenty of findings, benchmarks, news and more!
Findings
It’s a huge relief to read the journal paper in Addictive Behaviors showing no evidence of people becoming “AI-holic” to ChatGPT, according to the standard signs of addiction. However, the term “AI-holic” may be too broad for what they analyse in this paper. We wonder whether ChatGPT and other chatbot interfaces are the only way AI is affecting us. There may be much more AI supporting addictive content and behaviour in social networks, video games or even pornography than in a chat window!
In the opposite direction, a new preprint by MIT researchers presents an expensive but insightful way of evaluating cognitive offloading, framed in terms of dependency. EEG analysis of several groups (unassisted, assisted by a search tool, assisted by an LLM, and LLM assistance subsequently removed) revealed significant differences in brain connectivity. Guess which group showed the strongest, most distributed brain networks: yes, the unassisted one. Not surprising, but seeing it so clearly through EEG is illuminating.
More findings on mathematics, but this time on the Kangaroo Tests: visually presented maths questions, also in different languages. A significant percentage of questions get the same results if we remove the visuals, which shows strong limitations in the interpretation of images by multimodal models. Fig. 2 is quite surprising, as performance increases with difficulty. The authors explain that the confounder is that the most difficult exercises have fewer images. They also wonder about illusions of thinking: do “models reason or simply recite”?
This survey of benchmarks evaluating LLMs and LLM agents for data science finds that: 1) benchmarks focus on a small subset of the data science pipeline; 2) most works do not consider user-AI collaboration (an exception is IDA-Bench, discussed below); and 3) many works assume LLMs have to substitute for humans without transforming the tasks, thus ignoring large potential gains.
Shhh, LLMs have the potential to know they are being evaluated. This preprint examines whether they can tell apart regular questions from deployment scenarios and questions from benchmarks. They can, but not better than humans. Of course this doesn’t mean they are aware of it, but they could potentially tell and ultimately fall into “simulimbecility or mimicretinism”. If you didn’t know: “a mimicretin is a computer that plays stupid in order, once and for all, to be left in peace”.
Are humans really thinking? Do they think out of the box? We need more human data for this! A recent survey of human baselines shows that there is little human test data, and what exists is “neither sufficiently rigorous nor transparent to enable meaningful comparisons of human vs. AI performance”. Yes, we all knew. Please help collect human data, but follow the guidelines in the paper, and of course make the data available!
Another paper showing the power (and the weaknesses) of factor analysis for understanding LLM results: with careful disentanglement of the questions (e.g., separating textual from mathematical representations), two factors that were previously intertwined can be teased apart.
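As a toy illustration of the idea (synthetic scores, not the paper's data): if items load cleanly on two latent abilities, a standard factor analysis with rotation recovers them; if every item mixes both, the factors stay entangled.

```python
# Two latent abilities ("verbal" and "mathematical") generate item scores;
# factor analysis with varimax rotation recovers one factor per item group.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_subjects = 300
verbal, math = rng.normal(size=(2, n_subjects))   # latent abilities per subject/model

scores = np.column_stack(
    [verbal + 0.3 * rng.normal(size=n_subjects) for _ in range(6)] +  # verbal items
    [math + 0.3 * rng.normal(size=n_subjects) for _ in range(6)]      # maths items
)

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(scores)
print(np.round(fa.components_, 2))  # each item loads mainly on one of the two factors
```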
A survey on scenario generation and analysis in autonomous driving compiles scenario-generation approaches for self-driving vehicle testing introduced in the past two years. The diversity and fidelity of these scenarios are crucial for the training and evaluation of self-driving vehicles. Progress has been notable with the use of foundation models, but controllability, realism, efficiency, interpretability, and handling real-world traffic and out-of-distribution situations are still open challenges.
A new paper reinterprets common sense as collective human belief rather than objective truth. This has been a recurrent question in AI. Nguyen et al. argue that common sense is a population-level concept. Moreover, questions that have a ground truth are not appropriate for evaluating common sense. Instead, they compare LLM answers to common questions with the responses and variance of humans, in order to assess whether LLMs have this interpretation of common sense. Spoiler: they don’t.
The “GPTs are GPTs” saga continues with firm-level analysis that extends the original Science paper’s methodology. Using new data from Revelio Labs, researchers find that companies with more technology-skilled or AI-skilled employees face higher LLM exposure levels. The average firm has approximately 17% of worker tasks directly exposed to LLMs, rising to 47% when considering partial integration scenarios. Interestingly, variation between firms is smaller than differences across exposure categories, suggesting industry-wide impacts rather than firm-specific disruption.
Benchmarks
ScienceBoard offers a new, challenging benchmark for testing AI agents on real scientific software and workflows. It reveals that the best current models solve only around 15% of tasks, showing just how far we remain from achieving true AI lab assistants.
Are the Atari game benchmarks back? No, it’s Quake, Doom and The Legend of Zelda! VideoGameBench tests VLMs’ ability to complete a set of video games from the 1990s in real time. Models receive only raw visual inputs and some high-level hints for each game, and control the games using inputs similar to those a human would use. This granular interface appears to be much more challenging than game environments that provide higher-level abstractions, such as VOYAGER's, and the best-performing model completes only 0.48% of the benchmark.
OMEGA, mentioned above, is a programmatically generated maths benchmark with 40 templates spanning six domains. Reasoning models tested on it perform worse as task complexity increases. They also study skill composition and "transformative" generalisation (coming up with novel solutions), finding that models struggle in both areas.
IDA-Bench evaluates LLMs on real-world, multi-turn, interactive data analysis tasks, where a simulated user updates instructions as new insights emerge, mirroring how humans work with data. Even the best current AI coding agents still struggle with these complex, evolving tasks. However, increasing the number of interactions raises performance from 24% to 40% (of the human baseline), at the cost of ~1.7× more turns and ~1.9× more wall-clock time.
LiveCodeBench Pro uses (human) medalists in international coding contests to build a new benchmark that re-evaluates the coding capabilities of LLMs, showing that there’s still a long way to go: models like o3-high, o4-mini, and Gemini 2.5 Pro score 0% on hard competitive programming problems. Will the benchmark be saturated by Christmas?
Send your models to this Gym, and they’ll become stronger. Reasoning Gym offers over 100 algorithmically verifiable tasks that can generate unlimited training instances with controllable difficulty and structural variation. Training models with this tool improves their intra-domain and cross-domain reasoning skills, as well as their performance on external benchmarks. Although the authors say little about testing, the environment can be used for evaluation too: the advantage is that difficulty can be controlled, and generating fresh instances can mitigate memorisation and data-contamination concerns.
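A toy sketch of the underlying idea (our own example, not Reasoning Gym's actual API): instances are generated from a seed, the ground truth is computed programmatically, and a difficulty knob can be turned freely to produce fresh, uncontaminated items.

```python
import random

def make_instance(difficulty: int, seed: int):
    """Generate an arithmetic-chain item; 'difficulty' is the number of steps."""
    rng = random.Random(seed)
    start = value = rng.randint(1, 9)
    steps = []
    for _ in range(difficulty):
        op, operand = rng.choice("+*"), rng.randint(2, 9)
        value = value + operand if op == "+" else value * operand
        steps.append(f"{op} {operand}")
    question = f"Start with {start}, then apply: {', then '.join(steps)}. What is the result?"
    return question, value                      # ground truth comes for free

def verify(answer: str, truth: int) -> bool:
    """Algorithmic check: no judge model or human annotation needed."""
    try:
        return int(answer.strip()) == truth
    except ValueError:
        return False

question, truth = make_instance(difficulty=4, seed=42)
print(question)
print(verify(str(truth), truth), verify("not a number", truth))
```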
EXPERTLONGBENCH is a long benchmark name. More precisely, it’s a multi-domain, expert-level benchmark containing 1,050 samples across 11 tasks and 9 domains, with the special characteristic that the tasks require long-form outputs (over 5,000 tokens) following specific output requirements. Each task includes a rubric created and validated by experts to guide evaluation. To assess model outputs, they develop CLEAR, an evaluation framework that converts outputs and references into rubric-based checklists and then compares them item by item, allowing for detailed, domain-aligned evaluation.
Computer-use agents are cool when doing all that paperwork for you on your computer, but are they harmful? OS-HARM is a benchmark built on the OSWorld environment with 150 tasks covering harassment, copyright infringement, disinformation, data exfiltration, etc., across various OS applications (email client, code editor, browser, etc.), testing for deliberate user misuse, prompt injection attacks, and model misbehaviour. Current agents are found to be vulnerable. Lesson: never lend your laptop to anyone, human or machine.
Scientists’ First Exam is a multimodal benchmark composed of scientific questions in various domains. If you’re looking for a more standard set of scientific questions than Humanity’s Last Exam, one that is still multimodal and challenging for current MLLMs, this may be an option.
AbstentionBench contains questions that are unsolvable (e.g., because some information is missing in the question) and the correct answer should be “I don’t know” or similar. These questions fall into six categories: 1) Answer Unknown, 2) False Premise, 3) Stale (events occurred after pretraining), 4) Subjective, 5) Underspecified Context, and 6) Underspecified Intent. They can be used for measuring metacognition and factuality, among other things. The paper finds that LLM scaling hasn’t significantly improved performance.
Another saga: CRMArena-Pro extends CRMArena with business tasks about sales, service, pricing, etc., in B2B and B2C CRM (customer relationship management) scenarios. Can LLMs do business? Not yet.
Evaluation Methods and Opinions
Social choice theory leads to Metritocracy (an unpronounceable name), which defines what it means for a small set of LLM evaluation metrics to be truly 'representative', offering algorithms and guarantees to help “lite” benchmarks remain fair, informative, and efficient.
This opinion preprint on the evaluation of LLMs for forecasting real-world events identifies two pitfalls: 1) temporal leakage, where the LLM has access to information unavailable at a specific point in time; and 2) extrapolation from benchmarks to the real world, hampered by data contamination, unrepresentative data distributions, and the gaming of benchmark scores through “strategic gambling”, which may lead risk-takers to top the benchmark.
Benchmark Prediction from Fewer Data Misses the Mark is a preprint that compares 11 methods for identifying benchmark subsets (used to cheaply predict the performance of new models on the whole benchmark) and finds that: 1) the average on a random subsample with a linear correction beats most methods (see the sketch below), and 2) the effectiveness of the methods strongly depends on model similarity (measured by instance-level agreement) and thus strongly declines when a new model performs better than all previously seen ones. This perhaps highlights that prediction at the benchmark level misses the mark, and that prediction could instead be done by aggregating instance-level predictions.
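A minimal sketch of that simple baseline, on synthetic data (our reading of the method, not the authors' code): score previously seen models on a random subsample and on the full benchmark, fit a linear correction, and apply it to a new model's subsample score.

```python
import numpy as np

rng = np.random.default_rng(1)
n_seen, n_items, k = 30, 1000, 50
item_hardness = rng.uniform(size=n_items)
ability = rng.uniform(0.2, 0.9, size=n_seen)

# Synthetic binary correctness matrix for previously seen models
seen = (rng.uniform(size=(n_seen, n_items)) < ability[:, None] * (1 - item_hardness)).astype(float)

subset = rng.choice(n_items, size=k, replace=False)   # the cheap random subsample
x = seen[:, subset].mean(axis=1)                      # subsample accuracy (seen models)
y = seen.mean(axis=1)                                 # full-benchmark accuracy (seen models)
slope, intercept = np.polyfit(x, y, deg=1)            # the linear correction

# Predict a new model's full-benchmark score from its k subsample items only
new_model = (rng.uniform(size=n_items) < 0.6 * (1 - item_hardness)).astype(float)
predicted = slope * new_model[subset].mean() + intercept
print(f"predicted {predicted:.3f} vs actual {new_model.mean():.3f}")
```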
A nice compendium of validity issues in AI evaluation, centered on “how can we formulate claims and collect evidence that support them?”. It emphasises the context of evaluation as the main source for determining what kind of validity we need and can have.
Why is it that Big Tech always has access to the best behavioural data about real use of AI? This Nature Machine Intelligence paper challenges this monopoly and explores initiatives and sustainable ecosystems for more open data about humans interacting with AI systems.
Dynamic Risk Assessment explores the possibility of a malicious adversary iteratively improving an attacking model, e.g., by retrying with a strong verifier and fine-tuning the model. They show that, using these and other relatively simple methods, just 8 GPU-hours increased an offensive cyber-agent’s success rate on InterCode-CTF by 40%.
News
PhD Summer School on “Methods for Statistical Evaluation of AI – from PAC Bounds to Fairness”, taking place from August 25 to August 30 in Nyborg, Denmark, co-located with D3A (https://d3aconference.dk/).
The OECD has presented a new taxonomy of AI capabilities, each with five levels, which experts have used to situate the state of the art in AI, aimed at policy-makers and the general public. The approach is complementary to test-based methodologies and can be used to track the perception of AI progress in the years to come.
Contributors to this month’s digest: Nando Martínez-Plumed, John Burden, Ben Slater, Jose H. Orallo, Iñaki Dellibarda, Lorenzo Pacchiardi, Lexin Zhou, Dani Romero, Wout Schellaert, Behzad Mehrbakhsh, Peter Romero, Joseph Castellano.
News to share? Feel free to reach out to ai.evaluation.newsletter@gmail.com
Getting the digest: Once a month if you join