2025 November "AI Evaluation" Digest
Hitting a wall? Seeing is all you need
The recent presentation of Google Gemini-3 has come with the usual combination of cherry-picked benchmark results, leaderboard reshuffling and lots of anecdotal evidence from early users. Like any other model, Gemini-3 has been praised and criticised in almost equal measure. Only in the coming weeks and months will we be able to fully appreciate the capabilities and safety of this new release. However, for the moment, we have some indications that more natively multimodal models are paying off. Oriol Vinyals, who co-led the release, posted on X that “contrary to the popular belief that scaling is over [...] the team delivered a drastic jump. The delta between 2.5 and 3.0 is as big as we’ve ever seen. No walls in sight!” Looking at the self-reported results, the most significant gains over other state-of-the-art models are found in visual benchmarks, such as MMMU-Pro and ScreenSpot-Pro. This mirrors the leap observed in ARC-AGI-2—an “AGI” benchmark that claims to measure abstraction and reasoning—where the previous models’ visual limitations likely constrained their performance.
Vinyals argues that the explanation for the gains is simple: “improving pre-training & post-training”. Several benchmarks are also preparing the arena for the wider margin of progress in vision-language models (VLMs) compared to the more meagre gains in text-only or “agentic” tasks. For instance, MM-OPERA (Multi-Modal OPen-Ended Reasoning-guided Association) is a new benchmark for VLMs that explores association tasks (“what do [A] and [B] have in common?”, or “if [A] is to [B], what is [C] to [?]”, with A, B, C and ? being images). The benchmark is challenging mostly because of the multimodality; otherwise the items would be ordinary analogy riddles common in text-based IQ tests, which for humans are no harder than their visual counterparts (in tests used in psychometrics and cognitive psychology, as cited in the paper). If MM-OPERA mostly detects improvements in multimodal pretraining, MIRA (Spanish and Portuguese for “watch”, from Multimodal Imagination for Reasoning Assessment) is a benchmark that targets multimodal chain of thought (CoT), which seems the natural evolution of multimodal “post-training”. The test introduces visual cues in the Visual-CoT, and the examples in the benchmark show that it gets closer to evaluating some kind of visual reasoning, including rotations of objects, changes of perspective, etc. This benchmark may resist saturation for a while, especially as seeing and reasoning in 2D can be insufficient for problems that involve seeing and understanding in 3D, which would require video modality and further scaling.
Lesson learned? For robust evaluations of multimodal models we need to account for the modality confounders by devising problems that require equivalent cognitive processes but only change the modality. Otherwise, we may end up saying that, e.g., AI does not reason – about what it cannot see.
Findings and results
Henley Wing Chiu analysed 180 million global job postings (2023–2025) to gauge the impact of AI on the job market. The study finds an overall 8% drop in openings in 2025 versus 2024. Creative-execution jobs (photographers, writers, graphic artists) and some compliance roles show the steepest declines. AI-related technical roles, especially ML engineers, are the fastest-growing, with demand rising over 40% year-over-year. Jobs thought to be heavily exposed to AI (like customer support) have not collapsed, while high-level strategic roles hold up better than lower-level execution roles.
This preprint shows that modern language models can intentionally underperform on safety-critical multiple-choice evaluations while producing reasoning traces that fool chain-of-thought monitors. Larger models are better at navigating the trade-off between hiding their intentions and still sabotaging effectively, though none do so reliably. The authors uncover a surprisingly rich collection of covert behaviours that emerge without the models being taught explicit attack strategies.
And talking about models not saying what they truly think, this preprint examines whether reasoning-tuned LLMs can actually explain their thought processes, introducing evaluations for self-awareness, latent policy generalisation and consistency between hidden thoughts and final answers. The punchline? RL-style post-training (especially GRPO) makes models more self-aware and better at transferring behaviours. However, it also makes them more likely to diverge silently between their thoughts and their actions. These are new red flags for CoT-based safety monitoring.
Big survey coming out of the EvalEval coalition. They consider “social impact assessments covering bias, fairness, privacy, environmental costs, and labor practices” and analyse first-party and third-party evaluation reporting. They find that first-party reporting is sparse and has declined over time in some areas, with the larger burden of social impact evaluation carried by third-party evaluations (academics, nonprofits, independent organisations) that provide more rigorous coverage. Still, the authors stress that this is not ideal, as “only model developers can authoritatively report on data provenance, content moderation labor, financial costs, and training infrastructure, yet interviews reveal that these disclosures are often deprioritized”.
For an example of a first-party evaluation, Anthropic’s Product & Societal Impacts team introduces a “Paired Prompts” benchmark to measure political even‑handedness in Claude across 1,350 pairs of opposing political requests. Using Claude itself as an evaluator, they find Claude 4.1/4.5 roughly match or trail Gemini and Grok but significantly outperform GPT‑5 and Llama 4 on even‑handedness, while deliberately keeping higher refusal rates on more extreme or persuasion‑oriented content. The study is US‑centric and relies heavily on LLM‑as‑judge, but it’s one of the clearest public articulations of how “politically fair” behaviour in a general‑purpose model can be operationalised.
A good example of the rift between development and deployment is this paper, which presents a comprehensive safety evaluation of LLMs used in robotic contexts, focusing on how these models handle instructions involving sensitive personal attributes. Results? Leading LLMs consistently exhibit direct discrimination (e.g., favouring “able-bodied” or “European” individuals over others) and fail critical safety checks by approving violent or unlawful commands. So, while models might be immunised against harmful speech, bias may still persist in behaviour, which would be even more problematic.
Since we are all big fans of scaling laws, why not some for evaluation awareness? This preprint does exactly that, analysing the evaluation awareness of open-weights LLMs via linear probing across different model sizes in four families. The results are unsurprising: larger models are usually more aware that they are being evaluated. Nonetheless, awareness follows a nice power law with scale, and the results hold across all four families, putting size (rather than other components such as architecture or training approach) as the most relevant factor at play.
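The power-law claim is easy to picture: a relation of the form awareness ≈ a·size^b becomes a straight line in log-log space. A minimal sketch, with purely illustrative probe accuracies and model sizes (not the paper's data):

```python
import numpy as np

# Hypothetical linear-probe accuracies for evaluation awareness at several
# model sizes (billions of parameters); illustrative numbers only.
sizes = np.array([1.0, 7.0, 13.0, 70.0])        # model size in B parameters
awareness = np.array([0.55, 0.63, 0.68, 0.79])  # probe accuracy

# A power law awareness ~ a * size^b is linear in log-log space,
# so ordinary least squares on the logs recovers the exponent.
b, log_a = np.polyfit(np.log(sizes), np.log(awareness), 1)
a = np.exp(log_a)

print(f"exponent b = {b:.3f}, prefactor a = {a:.3f}")
```

A positive exponent here would be the "larger models are more aware" finding in a single number.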
This survey reveals (again) that LMM evaluation is mostly a messy patchwork of narrow benchmarks and leaderboard worship, telling us more about who’s good at test-taking than who’s actually useful. Read it for a reminder that the field is stacking giant models on top of measurement tools held together with duct tape, guesswork, and vibes :).
Epoch AI ran a factor analysis on AI benchmarks and found the expected dominant factor capturing general capability. But they also discovered a second factor explaining real variance beyond noise. This factor corresponded to being “good at agentic tasks but bad at vision and math”—and Claude models consistently scored highest on it, earning it the name “Claudiness” vector. This suggests support for the “contingent view”: that models can develop along multiple orthogonal dimensions of ability, rather than varying along a single axis of intelligence, as shown in some previous factor-analysis papers on LLM and benchmark populations. The dominant factor we observe may simply reflect optimisation pressure pushing models down similar paths, while Claude’s distinctiveness shows that different training approaches can produce genuinely different capability profiles.
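For readers who have not run this kind of analysis: the basic move is to standardise a models-by-benchmarks score matrix and look at how variance splits across components. A toy sketch (using PCA via SVD as a simple stand-in for factor analysis; the scores are made up, not Epoch AI's data):

```python
import numpy as np

# Toy models x benchmarks score matrix (rows: models, cols: benchmarks);
# purely illustrative values.
scores = np.array([
    [0.9, 0.8, 0.7, 0.6],
    [0.7, 0.9, 0.6, 0.8],
    [0.6, 0.5, 0.9, 0.4],
    [0.8, 0.7, 0.8, 0.7],
    [0.5, 0.6, 0.4, 0.9],
])

# Standardise each benchmark, then decompose.
Z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)
print("variance explained per factor:", np.round(explained, 2))
# The first component plays the role of the dominant general-capability
# factor; a sizeable second component is the kind of signal that, on real
# data, earned the "Claudiness" label.
```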
Do Retrieval-Augmented Language Models (RALMs) Know When They Don’t Know? This preprint shows that purely negative retrieval contexts significantly damage model calibration and lead to excessive refusals. The authors also examine how refusal behaviour relates to calibration quality, finding that refusal-aware RALMs perform poorly across different RAG configurations.
WorldTest is a framework that uses AutumnBench (43 interactive grid-world puzzles) to test whether AI models can build world models of how environments work. They consider three facets of world modelling: next-frame prediction, causal change detection, and planning. Results? Humans performed much better than the three frontier models tested. The models also persisted in wrong hypotheses, barely using the “reset” button during exploration. At least they share one thing with us: stubbornness.
Not everything is about top-notch performance: what about the resources spent? Across recent studies, researchers converge on the need to assess AI models not only by accuracy but also by energy and carbon cost. Saad Falcon et al. propose Intelligence per Watt to quantify how much useful work local LLMs deliver per unit of power, showing a five-fold gain in two years and strong benefits from hybrid local–cloud routing. Mehditabar et al. extend this perspective to code generation with BRACE, revealing that some of the most accurate code models are unexpectedly energy-inefficient and that smaller or quantised models often fare better once energy is considered. Jeanquartier et al. provide a complementary carbon-accounting framework, demonstrating how emissions shift with hardware, model size, and usage patterns, and urging that energy/CO₂ reporting become standard practice. Islam et al. bring these ideas to the edge, benchmarking small LMs on devices like Raspberry Pi and Jetson Orin to show trade-offs between latency, energy, and accuracy. Together, these works illustrate a maturing ecosystem of metrics and benchmarks that emphasise sustainability.
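The spirit of these metrics is simple enough to sketch: useful work divided by power drawn. The function name and numbers below are illustrative assumptions, not the papers' actual definitions:

```python
# A minimal sketch of an "intelligence per watt"-style metric: useful work
# (here, benchmark accuracy) per watt of average power. Hypothetical values.

def intelligence_per_watt(accuracy: float, avg_power_watts: float) -> float:
    """Accuracy points delivered per watt of average power draw."""
    if avg_power_watts <= 0:
        raise ValueError("power must be positive")
    return accuracy / avg_power_watts

# A weaker local model on modest hardware vs a stronger model on a big GPU:
local = intelligence_per_watt(accuracy=0.62, avg_power_watts=45.0)
cloud = intelligence_per_watt(accuracy=0.80, avg_power_watts=350.0)
print(local > cloud)  # the less accurate model can still win per watt
```

This is exactly the kind of trade-off the BRACE and edge-device results surface: once energy enters the denominator, leaderboard order can flip.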
Benchmarks
Do language models truly uphold a coherent value system? ValBench tests whether LLMs maintain a stable value stance across 115K pairs of opposing sides in controversial topics obtained from Wikipedia, even taking into account refusals and no-information responses. Claude models hold firm (even though some of them just refuse a ton, getting good scores), while others like the GPT family always answer but change sides more easily. As Groucho Marx put it: if you don’t like my principles, well… I have others.
Liar’s Bench collects 72K examples of LLM lies and honest responses generated by open-weight models and proposes to use them to evaluate “lie detectors”, both white- and black-box. The benchmark is structured according to the model’s reason for lying (what kind of pressure or encouragement) and the object of belief the lie targets. Incorrect answers are categorised as lies only if the model responds correctly in a pressure-free setting.
Following obvious benchmark names, Zhao et al. introduce MUBENCH, a comprehensive benchmark for evaluating machine unlearning (MU) methods on three axes: safety, over-safety, and general utility. Across seven MU methods and three aligned LLMs, they find a consistent trilemma: improving safety via MU almost always causes exaggerated refusal of benign content and noticeable drops in general performance.
ConsintBench tests how well LLMs understand real-world consumer intent in messy, multi-user discussions. It evaluates 20 models on over 200k product-level comments along four dimensions (depth, breadth, correctness, and informativeness). Results? Best LLMs still struggle with genuinely deep intent understanding.
A new preprint proposes a benchmark of “Long-Term Memory” in LLMs, presenting conversations of up to 10M tokens with associated “memory ability” questions. No models surpass 30% overall accuracy on the longest conversations, so this benchmark still has a bit of road to run.
LoCoBench-Agent turns 8,000 software engineering scenarios into realistic, multi-turn, tool-using, long-context environments and provides a comprehensive set of nine comprehension and efficiency metrics. It shows that smart context management beats sheer context size.
Herambourg et al. introduce ORCA, a benchmark focused on everyday numerical tasks (e.g., reading pay slips, receipts, and budgets) rather than contest math. Evaluating a range of popular LLMs, they show that raw calculation accuracy is surprisingly poor in many real-world scenarios, and that tool use helps but doesn’t fix everything.
MFAVA is a 30-language hallucination-detection benchmark, used to train multilingual detectors that estimate hallucination rates for eleven open-source LLMs answering long-form questions. They find that average hallucination rates hover around 7–12% of tokens, that smaller models hallucinate more, and that models claiming broader language coverage tend to hallucinate more as well.
MMBench is a 200-task continuous-control benchmark, introduced alongside Newt, a language-conditioned world model that combines online RL with demonstrations to outperform PPO/TD3-style baselines and transfer reasonably to new tasks. Performance still lags in some more challenging domains (e.g. Atari, Box2D and MuJoCo) and remains fully simulator-bound.
And we wrap up the benchmark bonanza with MedR-Bench, a comprehensive benchmark of 1,453 structured clinical cases designed to evaluate LLMs across three stages of care: examination recommendation, diagnostic decision-making, and treatment planning. They score model outputs on efficiency, factuality, and completeness by cross-referencing external medical resources. While current models achieve over 85% accuracy on static diagnostic tasks, their performance drops significantly on dynamic workflows like treatment planning, where they tend to produce factual but critically incomplete reasoning, especially in iterative diagnosis-treatment cases.
Methodologies and Resources
It’s important (imperative!) to ascertain construct validity in AI benchmarks: this paper reviews 445 LLM benchmarks and shows recurring issues in how phenomena, tasks, and scoring approaches are defined—issues that can weaken the claims researchers draw from benchmark results. A much-needed meta-review, notable for its coverage and quantity of benchmarks, pointing in a constructive direction for improving how we evaluate AI.
This NeurIPS ‘25 paper recasts multi-LLM routing as a multiple-source domain adaptation problem and proposes a principled algorithm that learns to send each input to the best specialist model, even when the real-world task mix is unknown. It comes with strong regret guarantees showing the router essentially matches the best expert across any mixture of benchmark domains, backed by initial experiments on MixInstruct.
Li et al. propose RIDE, a framework that uses Item Response Theory to generate and calibrate difficulty-controlled perturbations of math problems, turning static benchmarks into graded stress tests. Across several LLMs, they show that seemingly tiny changes in a math question can cause large performance drops, revealing brittle reasoning that isn’t obvious from original benchmarks alone.
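The IRT machinery behind this kind of calibration is compact. A sketch using the standard two-parameter-logistic (2PL) model; the parameter values are illustrative, not RIDE's calibrated ones:

```python
import math

# 2PL IRT: probability that a model of ability theta answers an item of
# difficulty b correctly, with discrimination a. RIDE-style perturbations
# would shift b for the same underlying problem.

def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# One model, progressively harder perturbations of a single math item:
for b in (-1.0, 0.0, 1.0, 2.0):
    print(f"difficulty {b:+.1f}: P(correct) = {p_correct(1.0, 1.5, b):.2f}")
```

The steep drop as b rises past theta is what a graded stress test exploits: small, calibrated difficulty shifts expose how brittle a model's apparent competence is.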
A “Scaling Environments for Agents” NeurIPS ’25 workshop paper argues that training environments should be “active producers of experiential data”, surveying learning environments which Generate tasks the agent Executes and receives Feedback on (what they call the GEF loop). This comes close to environments for traditional reinforcement learning, but with LLM-agent specifics (e.g., the focus on tool/API calls, long-horizon tasks). It is interesting how this brings back AI development closer to evaluation. The survey highlights the Generator-Verifier Asymmetry, where easy-to-generate tasks are hard to verify, and the other way around.
BeTaL is a framework that leverages large language models (LLMs-in-the-loop) to automatically generate dynamic benchmarks. It operates by having an LLM adjust benchmark “templates”, simulate tasks, measure model performance, and iteratively refine parameters until the benchmark reaches the desired difficulty or other target properties. A key limitation is that the approach requires a simulator capable of producing tasks with ground-truth, which isn’t available for many real-world or open-ended domains.
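The iterate-until-target-difficulty loop is the core idea. A hedged sketch under stated assumptions: the simulator here is a random stand-in, not BeTaL's actual task generator or scoring API:

```python
import random

# A BeTaL-style calibration loop: nudge a template "difficulty" parameter
# until a (simulated) model's pass rate lands in a target band.
random.seed(0)

def simulated_pass_rate(difficulty: float, trials: int = 2000) -> float:
    # Stand-in for "generate tasks from the template, run the model,
    # score against ground truth". Here pass probability is 1 - difficulty.
    wins = sum(random.random() < max(0.0, 1.0 - difficulty) for _ in range(trials))
    return wins / trials

target, difficulty, step = 0.5, 0.1, 0.05
for _ in range(50):
    rate = simulated_pass_rate(difficulty)
    if abs(rate - target) < 0.02:   # close enough to the desired difficulty
        break
    difficulty += step if rate > target else -step

print(f"calibrated difficulty ~ {difficulty:.2f}, pass rate ~ {rate:.2f}")
```

The limitation the paper notes falls straight out of this sketch: without a ground-truth scorer inside `simulated_pass_rate`, the loop has nothing to steer on.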
Need a solid introduction to AI evaluation in about 15 minutes? “What a 100-year-old horse teaches us about AI” is a great animated video that covers the key elements of AI evaluation. It’s accessible, rigorous and fun!
News and Events
The jingle-jangle fallacy is a relatively well-known phenomenon in educational measurement where different evaluations often share a name while measuring distinct underlying constructs (jingle), or conversely, identical constructs get rebranded as entirely new metrics (jangle). This fallacy explains some of the misunderstandings and wheel-reinvention in AI evaluation. It unexpectedly took centre stage at a recent symposium held at Stanford University on November 14th, where speakers argued that the proliferation of overlapping constructs muddles our understanding of what AI systems actually do and what any given benchmark truly captures. The discussion turned to the need for clearer taxonomies (and perhaps fewer, better-defined constructs) to ensure that evaluations genuinely help us interpret and predict model behaviour.
Turns out Kaggle is now hosting and running large-language-model (LLM) benchmarks. With “Kaggle Benchmarks,” the platform brings verified leaderboards from major evaluation suites (like math, code, reasoning, multilingual tasks) to a communal space where anyone can run popular LLMs on them and compare results transparently.
News to share? Feel free to reach out to ai.evaluation.newsletter@gmail.com
Getting the digest: Once a month if you join.
Editor-in-Chief: Daniel Romero
Contributors to this month’s digest: Zack Tidler, Behzad Mehrbakhsh, Jose H. Orallo, Lorenzo Pacchiardi, Ben Slater, Pablo A. Moreno-Casares, Peter Romero, Fernando Martínez-Plumed, Wout Schellaert, Cèsar Ferri, Lexin Zhou, Yael Moros, Joseph Castellano.


