In this first newsletter of 2025, we have to start with DeepSeek. For V3 and R1, the reported evaluations are relatively standard: a collection of average metrics over well-known benchmarks, with our usual doubts about contamination (“contamination” is not mentioned in either paper). There are no details about the “14.8 trillion diverse and high-quality tokens” used to train V3 (the “Pile” is used for testing, while it’s likely that parts or all of it were used for training) or about the 800K chain-of-thought examples used for the cold start of R1. We don’t know whether they include material from other models (there are unverified claims that DeepSeek’s models are a partial “distillation” of OpenAI’s models). However, in the limitations section of the V3 paper the authors seem to acknowledge the well-known problems with the current evaluation paradigm: "We will explore more comprehensive and multi-dimensional model evaluation methods to prevent the tendency towards optimizing a fixed set of benchmarks during research".
Certainly, the meticulous optimisations that allowed V3 to be trained on 2048 H800 GPUs in a couple of months (2.788M GPU-hours, which over 2048 GPUs works out to roughly 57 days of wall-clock time) are remarkable, but we should note that much larger efforts must have gone into architecture exploration and ablation (cost not reported), and into training to the test (call it benchmark optimisation, research overfitting or simply contamination). But the really significant element of these papers is the unleashed power of distillation with chain-of-thought (CoT), adding to last year’s shift in the dominant LLM paradigm. The traditional scaling laws between training compute and capabilities have received very serious blows in the past few months. First, with the new reasoning families such as o1-o3, we see that inference-time reasoning is a key factor for system capabilities, beyond training compute. Second, these capabilities do not grow uniformly (no g factor anymore). Many interventions increase reasoning capabilities, and hence improve performance in some domains (e.g., mathematics), but capabilities in other areas (comprehension, expression, social skills…) do not necessarily improve with them, or even regress slightly. Capability profiles are no longer concentric onion layers that never cross each other. Third, distillation shows that some of the improvement larger models get from chain of thought (and, in the case of R1, from direct reinforcement learning over CoT) can be transferred to smaller and less capable models (such as those in the Qwen and Llama families). Of course, this comes at significant inference cost, because the outputs become longer and longer as performance goes up.
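To make this concrete, here is a minimal sketch of what CoT distillation can look like in practice: a small “student” model is fine-tuned with a plain causal-LM objective on (question, chain-of-thought, answer) traces generated by a larger reasoning model. This is our own illustration, not DeepSeek’s pipeline; the student model name, the trace format and the tags are placeholders.

```python
# Minimal sketch of CoT distillation: fine-tune a small "student" model on
# (question, chain-of-thought, answer) traces produced by a larger "teacher"
# reasoning model. Model name, tags and data below are placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

STUDENT = "Qwen/Qwen2.5-1.5B"   # hypothetical small student model
traces = [                      # in reality: hundreds of thousands of teacher traces
    {"question": "What is 17 * 24?",
     "cot": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
     "answer": "408"},
]

tok = AutoTokenizer.from_pretrained(STUDENT)
tok.pad_token = tok.eos_token

def to_text(ex):
    # The student learns to reproduce the teacher's reasoning, not just the answer.
    return {"text": f"Question: {ex['question']}\n"
                    f"<think>{ex['cot']}</think>\nAnswer: {ex['answer']}"}

def tokenize(ex):
    enc = tok(ex["text"], truncation=True, max_length=1024, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()   # standard causal-LM objective over the whole trace
    return enc

ds = (Dataset.from_list(traces)
      .map(to_text)
      .map(tokenize, remove_columns=["question", "cot", "answer", "text"]))

Trainer(
    model=AutoModelForCausalLM.from_pretrained(STUDENT),
    args=TrainingArguments(output_dir="cot-distilled-student",
                           per_device_train_batch_size=1,
                           num_train_epochs=1, report_to=[]),
    train_dataset=ds,
).train()
```

The key point is that the student is trained to reproduce the teacher’s reasoning traces rather than only its final answers, which is also why its outputs (and inference costs) grow along with its scores.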
Three lessons emerge from the perspective of evaluation. First, infer, baby, infer: we need to pay more attention to inference, to export regulations on inference hardware, and to how well chips perform on inference benchmarks (such as the MLPerf Inference Benchmark Suite). Second, test, baby, test: “comprehensive and multi-dimensional model evaluation” (without contamination) is the only reliable proxy for the capabilities of a model, beyond its reported size, compute or data. And third, distil, baby, distil: distillation has been around for a while, but it was used to produce mini/flash models “designed for lower intelligence tasks”. Now we see that capability profiles can vary significantly (up and down in different dimensions) when small models are fine-tuned to acquire the “thinking strategies” of larger CoT models.
Evaluating AI agents:
Companies are introducing alpha (“research preview”) versions of their LLM agents with access to the Internet, such as OpenAI’s Operator, but how can we test them in a replicable and meaningful way?
We can evaluate them in ecologically-valid, domain-specific scenarios that are not connected to the Internet. For instance, “the agent company” is not a spy agency but a company simulator (including internal sites such as a chat service and a GitLab instance) for evaluating AI agents on work-related tasks. They evaluated 175 tasks across different job types, and performance varies significantly depending on the task and the model; autonomously solving complex, long-horizon tasks remains challenging for current AI systems. Another paper presents an “aviary”, an extensible gymnasium and a conceptual framework based on Markov decision processes to evaluate language agents across five environments, with particular emphasis on three scientific tasks: molecular cloning, scientific literature QA, and protein engineering. In the same cluster, SWE-secret is a full MMath thesis focused on evaluating agents for software engineering tasks, developing a private dataset of 457 tasks derived from GitHub that mirrors SWE-bench’s structure while maintaining strict data secrecy, and providing a secure mechanism to allow evaluation without exposing the dataset.
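To make the “gymnasium for language agents” idea concrete, here is a minimal sketch of the kind of episode loop such frameworks standardise: the environment exposes reset/step with observations and rewards, and the agent is just a policy from observation text to action text. The interfaces and the toy environment below are our own illustration, not Aviary’s or TheAgentCompany’s actual APIs.

```python
# Illustrative agent-environment loop in the MDP style these frameworks use.
# The classes and names here are hypothetical, not the papers' actual APIs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    observation: str
    reward: float
    done: bool

class ToyLiteratureQAEnv:
    """A toy stand-in for a scientific-literature QA environment."""
    def __init__(self, question: str, answer: str):
        self.question, self.answer = question, answer
        self.turns = 0

    def reset(self) -> str:
        self.turns = 0
        return f"Task: answer the following question. {self.question}"

    def step(self, action: str) -> Step:
        self.turns += 1
        correct = self.answer.lower() in action.lower()
        done = correct or self.turns >= 5           # fixed horizon
        return Step(observation="correct" if correct else "try again",
                    reward=1.0 if correct else 0.0, done=done)

def run_episode(env, agent: Callable[[str], str]) -> float:
    """Roll out one episode and return the total reward."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        action = agent(obs)          # in practice: an LLM call, possibly with tool use
        step = env.step(action)
        obs, total, done = step.observation, total + step.reward, step.done
    return total

if __name__ == "__main__":
    env = ToyLiteratureQAEnv("Which fluorescent protein is engineered in the paper?", "GFP")
    dummy_agent = lambda obs: "I think the answer is GFP."
    print("episode reward:", run_episode(env, dummy_agent))
```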
We can also evaluate agents more generally and automatically, but some information is lost in the simplification. For instance, HAL (Holistic Agent Leaderboard) puts the emphasis on cost, standardisation and cross-comparison, in a similar way to traditional benchmark platforms such as HELM. The benchmarks included in HAL typically have a ground truth, even when the agents have to use external tools or do and find things on the Internet, as in GAIA.
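Since agent runs vary wildly in token usage, cost-aware reporting is a large part of what platforms like HAL standardise. Here is a small illustration of the idea (our own sketch with made-up numbers, not HAL’s code): report accuracy together with dollar cost per task and flag which agents are Pareto-optimal.

```python
# Toy cost-aware comparison in the spirit of HAL: report accuracy together with
# average dollar cost per task and flag Pareto-optimal agents. Numbers are made up.
results = [
    {"agent": "agent-A", "accuracy": 0.62, "cost_usd_per_task": 1.80},
    {"agent": "agent-B", "accuracy": 0.58, "cost_usd_per_task": 0.35},
    {"agent": "agent-C", "accuracy": 0.44, "cost_usd_per_task": 0.40},
]

def pareto_optimal(row, all_rows):
    # An agent is dominated if another is at least as accurate and at most as
    # costly, and strictly better on at least one of the two dimensions.
    return not any(
        other is not row
        and other["accuracy"] >= row["accuracy"]
        and other["cost_usd_per_task"] <= row["cost_usd_per_task"]
        and (other["accuracy"] > row["accuracy"]
             or other["cost_usd_per_task"] < row["cost_usd_per_task"])
        for other in all_rows
    )

for r in sorted(results, key=lambda r: -r["accuracy"]):
    flag = "pareto" if pareto_optimal(r, results) else "dominated"
    print(f"{r['agent']}: acc={r['accuracy']:.2f}, "
          f"cost=${r['cost_usd_per_task']:.2f}/task [{flag}]")
```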
We are in the early stages of LLM agent evaluation, hopefully early enough to introduce better practices and methodologies that ensure validity and reliability, rather than repeating the low standards of current benchmark collections.
Super-benchmarkers!
Which “authoritative” benchmarks and leaderboards lead the pack? MT-Bench, Chatbot Arena, MBPP, MMLU and HHH. And who are the super-benchmarkers driving the evaluation landscape? Sam Bowman, Dan Hendrycks and Yejin Choi on the podium! All of this is calculated using citations per month, GitHub stars, number of samples and number of authors. Fun, code and insights in this post!
Best practices on AI evaluation from the EU and US AISI:
The second version of the EU General-Purpose AI Code of Practice is on the right track in distinguishing “sources” (i.e., factors) of risk in safety evaluations, separating capabilities, propensities and context, and it contains several innovations in its narrative about evaluation. Some responses suggest that this distinction could be refined even further in the next versions of the code. The third version is expected to be presented at the AI Action Summit in Paris on February 10-11.
The US AISI released the second public draft of its guidelines on managing misuse risk, which includes detailed best practices on AI evaluation. (It’s open for public comment until March 15.)
Problems with AI evaluation:
Chatbot Arena is less reliable than we thought, as the voting can be manipulated.
“Red teaming is not safety benchmarking” is one of the eight main conclusions from Microsoft’s experience red-teaming generative AI over the past few years.
Misalignments between research focus and practical needs: a preprint conducted a meta-review of NLP/LLM evaluation research to understand the mismatch with what practitioners discuss on community forums (e.g., StackOverflow). It finds, for instance, that fairness and robustness matter significantly more to researchers than to practitioners, while the opposite holds for efficiency, syntactic correctness and factual correctness. However, the study's methodology has some limitations, particularly its use of paper counts as a proxy for research effort and its selection of a small and biased subset of the literature (only papers with the word "software" in the abstract).
Sometimes ethical evaluations can backfire. This paper demonstrates that fairness testing techniques and regulatory goals that are not well thought out can end up being discriminatory when deployed. Maybe this paper gives some ideas?
Big problems (scandals) in AI evaluation:
The Berkeley MATH dataset has been taken down from Hugging Face because a large portion of its questions strongly resemble those of Alcumus, a mathematics learning platform by AoPS Incorporated, which seems to violate that company's intellectual property.
FrontierMath: it has emerged that this challenging mathematics benchmark, developed by Epoch AI and used by OpenAI in its presentation of o3, was funded by OpenAI, and that Epoch AI did not make this public because a non-disclosure agreement covered the very existence of the contract.
Humanity's Last Exam:
With a name that we hope isn’t an omen, the first version of this benchmark of extra-hard questions has been released; it will likely become a staple against which new models are tested with great fanfare. The dataset is well curated: each candidate question is first given to five state-of-the-art LLMs with automated scoring (all of them have to fail), and only then is it reviewed by human experts in two rounds. This introduces a strong bias in the kinds of questions that get included, favouring those that can be phrased as (many) multiple-choice options or that yield a number, a word or a very clear-cut answer. And, of course, the process is adversarial: selecting questions that current models fail in order to evaluate the next generation of models, a likely divergent process we have criticised in previous newsletters. But submitting a question is great fun!
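For readers curious about what that filtering step amounts to, here is a minimal sketch of our reading of it (not the authors’ code): a candidate question only advances to human expert review if every frontier model in the panel fails it under automated scoring. The exact-match scorer also makes the selection bias above visible: anything without a crisp, checkable answer never survives.

```python
# Sketch of an HLE-style adversarial filter: a candidate question advances to
# human expert review only if all frontier models in a panel fail it.
# `ask_model` is a placeholder for real API calls to each model.
from typing import Callable, Dict, List

PANEL = ["model-1", "model-2", "model-3", "model-4", "model-5"]   # five frontier LLMs

def automated_score(response: str, gold: str) -> bool:
    # Exact match only works for questions with a clear-cut answer,
    # which is exactly the bias discussed above.
    return response.strip().lower() == gold.strip().lower()

def all_models_fail(question: str, gold: str,
                    ask_model: Callable[[str, str], str]) -> bool:
    return all(not automated_score(ask_model(m, question), gold) for m in PANEL)

def filter_candidates(candidates: List[Dict[str, str]],
                      ask_model: Callable[[str, str], str]) -> List[Dict[str, str]]:
    # Surviving questions would go on to two rounds of human expert review.
    return [c for c in candidates
            if all_models_fail(c["question"], c["answer"], ask_model)]

if __name__ == "__main__":
    fake_ask = lambda model, question: "42"    # stand-in for real model calls
    pool = [{"question": "A very hard question?", "answer": "17"},
            {"question": "An easy question?", "answer": "42"}]
    print(filter_candidates(pool, fake_ask))   # only the question every model fails survives
```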
Methodology and insights:
EQUATOR: a framework for evaluations that go beyond multiple-choice questions to more open-ended situations, in the context of reasoning.
How many times have you prompted “do not include an elephant in the image” and got a flying Dumbo? NegBench tests precisely this: whether vision-language models understand negation. Spoiler: they don’t.
We knew that instructable language models (especially with RLHF) are compliant, but are they also conformist? This paper analyses this, with a benchmark, of course. Guess the name: A) ConformBench, B) BenchForm, C) DoAsWeDoBench, D) FollowMeBench.
More evidence of an expectation gap between what LLMs know about what they know and what people think they know, in line with papers covered in previous newsletters.
Analysing generalisation between inductive, abductive and deductive inference gives insights into the connections between these three reasoning processes in LLMs.
We know LLMs have problems with arithmetic (beyond simple numbers), but can they at least spot the errors? This paper asks exactly that. Spoiler: they can’t.
Yet another paper on function approximation capabilities, now with a Bayesian perspective.
Oh, we love this one: a benchmark suite of room escape games. EscapeBench is the name (an example of thinking outside the box).
Will you remember us in a couple of years? This paper analyses the factors and ideas that can make an AI challenge’s legacy more impactful in the long term.
Correction:
In last month's edition, we included LMUnit (paper, blog), a method for automatically generating rationales in addition to scores for LLM-as-a-Judge procedures, mentioning there is a model available behind a paid API, whereas the API is actually free. Our apologies to ContextualAI.
Contributors to this month’s digest: Wout Schellaert, Jose H. Orallo, Nando Martínez-Plumed, Lexin Zhou, Lorenzo Pacchiardi, Joseph Castellano.
News to share? Feel free to reach out to wschell@vrain.upv.es.
Getting the digest: Once a month if you join: