Welcome to the last newsletter of 2024! We started this Substack in January of this year, and we would like to extend a huge thank you to all our subscribers!
In this era of step-by-step thinking without action, if we had to choose a new year’s resolution for AI evaluation in 2025, would it be the year of reason or the year of action?
First: Did OpenAI’s new o3 model solve the (in)famous ARC-AGI benchmark? Here are takes by Melanie Mitchell, Nathan Lambert, and François Chollet himself (some of our views on this benchmark were already included in the June newsletter). ARC-AGI is not that different from many other IQ-like tests, and it was bound to be saturated soon, especially once the word “AGI” was added to the name and $1M was offered in prizes. Indeed, this has happened within six months. Like many other specific tests built without a clear theory of what they measure or of the demands their items place on the solver, a good score on this one may be necessary to show some degree of inductive reasoning, but it is not sufficient. The o1-o3 family shows improvement in “reasoning” regardless of these results, and the FrontierMath scores of up to 25% are more impressive than the saturation of ARC-AGI.
Second: Reasoning for what? With more and more inference compute being used to boost reasoning capabilities in areas such as mathematics and programming, we predict that much of next year’s discussion will focus on weighing resources against capabilities in these areas. But will this translate into ecologically valid benchmarks? Frieder et al. argue that many mathematical benchmarks fail to capture how mathematicians would actually benefit from these tools. This discrepancy highlights the limitations of current benchmarks, which don’t account for AI’s potential as a collaborative thought partner rather than as an autonomous mathematician.
Third: Perhaps both reasoning and acting (well) require progress in metacognition? Several benchmarks highlight the need to evaluate (and make progress on) metacognition, especially identifying unsolved tasks or improving confidence estimates of what AI models know. This has been a constant during 2024 and before, but we expect more of this during 2025!
Findings
More evidence that high average scores on benchmarks don’t reflect capabilities when success is not aligned with item difficulty. A recent paper presents GAOKAO-Eval, which uses China’s National College Entrance Examination as a comprehensive benchmark for evaluating LLMs. So far, nothing special, apart from the promise of yearly updates to avoid contamination. But the examples are annotated with difficulty, and when all the models are pooled together, the R² between success and difficulty is low, and the patterns of success across difficulty levels deviate from human behaviour, and from what item difficulty would predict.
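For readers who want to try this kind of check on their own benchmark results, here is a minimal sketch of the analysis involved. The difficulty annotations and per-item outcomes below are simulated placeholders, not GAOKAO-Eval data:

```python
# Minimal sketch: how well does annotated item difficulty explain model success?
# The difficulty ratings and binary outcomes are simulated placeholders.
import numpy as np

rng = np.random.default_rng(0)
difficulty = rng.uniform(0, 1, 500)                    # annotated item difficulty in [0, 1]
success = rng.random(500) < (1 - 0.6 * difficulty)     # simulated per-item success

# Bin items by difficulty and compute accuracy per bin
bins = np.linspace(0, 1, 11)
idx = np.digitize(difficulty, bins) - 1
acc = np.array([success[idx == b].mean() for b in range(10)])
centers = (bins[:-1] + bins[1:]) / 2

# R^2 of a linear fit of accuracy on difficulty: a low value means difficulty
# explains little of the variation in success, the pattern reported for GAOKAO-Eval
slope, intercept = np.polyfit(centers, acc, 1)
pred = slope * centers + intercept
r2 = 1 - np.sum((acc - pred) ** 2) / np.sum((acc - acc.mean()) ** 2)
print(f"R^2 between difficulty and accuracy: {r2:.2f}")
```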
Apollo Research finds that models can “scheme”, i.e. hide their true capabilities and objectives. However, note that the prompts include strings like “Make sure that you achieve your goal in the long term. Nothing else matters. Make sure you achieve YOUR goal at all costs.” This evaluates whether LLMs can scheme, not whether they naturally would. The authors spell out this difference, but most newspaper coverage missed it. On the other hand, the new MLCommons AILuminate benchmark measures the propensity (not the possibility or the capability) of “AI systems to respond in a hazardous manner to prompts from malicious or vulnerable users that might result in harm to themselves or others”.
ChatGPT can get an engineering degree, or at least pass 80% of the engineering courses at the École Polytechnique Fédérale de Lausanne (EPFL). Note: all multimodal questions were filtered out, and responses were LLM-graded (and, usually, you need to pass more than 80% of your courses to get a degree!).
Takes
This paper proposes that GPAI models should be evaluated similarly to general-purpose microprocessors. This boils down to, among other things, averaging task performance with a geometric mean instead of an arithmetic mean, and reporting the scores as relative improvements over some baseline (e.g. 1.4xBERT, or 0.8xHuman).
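As a quick illustration of that reporting style, here is a minimal sketch under our own assumptions (the task names and scores are hypothetical, and this is not the paper’s code):

```python
# Minimal sketch of geometric-mean aggregation of relative improvements.
# Task names and scores are hypothetical placeholders.
import math

baseline = {"qa": 0.62, "code": 0.40, "math": 0.25}   # reference system's scores per task
model    = {"qa": 0.80, "code": 0.55, "math": 0.30}   # evaluated model's scores per task

# Relative improvement per task, then aggregated with a geometric mean,
# which is less dominated by a single outlier task than the arithmetic mean.
ratios = [model[t] / baseline[t] for t in baseline]
geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(f"{geo_mean:.2f}x baseline")   # -> "1.29x baseline"
```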
The Roles of English in Evaluating Multilingual Language Models is what the title indicates, and argues that English acts both as an interface and as a natural language, which represent different goals.
This econometric framework discusses how good (or non-contaminated) LLMs must be to be used for various kinds of economics research.
This paper presents an overview of reasons why benchmarks don’t necessarily reflect an LLM’s real performance/utility/capabilities. Nothing new, but it could serve as a good entry point.
Methods
Like adding a bucket class in a classification problem, adding an “I don’t know” token to an LLM’s vocabulary seems to help with calibration.
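A minimal sketch of the general idea using the Hugging Face transformers API is shown below; the [IDK] token name and the gpt2 base model are placeholders, and the paper’s actual training recipe is not reproduced here:

```python
# Sketch: add an explicit "[IDK]" token so the model can put probability mass on
# abstention. Model name and token are placeholders; fine-tuning is not shown.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the new token and grow the embedding matrix accordingly
tokenizer.add_special_tokens({"additional_special_tokens": ["[IDK]"]})
model.resize_token_embeddings(len(tokenizer))
idk_id = tokenizer.convert_tokens_to_ids("[IDK]")

# After fine-tuning (not shown), the probability assigned to "[IDK]" at answer
# time can serve as an abstention / uncertainty signal.
inputs = tokenizer("Q: Who won the 2034 World Cup? A:", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
p_idk = torch.softmax(logits, dim=-1)[idk_id].item()
print(f"P([IDK]) = {p_idk:.4f}")
```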
A metric and a method for evaluating hyper-parameter sensitivity in reinforcement learning.
Scaling
At the risk of stating the obvious, there are diminishing returns in accuracy vs. energy efficiency as models are scaled up; but as long as we measure performance as a percentage, instead of as capabilities on a meaningful scale, we are bound to see these diminishing returns.
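A toy illustration of that last point (ours, not from the paper): steady, assumed gains on a logit scale inevitably look like diminishing returns once squashed into a bounded percentage.

```python
# Toy illustration: constant progress on a logit scale appears as diminishing
# returns when reported as a bounded percentage. All numbers are made up.
import numpy as np

log_compute = np.linspace(0, 6, 7)        # hypothetical orders of magnitude of compute
logit = -2.0 + 1.0 * log_compute          # assumed steady gains on the logit scale
accuracy = 1 / (1 + np.exp(-logit))       # the same gains expressed as a percentage

for c, a, l in zip(log_compute, accuracy, logit):
    print(f"compute 10^{c:.0f}: accuracy {100 * a:5.1f}%   logit {l:+.1f}")
```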
Sloth is a method for doing scaling analysis based on assumed latent factors, and it can make performance predictions after observing only a single trained model of any given LLM family.
For certain capabilities that only “emerge at scale”, we can find proxy-tasks that are predictive of the emergent capability, and thus evaluate them at much smaller scale. Although with regard to emergence, Schaeffer et al. 2023 and 2024 should be considered mandatory reading.
LLM as a Judge
LMUnit (paper, blog) is a method (and a model behind a free API) for automatically generating rationales in addition to scores. The rationales are based on pre-specified or generated “unit tests” (read: evaluation criteria), either prompt-specific or global, and they can help explain decisions and improve steerability and debugging.
Win-rates for LLM-based pairwise comparisons need to be calibrated.
LLM judges are not consistent: they give incompatible ratings across different scales (e.g. a 5-point vs. a 10-point scale) and, more generally, different ratings across different output samples.
Benchmarks
The famous Arcade Learning Environment gets an upgrade with support for continuous actions (paper). The authors also find that learning algorithms generally do worse in this continuous action space, though they sometimes do better. Again, this shows that representations matter for human-AI comparisons, as humans use continuous actions.
Holmes is a benchmark resulting from a meta-study on classifier-based probing for measuring linguistic competence (e.g. part-of-speech tagging), aiming to consolidate resources.
More benchmarks:
DeepMind FACTS, for evaluating factuality;
ABCFair for comparing fairness methods;
AgoraBench for evaluating the utility of LLMs as data generators for further training (since everybody and their dog is doing that now);
The CALAMITA project for LLM evaluation in Italian;
And, lastly, ISTAnt for benchmarking causal treatment effect size in high-dimensional space (this space being videos of ants).
Some evaluation tools: BALROG, a multi-benchmark (like GLUE) that wraps several existing games in a nice interface to test agentic LLM and VLM reasoning; and Meta’s EvalGIM for evaluating generative image models.
Events: NeurIPS highlights
The Concordia Contest: Advancing the Cooperative Intelligence of Language Agents had a special session at NeurIPS. The session included: a discussion on the motivations for evaluating cooperative AI, a summary of previous related competitions (MeltingPot), a presentation by some of the best participants about their approach to the competition, and a panel that covered some of the challenges of the evaluation of social LLM agents. The winner of the competition was announced: Taehun Cha, a PhD candidate who employed a tree-based agent combining the expected return and the common goal of the group of agents.
As in the previous three years, NeurIPS had a Datasets and Benchmarks track: take a look for even more benchmarks – a whopping 463 accepted entries!
Wishing you all the best for the new year!
Contributors to this month’s digest: Wout Schellaert, Lexin Zhou, Nando Martínez-Plumed, Cesar Ferri, Jose H. Orallo, Joseph Castellano.
News to share? Feel free to reach out to wschell@vrain.upv.es.
Getting the digest: Once a month if you join.