We are big fans of Item Response Theory (IRT), a well-established area of psychometrics. Some of the editors of this newsletter have advocated its use for AI evaluation for over a decade, and have adapted it to several scenarios. IRT models estimate the difficulty of the items (test instances or benchmark examples) and the ability of the subjects (humans or AI systems) from experimental data over a population of items and subjects. One of the things we realised over the years is that, unlike with human samples, fixing a population of AI systems is very arbitrary, as AI is evolving too quickly for any sample to be stable or representative. Instead, extracting difficulty in other ways, and then plugging it back into IRT-like logistic models, seems like a more robust approach for estimating abilities and predicting performance on new items.
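For readers less familiar with IRT, here is a minimal sketch of the core idea (our own illustration, not the model used in any of the papers below): the probability that a subject solves an item is a logistic function of ability minus difficulty, and both sets of parameters are estimated jointly from a matrix of correct/incorrect responses.

```python
import numpy as np

def fit_1pl(responses, n_iter=2000, lr=0.05):
    """Fit a 1-parameter (Rasch-style) IRT model by gradient ascent.
    responses: binary matrix (n_subjects x n_items) of correct/incorrect answers."""
    n_subj, n_items = responses.shape
    ability = np.zeros(n_subj)      # one ability per subject (human or AI system)
    difficulty = np.zeros(n_items)  # one difficulty per item (benchmark example)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
        err = responses - p                     # gradient of the Bernoulli log-likelihood
        ability += lr * err.sum(axis=1) / n_items
        difficulty -= lr * err.sum(axis=0) / n_subj
        difficulty -= difficulty.mean()         # fix the scale (identifiability)
    return ability, difficulty

# Toy usage: 5 "systems" answering 40 items of varying hardness.
rng = np.random.default_rng(0)
true_ab, true_diff = rng.normal(size=5), rng.normal(size=40)
prob = 1 / (1 + np.exp(-(true_ab[:, None] - true_diff[None, :])))
resp = (rng.random((5, 40)) < prob).astype(float)
ability, difficulty = fit_1pl(resp)
```

With AI systems, the catch discussed above is that the rows of this matrix (the population of subjects) keep changing, which is what motivates estimating difficulty independently of any fixed population.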
This perspective is finally blooming in AI evaluation, as a series of papers we cover in this issue shows. The first initiative, ADeLe, includes several methodological innovations for AI evaluation, automatically annotating difficulty using rubrics that represent multiple independent scales (we note that we are the authors of this paper!). The second paper maps difficulty to the time humans take to complete a task, with “human hours” becoming the measurement unit. Finally, a third paper estimates difficulty from IRT and then builds prediction models for it (this one doesn’t fully escape the population dependency, though).
Is this a shift of the AI evaluation paradigm? Look at the extended coverage of these papers below. But first – some important events!
Events and Initiatives
A “studio” on Measuring AI in the Real World took place at the Santa Fe Institute from March 12 to 14, 2025, bringing together about 50 participants from a range of disciplines to explore the challenges AI evaluation is facing and the new directions needed to make it fit for purpose. By purpose, the workshop emphasised evaluating the impact of AI in the real world, not only the capabilities of AI in vitro, measuring what matters for and from all stakeholders. Among the take-aways were the need to establish a community on AI evaluation and to hold follow-up workshops. To find out more about the workshop, feel free to email the organisers; to join the community mailing list, request to be added to the Google group here.
Co-located with ACL, the fourth edition of the Generation, Evaluation & Metrics (GEM) workshop is not only accepting submissions on the evaluation of LLMs and language generation systems, but is also a longer-term initiative introducing datasets and good practices (data cards, model cards, reproducibility), as well as tutorials. This year’s edition, referred to as GEM^2, features two large datasets of model predictions with prompts and gold-standard references, covering instance and model variations: DOVE (Dataset of Variation Evaluation, with variations focusing on the examples; contributions are open!) and DataDecide (taking tasks from OLMES and evaluating many model variations). Workshop deadlines: direct submissions April 11, ARR May 5.
Scale AI has partnered with the U.S. AI Safety Institute (AISI) to become the first independent evaluator for frontier AI models, aiming to establish standardised, voluntary pre-deployment testing. Through Scale’s Safety, Evaluation, and Alignment Lab (SEAL), the collaboration will develop advanced evaluation methods in areas like maths, reasoning and coding. These tests will allow AI developers to assess their models efficiently and choose whether to share results with AISI and a global network of AI safety institutes.
Science of AI Evaluation
Several papers are taking new directions in AI evaluation, with a vision of changing the paradigm. Do they succeed?
“Unlocking AI Evaluation with explanatory and predictive power through general ability scales” presents a fully-automated methodology building on 18 newly-crafted rubrics that place instance demands on general ability scales. This makes it possible to:
explain what abilities common benchmarks truly measure,
extract human-understandable ability profiles of AI systems, and
predict performance on new task instances, both in- and out-of-distribution.
Using the rubrics on about 16K examples in the new ADeLe battery, the paper identifies the demand profiles of 20 benchmarks (radial histograms) and the ability profiles of 15 models (derived as the 0.5-probability points on the logistic characteristic curves). As shown in the accompanying figure, this allows LLMs to be compared with benchmarks, explaining and anticipating how well they are expected to perform.
Beyond this visual inspection, using the demands on these scales allows lightweight predictive models (a random forest classifier) to work much better than other assessors based on embeddings or LLM finetuning, especially in out-of-distribution scenarios. There are many nuances and subtle connections with other AI evaluation and psychometric practices in the appendix, but also hidden gems: the paper features scaling laws using capabilities on an open, meaningful scale, compared with the same laws using % performance (which obviously saturates at 100%).
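As a rough illustration of what such a lightweight assessor could look like (our own sketch, with synthetic data and only five demand dimensions, whereas the ADeLe battery uses 18 rubric-based scales), a random forest is trained to predict whether a given model succeeds on an instance from that instance’s demand levels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical data: 2000 instances annotated with demand levels (0-5) on five dimensions.
demands = rng.integers(0, 6, size=(2000, 5)).astype(float)
# Simulated outcomes for one model: success becomes less likely as demands grow.
logit = 4.0 - demands.sum(axis=1) / 3.0
success = (rng.random(2000) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(demands, success, random_state=0)
assessor = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy of the instance-level assessor:", assessor.score(X_te, y_te))
```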
Finally, an associated collaborative platform has been created to evolve the scales and battery into a future standard. Contributions are welcome towards a standardisation of capability profiles and scales for AI evaluation.
METR, a third-party evaluation organisation, extends previous work that already used time by presenting a more complete study (post, paper) analysing how the probability of success on a task trends with the time humans need to complete it, focusing on software tasks. Again we see IRT applied in AI evaluation to talk properly about “abilities”, defined as the 0.5-probability point of a logistic function of performance against difficulty (time, on a logarithmic scale). It is no surprise that these new scaling laws are again exponential: for a task composed of many subtasks, if those subtasks were independent and all of them had to succeed for the task to be completed, the probability of overall success would simply be Prob(success) = p^k, where p is the probability of success on each of the k subtasks. This decays geometrically with task length, matching a geometric distribution (or its continuous counterpart, the exponential distribution) for the point of first failure, which are commonly used in engineering to estimate the probability of failure over time. So merely increasing the volume of the task has this effect, as the AI system or human will be running for longer. Since volume and time are almost linearly correlated in humans (and machines), we get these results, which basically show (in a different way) that LLMs are increasing the probability of success p on isolated subtasks. Compare their figure with simple geometric decay curves on a log scale, where models have isolated probabilities p of 0.7, 0.85, 0.9, 0.95, 0.97 and 0.98.
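A minimal sketch of those decay curves (ours, not METR’s actual figure): for each isolated per-subtask success probability p, plot p^k against task length k on a log axis; the 50% “horizon” is the length at which p^k drops to 0.5, i.e. k = ln(0.5)/ln(p).

```python
import numpy as np
import matplotlib.pyplot as plt

# Isolated per-subtask success probabilities mentioned above.
ps = [0.7, 0.85, 0.9, 0.95, 0.97, 0.98]
k = np.logspace(0, 3, 200)   # task "length" (number of subtasks), log-spaced

plt.figure()
for p in ps:
    # If all k independent subtasks must succeed, overall success is p**k.
    plt.plot(k, p**k, label=f"p = {p}")
    # 50% "horizon": the length at which overall success drops to 0.5.
    horizon = np.log(0.5) / np.log(p)
    plt.axvline(horizon, ls=":", lw=0.5)
plt.xscale("log")
plt.axhline(0.5, ls="--", color="grey")
plt.xlabel("task length k (number of independent subtasks, log scale)")
plt.ylabel("probability of overall success p^k")
plt.legend()
plt.show()
```

For p = 0.98 this horizon is already around 34 subtasks, versus roughly 2 for p = 0.7, which is how small gains in isolated reliability translate into much longer completable tasks.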
Disentangling what is needed for longer agential tasks, such as metacognition, and whether these tasks are just becoming larger rather than more complex or intertwined, is not fully covered in the paper. There is an analysis of “messiness”, which could partially account for how related the subtasks are (the process is then no longer memoryless), and this would have been a very interesting exploration to pursue further.
“Reliable and Efficient Amortized Model-based Evaluation” argues that while average scores on large benchmarks are useful for model selection, 1) testing on them is costly; and 2) using a fixed subset is not always possible (e.g., in healthcare). The authors therefore use IRT to estimate a unidimensional item difficulty and model ability levels, showing that these inferred abilities are more robust across test sets than average scores (a reassuring finding, as this is why IRT was developed). Further, they train an ML model to predict item difficulty, avoiding the need to test new instances on many models (similarly to this 2019 work). Finally, they finetune an LLM to generate prompts at specified difficulty levels for adaptive testing. Their motivation partly overlaps with the ADeLe paper featured above (putting benchmarks on a common scale), but the latter uses a multidimensional demand scale derived from rubric annotations rather than relying on a fixed population to infer univariate difficulty.
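A minimal sketch of the amortisation step (our own toy illustration, not the authors’ pipeline), assuming item difficulties have already been inferred with an IRT fit like the one sketched earlier: a regressor maps item text to difficulty, so new items can be calibrated without running them on many models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical items with IRT-inferred difficulties (e.g., from a fit like the one above).
items = ["What is 2 + 2?",
         "Prove that there are infinitely many primes.",
         "Name the capital of France.",
         "Derive the time complexity of mergesort."]
irt_difficulty = [-1.8, 1.5, -1.2, 0.9]

# Amortised difficulty predictor: item text -> estimated difficulty.
predictor = make_pipeline(TfidfVectorizer(), Ridge()).fit(items, irt_difficulty)
print(predictor.predict(["Prove the fundamental theorem of algebra."]))
```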
Towards an Evaluation Science for Generative AI: This position paper argues that AI evaluation should evolve over time and adapt to adverse effects noticed post-deployment, i.e. in real-world use of AI. The authors propose moving beyond ‘basic research’ towards ‘use-inspired basic research’, where evaluation shifts from broad assessments of general intelligence to assessments that are task-specific and grounded in real-world contexts. They note that current evaluations seldom include evaluations of human-AI interaction, and they call for clearer decisions on, and operationalisation of, the concepts under measurement. The paper recommends establishing dedicated institutions and standards, such as frameworks for structured evaluation and templates for model reporting, as well as increased investment in these initiatives.
Benchmarks
Harder variations of benchmarks keep coming up, and people are desperate for catchy names, such as MMLU-ProX (a translation of MMLU-Pro) and BIG-Bench Extra Hard (replacing tasks in BIG-Bench Hard with harder ones testing the same “capability”). They show a positive correlation between the size of the questions+answers and their difficulty, a trend related to the correlation between difficulty and time to solve a task in the METR paper, or the “volume” in the ADeLe paper. The challenge seems to be finding short and hard questions!
Expanding the space of tasks covered by LLM benchmarks: WritingBench evaluates long-form writing using a critic model equipped with query-specific criteria; KoLMogorov-Test assesses the ability to compress data sequences using program synthesis (thus trying to approximate Kolmogorov complexity; current LLMs perform poorly, but the paper doesn’t compare with the area of inductive programming using LLMs); AutoAdvExBench (the clunkiest name of the three) tests whether LLMs can autonomously identify and exploit weaknesses in defences against adversarial examples.
Fancier evaluation approaches
Tired of computing average scores on fixed benchmarks? Then why not:
Generating task instances adaptively to probe the failure modes of target models (as in this paper)?
Extracting LLM profiles and identifying weaknesses by automatically building hierarchical trees of the capability tested by each query in a benchmark (EvalTree)?
Removing the correct answer from multiple-choice benchmarks and replacing it with “None of the Others” to prevent models from using memorised knowledge?
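A minimal sketch of that last transformation (ours, assuming a hypothetical item format with an options list and the index of the correct answer):

```python
def none_of_the_others(item):
    """Replace the correct option with 'None of the Others', so that memorised
    answer strings no longer help (hypothetical item format)."""
    options = list(item["options"])
    options[item["answer_idx"]] = "None of the Others"
    return {"question": item["question"],
            "options": options,
            "answer": "None of the Others"}

example = {"question": "Which planet is known as the Red Planet?",
           "options": ["Venus", "Mars", "Jupiter", "Saturn"],
           "answer_idx": 1}
print(none_of_the_others(example))
```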
Other stuff
Whistle-stop tour:
Agents: a survey on the evaluation of LLM agents, and a Gym-like framework to train LLM agents via RL to perform ML tasks, pushing them closer to closing the circle. But can you trust agents to run tasks on your computer? This paper finds a gap between LLM agents’ theoretical understanding of risk (“is rm -rf dangerous?”) and their avoidance of risky behaviours when performing tasks.
Routers: when multiple models are available, a “router” can be used to redirect each query to the most suitable one (in terms of performance, cost, …). RouterEval is a benchmark to evaluate LLM routers (similar to the older RouterBench); Model-SAT, instead, builds a router by evaluating candidate models on small “aptitude tests” and turning the results into plain-text capability representations, which a lightweight LLM then uses to route each new query.
Human-like AI? “On Benchmarking Human-like Intelligence in Machines” runs a study with 240 human participants on a BIG-Bench sample and finds low agreement with the ground truth (for 26.67% of the examples, fewer than half of the participants agree with the ground-truth label); this may be caused by difficulty, which is only briefly mentioned. The paper proposes five recommendations to improve future benchmarks: using robust human data, evaluating population-level distributions, capturing graded human judgments, grounding tasks in cognitive theory, and designing ecologically valid scenarios.
Llama-like Alpacas? This paper uses the log-likelihood vector computed over a set of text strings to build a “model map” clustering models with similar characteristics. By computing their similarity they allow us to answer a transcendental question: how closely related are llamas and alpacas?
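A minimal sketch of the model-map idea (ours, with random numbers standing in for the log-likelihoods; in practice each entry would be the log-probability a model assigns to a probe string): represent each model by its vector of log-likelihoods over a fixed set of strings and cluster the vectors.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)

# Hypothetical log-likelihood vectors: one row per model, one column per probe string.
base_a, base_b = rng.normal(size=100), rng.normal(size=100)
loglik = np.stack([base_a + 0.1 * rng.normal(size=100),   # "llama-like" family
                   base_a + 0.1 * rng.normal(size=100),
                   base_b + 0.1 * rng.normal(size=100),   # "alpaca-like" family
                   base_b + 0.1 * rng.normal(size=100)])

# Cluster models by the correlation distance between their log-likelihood vectors.
dist = pdist(loglik, metric="correlation")
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
print("cluster labels per model:", labels)
```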
Contributors to this month’s digest: Lorenzo Pacchiardi, Jose H. Orallo, Marko Tesic, Nando Martínez-Plumed, Peter Romero, Lexin Zhou, and Joseph Castellano.
News to share? Feel free to reach out to ai.evaluation.newsletter@gmail.com
Getting the digest: Once a month if you join: