2025 December "AI Evaluation" Digest
Call for Tributes: Your test of time.
It is the festive season, it is the end of the year, and everybody is doing it: It is time for reflection!
One of the key things this year has brought is the collective realisation that a science of evaluation is needed, and that it is a discipline in itself. Last year (i.e. 2024) there were already some takes: e.g. Apollo Research’s call to arms, Yarin Gal’s post, and our reply that this science already sort of exists. Then 2025 kept hammering it home: NIST posted about the need for improved science, it is a workstream in the UK AISI’s research agenda, Moritz Hardt has a new book on the emerging science of benchmarks, and a position paper from an MLCommons working group tells us our plumbing needs fixing. The sorry state of evaluation affairs is also the mission statement of the EvalEval coalition, the topic of a paper by Laura Weidinger, Deborah Raji and other familiar names, and a topic in the AAAI presidential panel…
We are overdramatising, of course. It is a good thing that some sort of consensus has formed, and it is okay to say the same things a couple of times. Better still, these complaints often come paired with concrete efforts to fix the named problems! New books, new groups, new perspectives. Wonderful, really, to see the field growing.
Part of that growth is also maturing, and part of that maturing is rediscovering what already existed. Psychometrics and Item Response Theory (IRT) in particular had a good run this year: METR rediscovered agent characteristic curves with horizon (time) as the difficulty parameter, Epoch introduced their IRT-based Capability Index, and psychometrics was central to the rather disappointing but widely circulated Definition of AGI (luckily psychometrics also allows for plenty of critique). Collectively, IRT and psychometrics were mentioned 70 times in our newsletter this year. There’s some sampling bias of course, but last year they were mentioned 12 times. That’s data for a trend 😉!
And conversely, part of the growth in the field is realising how it is different, and adapting tools and theory to your situation. LLM-judges are such a tool, embracing the noisiness of results and the need for automation at scale, and this past year saw them cemented as a default method, used wherever possible. LLM-judges were mentioned roughly 30 times in the newsletter, with 3 papers in this month’s edition alone, which still widely underreports how prominent they have become. The world hasn’t broken yet, so maybe that makes sense, or maybe next year we’ll see the fallout…
Lastly, with the end-of-year reflections also come new year’s resolutions: how can we improve ourselves as a field? We propose two things, which we’ll act on ourselves:
We raise the bar for talking about problems, and focus the conversation on bringing solutions.
We pay homage to the science that already exists.
Call for Tributes
In light of the above, we’re doing a Call for Tributes. Do you know any existing works that have stood the test of time, that are worthy of rediscovery, or just deserve more attention? Share them with us, and we’ll feature the best ones in next year’s newsletters. You only need to tell us in a single paragraph why it’s awesome. It’s a great opportunity to boost your friend’s cool project as a Christmas present.
Okay, let’s get to the content.
Takes
Some great takes this time of year!
The field of AI evaluation should pivot to evaluating the performance of human–AI teams (at least according to this cool position paper). “That’s expensive!” we hear you thinking, but the authors have their rebuttals ready for that and many other dismissals. In a timely set of related results, this preprint investigates how LLMs and humans co-construct epistemic errors. It finds that as tasks became denser, evaluators increasingly relied on surface cues, suggesting that error is not solely a property of model behaviour but a co-constructed outcome of generative plausibility and human interpretive shortcuts.
In his usual style, Ben Recht tells us we’re all wrong, that (simultaneously) benchmarking is all we need, and that there is no data-generating distribution, that quite important thing we rely on to make the holdout method sensible. Also as usual, the contrarianism is just the method of discovery, and the concepts (e.g. what he calls metrical determinism) are worth engaging with.
A paper at the NeurIPS LLM evaluation workshop argues that benchmarks serve two purposes: assessing the absolute capabilities of a fixed artifact (evaluating a *trained model*) and comparing the algorithms that lead to those final artifacts (evaluating *methods*, for instance two SGD algorithms). The main takeaway is that, in the first case, valid absolute scores are needed to yield predictions about real-world use, and these are easily biased by dataset flaws, task mismatch, etc. In the latter case, by contrast, we only care about the ranking between methods, and rankings are more robust to such flaws.
News
The AI Evaluator Forum launched on the 4th of December with an in-person event co-located with NeurIPS25. The Forum’s founding members include third-party evaluators such as Transluce, METR, RAND and SecureBio, and its mission includes establishing best practices for independent AI evaluation and facilitating knowledge sharing among evaluators. Towards this goal, their first output is AEF-1, a standard that specifies the minimum operating conditions under which third-party evaluators can collaborate with an AI lab to conduct evaluation exercises, and aims to make it easier for such evaluators to set up legal agreements and document the conditions under which an evaluation was conducted.
AISafety.com relaunched, and we’re including it here in case you’re looking for a job or funding: it lists various orgs with evaluation-related jobs that you might not have heard of.
Hugging Face released V2 of the LLM Evaluation Guidebook; pragmatic and approachable, yet not shying away from introducing readers to the limited generalisability of benchmark claims. Or quoting directly: “this model is the best on these samples for this specific task that we hope are a good proxy for this capability, without any guarantee”.
In a renaming which we imagine offended no political sensitivities, but which no one liked either, the International Network of AI Safety Institutes (AISIs) becomes the roll-off-the-tongue International Network for Advanced AI Measurement, Evaluation and Science (AAIMES).
The EU Commission is looking for feedback on a draft implementing act to establish AI regulatory sandboxes under the EU AI Act. Regulatory sandboxes make it possible to develop, train, validate and test new AI systems in a controlled framework set up by a competent authority, in some cases in real-world conditions, making them a fundamental tool for the testing and certification of AI (eco)systems. The deadline for comments has been extended until Jan 13.
Reviews, Surveys, and Meta-stuff
The UK AISI released a Frontier AI Trends report which is chock-full of results and measurements. In summary: various capabilities are improving, sometimes rapidly; safeguards improve as well, but remain vulnerable; and societal impacts are being felt, e.g. in terms of emotional dependence.
Arb Research released the 2025 shallow review of technical AI safety, which highlights relevant papers and people for various research agendas. Relevant to the audience of this newsletter, it includes 12 evaluation research agendas (AGI metrics, sandbagging, …). If you prefer linear reading, it’s well complemented by this post by the same authors that did the rounds online: AI in 2025: gestalt.
NIST, in a sort of post/agenda/call-to-action, lists a few open questions in AI measurement science.
Among other points of critique, this Substack post raises the issue that, at the current level of capabilities, the frontier of METR’s insanely popular horizon-length plot is measured with only 14 prompts.
Meta-methods & Methodology
Eval Factsheets is a proposal for structurally documenting AI evaluations.
In The Measure of All Measures, the authors also use IRT at the benchmark level to quantify some aspects of benchmark quality: hardness (i.e. difficulty), separability (between-model variance), and diversity (embedding-based dispersion of the prompts). It is a bit odd that they use a 1PL IRT model for hardness, but do not use, or even mention, the 2PL IRT model, whose discrimination parameter seems a natural counterpart to their new separability measure.
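For readers rusty on the distinction, here is a minimal sketch (ours, not the paper’s) of the two item response functions: the 1PL model only has a difficulty parameter, while the 2PL model adds a discrimination parameter controlling how sharply an item separates weaker from stronger systems.

```python
import numpy as np

def irt_1pl(theta, b):
    """1PL (Rasch): P(correct) depends only on ability theta and difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def irt_2pl(theta, b, a):
    """2PL: adds a discrimination parameter a, controlling how sharply the item
    separates systems just below difficulty b from those just above it."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

abilities = np.linspace(-3, 3, 7)
print(irt_2pl(abilities, b=0.0, a=0.5))  # shallow curve: weakly discriminating item
print(irt_2pl(abilities, b=0.0, a=2.0))  # steep curve: strongly discriminating item
```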
Benchrisk iteratively analysed 26 LLM benchmarks, identifying 57 failure modes and 196 mitigations. These are combined into a single “benchmark risk” score that is used to provide a metaevaluation benchmark, where higher scores indicate that benchmark users are less likely to reach an incorrect or unsupported conclusion.
Methods and Findings
This preprint finds that using smaller, cheaper surrogate models is a practical approach for active testing of LLMs.
Everybody knows LLM-judges have certain biases, and we could all probably do a little better at mitigating them. In terms of New Year’s resolutions, some things to consider for the next time you’re reporting scores:
This preprint proposes a simple calculation to at least correct the aggregate score to something more realistic (we sketch one such correction just after this list). You still have to measure alignment with ground truth on a small subset, but that really should be table stakes.
Going a bit further, this paper brings individual scores closer to how humans would have scored them by correcting for specific differences in preference. Instance features like length, positive tone, or creativity are annotated with an LLM and, together with a reference set of human scores, used to fit a linear corrector model.
If you can’t be bothered with humans at all, but have a bunch of judge models, this workshop paper presents a framework to aggregate the various judges and measures how well it reduces bias and improves accuracy, compared to baseline aggregation methods, against human-annotated ground truths.
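Picking up the first suggestion above, here is a minimal sketch of one standard correction of this kind: a Rogan-Gladen style adjustment using judge error rates estimated on a small human-labelled audit set. The preprint’s exact calculation may well differ.

```python
import numpy as np

def corrected_pass_rate(judge_labels, audit_judge, audit_truth):
    """Correct a judge-reported pass rate with a small human-labelled audit set.

    judge_labels: 0/1 judge verdicts on the full evaluation set.
    audit_judge, audit_truth: 0/1 judge verdicts and human ground truth on the
    audited subset, used to estimate the judge's error rates.
    """
    audit_judge = np.asarray(audit_judge, dtype=float)
    audit_truth = np.asarray(audit_truth, dtype=float)
    tpr = audit_judge[audit_truth == 1].mean()                # judge sensitivity
    fpr = audit_judge[audit_truth == 0].mean()                # judge false-positive rate
    observed = np.asarray(judge_labels, dtype=float).mean()   # raw judge pass rate
    corrected = (observed - fpr) / (tpr - fpr)                # undo the judge's systematic bias
    return float(np.clip(corrected, 0.0, 1.0))
```

For example, with a generous judge (tpr = 0.95, fpr = 0.20), a raw 70% pass rate deflates to (0.70 - 0.20) / 0.75 ≈ 67%.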
In order to deal with sandbagging and hidden objectives, we might just need to finetune the model to make mistakes sometimes and then admit them? Failures provide good opportunities for teaching, and often the specific mistakes are less important than underlying dynamics like epistemic humility and honesty. We should not optimise them all out.
Epoch recently published their Capability Index, which is based on IRT at the benchmark level, and their preprint with a slightly overzealous title now gives some more details. Abusing this newsletter to provide some feedback: (i) IRT at the benchmark level (Sec. 4.4) is not necessarily new (see e.g., this and this), and (ii) moving to the instance level (Sec. 4.5) is good in general and will indeed remedy some of the invariance issues discussed, but you will need to be careful about weighting various benchmarks against one another. Anyway, cool work.
In Auditing Games for Sandbagging, a red team trains an LLM to underperform (unless a password unlocks all the intact capabilities) and a blue team has to differentiate the sandbagged model from a less capable, non-sandbagging LLM. Figure 1 explains it very clearly. The blue team was generally unsuccessful, even when allowed to use probes, but the use of finetuning (to recover the capabilities) was a much more predictive signal. The authors (and we) wonder how well this method would stand up to future sandbagging approaches.
Anthropic’s Bloom is a new open source tool for automated behavioral evaluations.
Scaling laws predict performance based on properties of the task or of the system completing the task. Predicting performance and analysing these properties is a key part of evaluation. This preprint does so for agent systems, analysing the relation between coordination mechanisms, model families, model sizes, and capabilities as measured through the Artificial Analysis Intelligence Index, finding several patterns that need too much background to summarise here.
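As a reminder of what the basic exercise looks like, here is a purely illustrative sketch (hypothetical numbers, not the preprint’s data or functional form) of fitting a power-law scaling curve and using it to predict performance at a larger, unmeasured scale.

```python
import numpy as np

# Hypothetical measurements: benchmark error of agent systems at several scales.
sizes = np.array([1e8, 3e8, 1e9, 3e9, 1e10])       # e.g. parameter counts
errors = np.array([0.52, 0.44, 0.37, 0.31, 0.27])  # made-up error rates

# Fit error ~ a * size^(-b) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
a, b = np.exp(intercept), -slope

predicted = a * (3e10) ** (-b)   # extrapolate to a larger scale
print(f"fitted exponent b = {b:.2f}, predicted error at 3e10: {predicted:.2f}")
```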
This method estimates problem difficulty automatically by using LLMs to make a lot of pairwise difficulty comparisons and then computing Bradley-Terry scores, finding high correlation with human raters. Figure 1 is also a nice dissection of the space of difficulty measures.
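For intuition, here is a minimal sketch of turning pairwise “which problem is harder?” verdicts into Bradley-Terry difficulty scores via the classic MM algorithm; this is our illustration under simple assumptions, and the paper’s exact estimation procedure may differ.

```python
import numpy as np

def bradley_terry(n_items, comparisons, iters=200):
    """Fit Bradley-Terry strengths from pairwise outcomes with the MM algorithm.
    `comparisons` is a list of (winner, loser) index pairs, where the 'winner'
    is the problem an LLM judged to be harder. Items that never win need a
    little smoothing in practice; omitted here for brevity."""
    wins = np.zeros(n_items)              # W_i: total wins per item
    n = np.zeros((n_items, n_items))      # n_ij: comparisons between i and j
    for w, l in comparisons:
        wins[w] += 1
        n[w, l] += 1
        n[l, w] += 1
    p = np.ones(n_items)                  # initial strengths
    for _ in range(iters):
        denom = (n / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins / np.maximum(denom, 1e-12)   # MM update (Hunter, 2004)
        p /= p.sum()                          # fix the arbitrary scale
    return np.log(p)                          # log-strengths act as difficulty scores

# Toy usage: problem 2 is judged harder than the others most of the time.
print(bradley_terry(3, [(2, 0), (2, 1), (2, 0), (1, 0), (0, 1)]))
```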
The authors of this psychometric framework find evidence that reliable shaping and measurement of models’ synthetic personalities is possible, especially for larger, instruction-tuned models, and that these display consistent external validity on downstream tasks.
Events
The first EurIPS in history was held at the same time as NeurIPS, and was also full of evaluation-related events, papers and posters. One of the workshops focused specifically on the Science (yet again) of Benchmarking and Evaluating AI. Four presentations, posters and a panel filled the day. A theme of diversity emerged: different research questions in AI evaluation require different techniques and approaches.
Evaluation was also a key topic in and around NeurIPS25 in San Diego. Beyond the papers covered above, the most relevant events were:
before the start of the conference, MLCommons ran an event at Qualcomm’s headquarters, with key topics being collaboration between companies and academia, the need for professionalisation of AI evaluation and for increased evaluation literacy among practitioners, and a panel discussing issues with benchmarks
the Evaluating the Evolving LLM Lifecycle workshop which, among other things, included a great invited talk by Prof Sanmi Koyejo on “how to make LLM evaluation more robust”, highlighting the need for more measurement and predictive modelling and for going beyond the “tyranny of averaging”
a workshop on Evaluating AI in Practice co-organised by the EvalEval coalition and UK AISI the day after the end of NeurIPS25.
Benchmarks
Finally, a rock-solid benchmark! LITHOS is a collection of >200k (!) expert-annotated images of microscopic thin-section rock samples, where the task is to identify the mineral classes.
The ripple effect in LLMs refers to the notion that unlearning (suppressing) some knowledge (e.g., bioweapons) can have effects on other areas (e.g., basic biology). This preprint considers the many ways in which ripple effects happen (dependencies in the knowledge, or mere associations in the training set) and develops a new benchmark. The authors use it to explore several unlearning techniques and analyse the effect at the benchmark level in terms of semantic distance.
RouterArena is a new framework to evaluate LLM routing systems. It addresses the lack of specialised tasks and multi-faceted metrics in previous frameworks, while including many specialised models, which prevents the routing objective from collapsing into simply “identify the best generalist model”.
There is a lot of talk about AI recursively improving itself; two benchmarks presented at NeurIPS study two specific sub-aspects: optimising liquid cooling in data centers and writing code to speed up the training of “NanoGPT” models.
MMAR is a benchmark testing reasoning of audio-language models through speech, audio and music.
MindEval introduces an expert-validated benchmark for testing LLMs in realistic, multi-turn mental-health-support conversations. Using simulated patients and LLM-as-judge scoring grounded in clinical supervision guidelines, the authors show that even top models perform poorly, especially on therapy-specific communication skills.
AdvancedIF is a set of 1,600 prompts and curated rubrics that assess LLMs’ ability to follow complex instructions on long-horizon tasks.
SimWorld is an open-ended realistic simulator for autonomous LLMs/VLMs in the physical and social world, built on Unreal Engine.
DAComp is a benchmark of 210 data engineering and data analysis tasks. Current performance hovers around the midway mark of the authors’ custom score.
ReasonBENCH benchmarks the (in)stability of LLM reasoning.
Happy New Year all! And a reminder: share your awesome eval projects deserving rediscovery here to be featured in the digest.
News to share? Feel free to reach out to ai.evaluation.newsletter@gmail.com
Getting the digest: Once a month if you join.
Editor-in-Chief: Wout Schellaert
Contributors to this month’s digest: Lorenzo Pacchiardi, Jose H. Orallo, Behzad Mehrbakhsh, Ben Slater, Peter Romero, Konstantinos Voudouris, Daniel Romero, Zack Tidler, Joseph Castellano.


