2023 July “AI Evaluation” Digest

Jul 28, 2023

The European Commission has launched the TEFs: the Sectorial AI Testing and Experimentation Facilities (compute.dtu.dk, ec.europa.eu).
Apollo Research starts as a new AI research organisation dedicated to building a holistic evaluation suite (announcement).
Jack Clark advocates for the UK Foundation Model Taskforce to focus on evaluation (essay), and UK AI companies seem willing to give early and priority access to support this effort (source). Internal sources suggest that evaluation is likely to be a central aspect of the taskforce's work.
Toby Shevlane and many others publish “Model evaluation for extreme risks” (paper).
Yan Zhuang et al. bring adaptive testing to LLMs (paper).
In “Lost in the Middle: How Language Models Use Long Contexts”, Nelson F. Liu et al. investigate how LLMs use long input contexts, finding that performance varies significantly depending on where the relevant information is located (paper).
OpenAI commits 20% of its compute to superintelligence alignment, and makes AI evaluation performed by AI a central part of the whole programme: “AI systems to assist evaluation of other AI systems (scalable oversight)”. Does this mean human oversight is over?
Brittle evaluation and static benchmarks are listed among the challenges in the trending survey paper “Challenges and Applications of Large Language Models” (paper).
In “Multi-Dimensional Ability Diagnosis for Machine Learning Algorithms”, Qi Liu et al. publish work on the psychometric evaluation of classifiers (paper).
In a Science letter titled “How do we know how smart AI systems are?”, Melanie Mitchell gives a good summary of the rights and wrongs of AI evaluation (letter).
In a similar vein, Science also reported in April on AI evaluation with Ryan Burnell et al.’s “Rethink reporting of evaluation results in AI”, testifying to the increasing importance of AI evaluation to the broader public (report).

Contributors to this month’s digest: Wout Schellaert, Jose Hernandez-Orallo, Lexin Zhou.

How to contribute: feel free to reach out to wschell@vrain.upv.es if you want to get involved, or if you have news to share that you are not comfortable posting as a standalone post.

The AI Evaluation Substack
