2024 February "AI Evaluation" Digest
In a recent blog post titled “We Need a Science of Evals”, the AI alignment-focused research organisation Apollo Research advocates for the establishment of a “Science of Evals”. While we applaud the initiative, and precisely because we stand behind the overall message, we have some comments to add on culture, history, and reinventing the wheel.
Read our reply here.
Highlights
We did a half-day lab at AAAI on “Measurement Layouts for Capability-Oriented AI Evaluation”. You can check out the materials here, including the slides and several Google Colabs to get started from the ground up, deriving cognitive profiles for both reinforcement learning agents and LLMs.
With the release of Gemini 1.5 Pro, the “Needle in a Haystack” test is gaining more traction. With such a narrow scope and a rather simple task, it is reminiscent of unit testing practices from software development: success should be expected, and failure tells us a lot. More like this, please; we haven’t run out of failure modes yet.
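For readers unfamiliar with the setup, here is a minimal sketch of the general idea: a “needle” fact is embedded at varying depths in a long filler context, and the model is asked to retrieve it. The `query_model` function, the needle text, and the length/depth grid are all illustrative assumptions, not the original test harness.

```python
# Minimal sketch of a "Needle in a Haystack" style check.
# Assumptions: `query_model` is a hypothetical wrapper around whatever LLM API
# you use; the needle, filler text, and grid values are purely illustrative.

NEEDLE = "The best thing to do in San Francisco is to eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"
FILLER = "The quick brown fox jumps over the lazy dog. "

def build_haystack(n_filler_sentences: int, needle_depth: float) -> str:
    """Embed the needle at a relative depth (0.0 = start, 1.0 = end) of the context."""
    sentences = [FILLER] * n_filler_sentences
    insert_at = int(needle_depth * len(sentences))
    sentences.insert(insert_at, NEEDLE + " ")
    return "".join(sentences)

def run_needle_test(query_model, context_lengths=(100, 1000, 5000), depths=(0.0, 0.5, 1.0)):
    """Grid over context length and needle position; success should be the norm."""
    results = {}
    for n in context_lengths:
        for d in depths:
            prompt = build_haystack(n, d) + "\n\n" + QUESTION
            answer = query_model(prompt)  # hypothetical LLM call
            results[(n, d)] = "Dolores Park" in answer
    return results
```

The point of the grid is exactly the unit-testing analogy: any cell that fails pinpoints a specific context length and needle position worth investigating.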
Open RL Benchmark is a large collaborative collection of meticulously tracked reinforcement learning experiments reporting various metrics over multiple implementations and environments (GitHub).
The Calibration Gap between Model and Human Confidence in Large Language Models highlights that default explanations from LLMs often lead users to overestimate both the model’s confidence and its accuracy.
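To make the notion of a calibration gap concrete, here is a toy worked example (not the paper’s protocol): compare empirical accuracy against the confidence the model reports and against the confidence users assign after reading the model’s explanation. All numbers below are made up for illustration.

```python
# Toy illustration of a "calibration gap": the difference between average
# stated confidence and empirical accuracy. All data here is fabricated.

model_confidence = [0.90, 0.80, 0.95, 0.70]  # probabilities the model reports
human_confidence = [0.95, 0.90, 0.97, 0.85]  # what users estimate after reading the explanation
correct          = [1, 0, 1, 0]              # whether each answer was actually right

accuracy  = sum(correct) / len(correct)
model_gap = sum(model_confidence) / len(model_confidence) - accuracy
human_gap = sum(human_confidence) / len(human_confidence) - accuracy

print(f"accuracy={accuracy:.2f}, model overconfidence={model_gap:+.2f}, "
      f"human overestimation={human_gap:+.2f}")
```

In this toy case the human gap is larger than the model gap, which mirrors the paper’s headline concern: explanations can inflate user trust beyond what the model’s own (already optimistic) confidence would justify.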
Two papers on AI auditing, from two groups of authors who tend to debate each other on AI safety priorities, with two distinct vocabularies and perspectives: (i) AI auditing: The Broken Bus on the Road to AI Accountability and (ii) Black-Box Access is Insufficient for Rigorous AI Audits.
Commentary & Techniques
Only LLMs, sorry!
A Collection of Principles for Guiding and Evaluating Large Language Models
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence
Peer-review-in-LLMs: Automatic Evaluation Method for LLMs in Open-environment
… and red teaming specifically
A StrongREJECT for Empty Jailbreaks, since not all jailbreaks are equally useful.
Robust Testing of AI Language Models Resilience with Novel Adversarial Prompts
Benchmarks
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
MMToM-QA: Multimodal Theory of Mind Question Answering. Note that it only includes videos of people looking for objects in household environments.
Jobs
The UK AI Safety Institute is looking for research scientists and engineers for AI evaluations.
Next month we will have a selection of papers from AAAI, which is currently taking place and features over 40 papers on evaluation and benchmarks.
Contributors to this month’s digest: José H. Orallo, Wout Schellaert, Nando Martínez-Plumed
How to contribute: Feel free to reach out to wschell@vrain.upv.es if you have news to share or want to get involved.