2024 February "AI Evaluation" Digest
In a recent blog post titled “We Need a Science of Evals”, the AI alignment-focused research organisation Apollo Research advocates for the establishment of a “Science of Evals”. While we applaud the initiative, and precisely because we stand behind the overall message, we have some comments to add on culture, history, and reinventing the wheel.
Read our reply here.
Highlights
We did a half-day lab at AAAI on “Measurement Layouts for Capability-Oriented AI Evaluation”. You can check out the materials here, including the slides and several Google Colabs to get started from the ground up, deriving cognitive profiles for both reinforcement learning agents and LLMs.
With the release of Gemini 1.5 Pro, the “Needle in a Haystack” test is gaining more traction. With such a narrow scope and a rather simple task, it is reminiscent of unit testing practices from software development: success should be expected, and failure tells us a lot. More like this, please; we haven’t run out of failure modes yet.
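For readers unfamiliar with the setup, here is a minimal sketch of the general idea: a “needle” fact is embedded at varying depths in a long filler context, and the model is asked to retrieve it. The `query_model` function, the needle text, and the length/depth grid are all illustrative assumptions, not the original test harness.

```python
# Minimal sketch of a "Needle in a Haystack" style check.
# Assumptions: `query_model` is a hypothetical wrapper around whatever LLM API
# you use; the needle, filler text, and grid values are purely illustrative.

NEEDLE = "The best thing to do in San Francisco is to eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"
FILLER = "The quick brown fox jumps over the lazy dog. "

def build_haystack(n_filler_sentences: int, needle_depth: float) -> str:
    """Embed the needle at a relative depth (0.0 = start, 1.0 = end) of the context."""
    sentences = [FILLER] * n_filler_sentences
    insert_at = int(needle_depth * len(sentences))
    sentences.insert(insert_at, NEEDLE + " ")
    return "".join(sentences)

def run_needle_test(query_model, context_lengths=(100, 1000, 5000), depths=(0.0, 0.5, 1.0)):
    """Grid over context length and needle position; success should be the norm."""
    results = {}
    for n in context_lengths:
        for d in depths:
            prompt = build_haystack(n, d) + "\n\n" + QUESTION
            answer = query_model(prompt)  # hypothetical LLM call
            results[(n, d)] = "Dolores Park" in answer
    return results
```

The point of the grid is exactly the unit-testing analogy: any cell that fails pinpoints a specific context length and needle position worth investigating.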
Open RL Benchmark is a large collaborative collection of meticulously tracked reinforcement learning experiments reporting various metrics over multiple implementations and environments (GitHub).
The Calibration Gap between Model and Human Confidence in Large Language Models highlights that default explanations from LLMs often lead users to overestimate both the model’s confidence and its accuracy.
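To make the notion of a calibration gap concrete, here is a toy worked example (not the paper’s protocol): compare empirical accuracy against the confidence the model reports and against the confidence users assign after reading the model’s explanation. All numbers below are made up for illustration.

```python
# Toy illustration of a "calibration gap": the difference between average
# stated confidence and empirical accuracy. All data here is fabricated.

model_confidence = [0.90, 0.80, 0.95, 0.70]  # probabilities the model reports
human_confidence = [0.95, 0.90, 0.97, 0.85]  # what users estimate after reading the explanation
correct          = [1, 0, 1, 0]              # whether each answer was actually right

accuracy  = sum(correct) / len(correct)
model_gap = sum(model_confidence) / len(model_confidence) - accuracy
human_gap = sum(human_confidence) / len(human_confidence) - accuracy

print(f"accuracy={accuracy:.2f}, model overconfidence={model_gap:+.2f}, "
      f"human overestimation={human_gap:+.2f}")
```

In this toy case the human gap is larger than the model gap, which mirrors the paper’s headline concern: explanations can inflate user trust beyond what the model’s own (already optimistic) confidence would justify.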
Two papers on AI auditing, from two groups of authors who tend to debate each other on AI safety priorities, with two distinct vocabularies and perspectives: (i) AI auditing: The Broken Bus on the Road to AI Accountability and (ii) Black-Box Access is Insufficient for Rigorous AI Audits.
Commentary & Techniques
Only LLMs, sorry!
A Collection of Principles for Guiding and Evaluating Large Language Models
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence
Peer-review-in-LLMs: Automatic Evaluation Method for LLMs in Open-environment
… and red teaming specifically
A StrongREJECT for Empty Jailbreaks, since not all jailbreaks are equally useful.
Robust Testing of AI Language Models Resilience with Novel Adversarial Prompts
Benchmarks
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
MMToM-QA: Multimodal Theory of Mind Question Answering. Note that it only includes videos of people looking for objects in household environments.
Jobs
The UK AI Safety Institute is looking for research scientists and engineers for AI evaluations.
Next month we will have a selection of papers from AAAI, which is currently taking place and features over 40 papers on evaluation and benchmarks.
Contributors to this month’s digest: José H. Orallo, Wout Schellaert, Nando Martínez-Plumed
How to contribute: Feel free to reach out to wschell@vrain.upv.es if you have news to share or want to get involved.