There was a lot of great content this month. We organised this digest into sections:
Reinforcement Learning
NLP
Meta & cross-domain papers
(Existential) Safety
Policy
And direct coverage from AAAI in Vancouver and EACL in Malta.
There are no overall highlights, but each section should be either short enough to read in full or come with its own highlights!
Here we go.
Reinforcement learning
Banafsheh Rafiee, in her recently defended PhD thesis, introduces three diagnostic RL testbeds based on animal learning experiments. See Chapter 3. (Twitter)
With RewardBench, the Allen Institute takes a stab at evaluating reward models that predict human preferences over outputs, which are central to the reinforcement learning from human feedback (RLHF) that makes modern LLMs so successful.
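To make concrete what such an evaluation boils down to, here is a minimal sketch, not RewardBench's actual code: score a "chosen" and a "rejected" response for the same prompt and count how often the reward model ranks the chosen one higher. The `reward_model` function is a hypothetical placeholder for whatever scorer you want to evaluate.

```python
# Sketch of the core loop behind reward-model benchmarks: does the reward model
# assign a higher score to the human-preferred ("chosen") response than to the
# rejected one? `reward_model` below is a dummy placeholder, not a real model.

def reward_model(prompt: str, response: str) -> float:
    # Placeholder: replace with a real reward model returning a scalar score.
    return float(len(response))

def preference_accuracy(pairs):
    """pairs: iterable of (prompt, chosen_response, rejected_response)."""
    correct = 0
    total = 0
    for prompt, chosen, rejected in pairs:
        if reward_model(prompt, chosen) > reward_model(prompt, rejected):
            correct += 1
        total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    demo_pairs = [
        ("What is 2+2?", "4", "5"),
        ("Name a prime number.", "7", "9"),
    ]
    print(f"Preference accuracy: {preference_accuracy(demo_pairs):.2f}")
```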
Jumanji introduces a suite of diverse RL environments fully programmed in JAX. It should be lightning fast! We see some traditional games (e.g. Sokoban), but also NP-hard problems like bin packing.
Craftax is an extension of Crafter with procedural dungeons and more, also programmed in JAX, with reported 100-fold speedups. Of course, you mostly unlock the benefits if your agent is also highly parallelizable (e.g. JAX-based); see the sketch below.
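To illustrate why the agent needs to be JAX-based too, here is a toy sketch (not the actual Jumanji or Craftax API; the environment and policy are made-up pure functions) of the jit-and-vmap pattern these libraries enable:

```python
# Illustrative only: a made-up pure-function environment and policy, showing how
# jit + vmap lets thousands of environment copies step in lockstep on one device.
import jax
import jax.numpy as jnp

def env_step(state, action):
    # Hypothetical environment transition: new state and reward from (state, action).
    new_state = state + action
    reward = -jnp.abs(new_state)          # toy reward: stay near zero
    return new_state, reward

def policy(params, state):
    # Hypothetical linear policy, also a pure JAX function.
    return jnp.tanh(params * state)

@jax.jit
def batched_step(params, states):
    actions = jax.vmap(lambda s: policy(params, s))(states)
    return jax.vmap(env_step)(states, actions)

states = jnp.zeros(4096)                  # 4096 environments in parallel
params = jnp.array(0.5)
states, rewards = batched_step(params, states)
print(rewards.shape)                      # (4096,)
```

With everything expressed as pure JAX functions, the whole agent-environment loop compiles into a single accelerator program, which is where the reported speedups come from.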
NLP
Highlights
We all know we should do it more: actually look at the data. With LLM Comparator, a visual side-by-side analytics tool for assisting evaluation, that should be easier now.
As you might have noticed before, we have a soft spot for evaluations based on methodologies from other cognitive sciences. CogBench: a large language model walks into a psychology lab is such a paper.
In a new paper on bias evaluation called Toward RUTEd Evaluation, the authors argue that a lot of current bias benchmarks are “trick tests”: overly synthetic, engineered and de-contextualised. They perform a few evaluations based on more ecologically valid tasks (i.e. things people actually use LLMs for), and find that traditional bias benchmarks barely correlate with the newly measured harms. The authors do note the study is limited in many ways.
…and other cool works
Not a big surprise, but more proof is always welcome: LLMs don’t hold up to semantic-preserving variations in the input. See Semantic Sensitivities and Inconsistent Predictions: Measuring the Fragility of NLI Models.
Why use thousands of examples in your benchmark if 100 will do? Using item response theory (IRT), the tinyBenchmarks paper constructs small subsets of instances whose scores correlate highly with performance on the full benchmark (the basic idea is sketched below). Useful if you’re on a budget (of money or time!).
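This is not the tinyBenchmarks method itself (the paper selects items using fitted IRT parameters and applies corrections); it is just a toy sketch of the premise on synthetic data: per-model scores on a small subset of items can track full-benchmark scores closely.

```python
# Toy check on synthetic data: how well does a 100-item subset track the full
# 2000-item benchmark across 30 models? Real work would pick the subset via IRT.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_items, subset_size = 30, 2000, 100

# Synthetic binary correctness matrix (models x items) with varying model ability.
ability = rng.normal(size=(n_models, 1))
difficulty = rng.normal(size=(1, n_items))
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
correct = (rng.random((n_models, n_items)) < p_correct).astype(float)

full_scores = correct.mean(axis=1)
subset = rng.choice(n_items, size=subset_size, replace=False)
subset_scores = correct[:, subset].mean(axis=1)

print(f"Correlation (subset vs full): {np.corrcoef(subset_scores, full_scores)[0, 1]:.3f}")
```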
Two new benchmarks dealing with long inputs and going beyond the “Needle in a Haystack” test we covered last month: the GoodAI Long Term Memory Benchmark and Bar-Ilan University’s Same Task, More Tokens.
After HELM, HELM Lite, and HEIM, there is now HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings, with mixed human/LLM evaluation.
It will be outdated soon, but for now, the paper Datasets for Large Language Models: A Comprehensive Survey provides an impressive and thorough 180-page survey of LLM datasets and benchmarks.
DyVal 2: Dynamic Evaluation of Large Language Models by Meta Probing Agents: The paper has two parts: the first automatically generates variations of existing items to check whether LLMs are contaminated, and the second creates new instances (items) that load more heavily on three main factors (language understanding, problem solving and domain knowledge).
Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models finds that if you set up an analogical reasoning problem with a custom alphabet, LLMs suddenly stop performing better than humans. More proof for the LLMs-are-not-good-reasoners pile.
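As a purely hypothetical illustration (not an item from the paper), a counterfactual letter-string analogy could be built by permuting the alphabet, so the familiar ordering a model has memorised no longer applies:

```python
# Hypothetical "counterfactual alphabet" analogy item: the classic abc -> abd,
# ijk -> ? puzzle, restated over a shuffled alphabet with its own ordering.
import random
import string

random.seed(0)
standard = list(string.ascii_lowercase)
permuted = standard.copy()
random.shuffle(permuted)                      # the counterfactual ordering

def successor(letter, alphabet):
    return alphabet[(alphabet.index(letter) + 1) % len(alphabet)]

def make_item(alphabet, start=8):
    a, b, c = alphabet[start:start + 3]
    source = f"{alphabet[0]}{alphabet[1]}{alphabet[2]} -> {alphabet[0]}{alphabet[1]}{successor(alphabet[2], alphabet)}"
    target = f"{a}{b}{c} -> {a}{b}{successor(c, alphabet)}"
    return source, target

print("Standard alphabet :", make_item(standard))
print("Counterfactual    :", make_item(permuted))
```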
Meta and cross-domain
Nature published a perspective about Artificial intelligence and illusions of understanding in scientific research.
Dan Hendrycks, from the Center for AI Safety and author of many popular benchmarks, published a short blogpost on Devising ML Metrics that is worth a read. While this newsletter often argues for in-depth and granular evaluations, statements from the blogpost like “Benchmark performance should be described with as few numbers as possible” are a good reminder that simplicity is an important factor in adoption and popularity, which are major drivers of a benchmark’s utility.
Characterising instance hardness in classification and regression problems does what the title says. The study mostly uses synthetic datasets.
Safety
Highlight
SafeBench is back. After being promoted, and then disappearing for a while, SafeBench returns with $250,000 in prizes for ML Safety benchmarks.
Papers
Auditing the AI Auditors provides an interdisciplinary look into AI auditing practices from the perspective of psychology.
Policy and Governance
The Biden administration issued a directive requiring all executive agencies to appoint Chief AI Officers and establish AI Governance Boards to ensure the safe and responsible use of AI. However, Burtell and Toner argue that the lack of standardised tools for evaluating AI systems' reliability, fairness, and security poses a significant challenge to implementing the directive. They emphasise the need for increased funding and support for organisations like NIST to develop better AI measurement and evaluation methods: in the US, Congress allocated a mere $10m for this, while the UK’s AI Safety Institute received $125m.
Who has achieved AGI? In another example of evaluation colliding with policy, the new lawsuit Elon Musk brought against OpenAI rests on a judicial determination of whether OpenAI has already achieved “Artificial General Intelligence”. Who will evaluate that? (Twitter)
In a public letter, more than 300 AI researchers asked for a legal and technical safe harbour to perform independent external safety evaluations of AI models. The gist is that AI companies like OpenAI and Google have implemented developer policies that make it difficult for external researchers to conduct evaluations of their AI systems. While many companies perform extensive testing in-house, for technology as impactful as AI it would be nice if we could double-check them without going to jail.
EACL
Everything here is worth a highlight!
IRT for NLP Tutorial. Item response theory (an instance-based item analysis theory from psychometrics) is increasingly applied in machine learning and artificial intelligence, but this is the first IRT tutorial in AI to focus specifically on NLP. The tutorial introduces the key ideas of IRT (the core model is sketched below), goes through examples and notebooks in the Python library py-irt (which expedites the construction of the most common IRT models), and includes many pointers to extensions and advanced material.
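For readers new to IRT, the standard two-parameter logistic (2PL) model captures the core idea: the probability of a correct response depends jointly on the subject's ability and the item's difficulty and discrimination.

```latex
% Two-parameter logistic (2PL) IRT model: probability that subject i (e.g. a model)
% answers item j correctly, given ability \theta_i, item difficulty b_j and
% item discrimination a_j.
\[
  P(y_{ij} = 1 \mid \theta_i) \;=\; \sigma\big(a_j(\theta_i - b_j)\big)
  \;=\; \frac{1}{1 + e^{-a_j(\theta_i - b_j)}}
\]
```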
“Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs” won the Best “Non-Publicised” Paper Award, a new category at EACL (for papers that remain anonymous during submission, with no arXiv version until acceptance). The authors present a very comprehensive account of NLP benchmarks that have been leaked to proprietary language models, not by including them in the training set, but simply by using them. This is an important message to anyone doing evaluations: do not submit any item or benchmark you want to keep private, because the mere use of a benchmark (via the chat window or the API, depending on the licence) will leak information about it. The presentation also included a series of recommendations: 1) access the model in a way that does not leak data, 2) interpret performance with caution, 3) when possible, avoid using closed-source models, 4) adopt a fair and objective comparison, 5) make the evaluation reproducible, and 6) report indirect data leakage. If you’re interested in data contamination, check out the first workshop on contamination.
In A Proposal for Scaling the Scaling Laws, the authors of this newsletter present an idea for scaling analysis based on granular evaluation data, without doing any of the actual work. I guess we were too busy reading papers and making this newsletter!
AAAI
Highlight
Reproduce, Replicate, Reevaluate. The Long but Safe Way to Extend Machine Learning Methods – The paper presents a systematic approach to ML experiments that enables the early detection and correction of deficiencies, leading to more robust and transparent methods, illustrated by extending Knowledge Enhanced Neural Networks to knowledge graphs. Recognising the common need to re-implement systems before extending them, the authors propose progressively reproducing, replicating and re-evaluating experiments to ensure the reliability of the re-implementation, and claim this gives a deeper understanding of the potential challenges at each stage and how to address them. The paper thus advocates integrating these reproducibility steps into the workflow for extending ML methods, emphasising that while diving straight into new methods is tempting, reproducibility increases the reliability of results. The authors also suggest adding recording and repeating to the reproducibility steps to improve SOTA integration and the self-checking of documentation and automation.
… and a relatively arbitrary selection of the rest
NLP
LLMEval: A Preliminary Study on How to Evaluate Large Language Models
Dialogues Are Not Just Text: Modeling Cognition for Dialogue Coherence Evaluation
Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models
LLMs as graders
Computer vision in an open-vocabulary setting:
ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models
How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection
Reinforcement learning
Beyond Expected Return: Accounting for Policy Reproducibility When Evaluating Reinforcement Learning Algorithms
Behavioural models and game theory
Contributors to this month’s digest: Wout Schellaert, Jose H. Orallo, Nando Martínez-Plumed, Joe Castellano.
News to share? Feel free to reach out to wschell@vrain.upv.es.