Highlights
Drawing on three years of experience, the paper “Lessons from the Trenches on Reproducible Evaluation of Language Models” addresses several challenges in evaluating LLMs, such as assessing correct natural language responses, designing benchmarks, and managing often opaque implementation details. It proposes best practices for improving the communication of results and the rigour of evaluation in the NLP community. To facilitate adoption, the authors also present lm-eval, an open-source library that provides a flexible API for model and task implementations and simplifies model evaluations (some case studies included!).
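For a flavour of what this looks like in practice, here is a minimal sketch of launching an evaluation through lm-eval's Python API. The model name, task list and argument values below are placeholders for illustration, and exact argument names can vary across versions of lm-evaluation-harness, so treat this as a sketch rather than canonical usage:

```python
# Minimal sketch of running an evaluation with lm-eval's Python API.
# Model, tasks and settings are illustrative; check the lm-evaluation-harness
# documentation for the version you have installed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # HuggingFace backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag", "arc_easy"],               # task names registered in the library
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (accuracy, normalised accuracy, ...) live under "results".
for task, metrics in results["results"].items():
    print(task, metrics)
```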
Stuff from ICLR 2024
The latest from ICLR 2024 showcases a range of interesting works on AI evaluation. Highlights include studies evaluating the psychological traits of LLMs, testing the ability of LLMs to learn low-resource languages from a single grammar book, and tracing data contamination in LLMs. There are lots of new benchmarks (MuSR, Pinocchio, Genie, ImagenHub, COLLIE, RepoBench, LLMBar). New papers also explore LLM capabilities such as memorisation, zero-shot robustness and human cognition, as well as bias in multiple-choice questions, binding mechanisms and reasoning biases. Finally, an invited talk by Moritz Hardt explores the emerging science of benchmarking: it addresses issues such as annotator error, the external validity of model rankings and the potential of multi-task benchmarks, challenges conventional wisdom, and highlights the need for systematic benchmarking studies in machine learning.
Policy and Governance
The UK government launched an £8.5 million grant programme to support cutting-edge research into AI safety and security and to protect society from AI-related risks.
Despite a landmark agreement announced by the UK PM to conduct pre-launch safety testing of AI models, Big Tech isn’t letting the UK AI Safety Institute test its models pre-deployment, according to Politico.
Techniques (mostly LLMs)
Noise may be a much more significant component of inaccuracy than bias, according to this paper detailing a noise audit of human-labelled benchmarks in machine commonsense reasoning. The authors find that noisy labels affect evaluation by up to 10% in human performance estimates and over 4% for AI models like ChatGPT, suggesting the need to reconsider single 'ground-truth' labels.
Model evaluations are critical to AI safety, but often neglect real-world human-AI interactions. This paper introduces "human interaction evaluations" (HIEs) to fill this gap, proposes a three-stage framework for designing HIEs, demonstrates its application to risks of overreliance and persuasion, and offers recommendations for improving AI evaluation practices.
Research shows that multimodal models such as CLIP and Stable Diffusion require exponentially more pre-training data to improve zero-shot generalisation, debunking the notion of innate zero-shot ability and introducing the "Let it Wag!" benchmark for further study.
In A Careful Examination of Large Language Model Performance on Grade School Arithmetic, Scale AI finds significant "overfitting" of certain LLMs (e.g., Phi and Mistral) to popular AI benchmarks, while frontier models (e.g., Gemini/GPT/Claude) show minimal signs of overfitting.
Prometheus 2 is an open-source language model designed to evaluate other LLMs with high accuracy, flexibility and alignment with human judgement, surpassing existing models in its correlation with human and proprietary LM scores. (The first in the saga, Prometheus 1, was presented at ICLR 2024.)
This study proposes methods to automatically detect under-trained tokens, known as "glitch tokens", in LLMs, aiming to improve model safety and efficiency by addressing the disconnect between tokenizer creation and model training.
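As a rough intuition for how such tokens can be spotted (this is an illustrative heuristic, not the paper's exact method), embedding rows that were rarely or never updated during training often look anomalous, for instance by having unusually small norms. A minimal sketch, assuming a HuggingFace causal LM as a stand-in:

```python
# Illustrative heuristic only (not the paper's method): flag tokens whose input
# embeddings have unusually small norms, a possible sign they were under-trained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

emb = model.get_input_embeddings().weight.detach()   # (vocab_size, hidden_dim)
norms = emb.norm(dim=1)
threshold = norms.mean() - 2 * norms.std()           # crude outlier cut-off

suspect_ids = torch.nonzero(norms < threshold).flatten().tolist()
print([tok.convert_ids_to_tokens(i) for i in suspect_ids[:20]])
```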
PRISM gathers feedback from 1,500 participants across 75 countries, linking their socio-demographic profiles to their preferences in conversations with 21 LLMs, revealing critical insights into the subjective and multicultural orientation of AI, and advocating for broader, inclusive participation in AI development.
Benchmarking Benchmark Leakage in LLMs identifies pervasive benchmark dataset leakage in LLMs, which undermines the fairness and validity of model evaluations. It proposes a detection pipeline using perplexity and N-gram accuracy metrics, and recommends the adoption of a "Benchmark Transparency Card" (Table 19, pages 29-30).
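To make the perplexity signal concrete, here is a rough sketch (not the authors' code) of the underlying idea: a model that has seen a benchmark during training tends to assign markedly lower perplexity to its verbatim items than to paraphrased variants. The model and the two example strings are placeholders:

```python
# Rough illustration of the perplexity signal used for leakage detection.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())

original = "A hypothetical benchmark question, copied verbatim."
paraphrase = "The same hypothetical question, but reworded."

# A large gap (original much lower) is one signal of possible training-set leakage.
print(perplexity(original), perplexity(paraphrase))
```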
A new method evaluates the task-specific accuracy of RAG systems using Item Response Theory to generate and refine multiple-choice exams, highlighting that optimising retrieval algorithms can significantly outperform mere increases in model size.
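For readers unfamiliar with Item Response Theory, the core object is an item characteristic curve: the probability that a system of a given ability answers an item correctly, as a function of the item's parameters. A minimal sketch of the standard two-parameter-logistic (2PL) model, which is the kind of model such exam-refinement pipelines typically build on (details in the paper may differ):

```python
# Sketch of the 2PL IRT model: P(correct) given ability `theta`,
# item discrimination `a` and item difficulty `b`. Items with flat curves
# (low discrimination) can be pruned when refining an exam. Illustrative only.
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item characteristic curve: P(correct | theta, a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An easy, poorly discriminating item vs. a harder, highly discriminating one.
for theta in (-1.0, 0.0, 1.0):
    print(theta, p_correct(theta, a=0.3, b=-1.0), p_correct(theta, a=2.0, b=0.5))
```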
Commentary
As in previous works [like this or this], the team at AI Snake Oil argues that Pareto curves accounting for cost-accuracy trade-offs provide a more informative evaluation framework than traditional AI rankings, revealing, for example, that simpler and cheaper baseline approaches can match or exceed the performance of more complex and costly AI agents.
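A toy sketch of the Pareto-frontier view they advocate: keep only the (cost, accuracy) points that are not dominated by a cheaper and at-least-as-accurate alternative. The agent names and numbers below are invented for illustration:

```python
# Toy Pareto-frontier computation over (cost, accuracy) points; data is made up.
def pareto_frontier(points):
    """Return points not dominated on (lower cost, higher accuracy)."""
    frontier = []
    for name, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, cost, acc))
    return frontier

agents = [
    ("simple baseline", 0.10, 0.62),
    ("retry baseline",  0.30, 0.71),
    ("complex agent A", 2.50, 0.70),   # dominated: pricier and less accurate
    ("complex agent B", 4.00, 0.73),
]
print(pareto_frontier(agents))
```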
REFORMS is a checklist of 32 questions and guidelines that aims to improve the validity, reproducibility and generalisability of ML-based science by providing a consensus-based framework developed by 19 experts from different fields. Modules 7 and 8 in the checklist are those related to evaluation.
Interesting post from NLP News on how the rapid advances of LLMs have outpaced existing benchmarks, with memorisation and overfitting becoming significant concerns.
A philosophical take on what a capability is in a machine learning system, here.
Benchmarks
In Evaluating feature attribution methods in the image domain, Gevaert et al. investigate 10 different quality metrics for attribution-based explanations on 8 image datasets and use the results to propose a set of benchmarking guidelines. They find that the quality metrics do not generalise well to other datasets and that methods with desirable theoretical properties do not outperform cheaper alternatives in practice.
Can Good Benchmarks Contain Mistakes? Yes, they can, with error rates of up to 35%, yet they remain useful until models surpass the expert-agreement ceiling of 74%; beyond that point, apparent improvements reflect a model's ability to predict annotation errors, pointing to the need for highly valid benchmarks despite their high costs and logistical challenges.
The CaLM (Causal Evaluation of Language Models) framework is presented as the first comprehensive benchmark for evaluating the causal reasoning capabilities of LLMs, encompassing a broad evaluation taxonomy, a comprehensive dataset, extensive model evaluations, and a multi-faceted platform.
EWOK is a new framework for assessing LLMs’ world knowledge across 11 domains, revealing that even large models lag behind human performance and suggesting rich opportunities for focused improvement in AI world modelling.
Google DeepMind introduces Gecko, a comprehensive benchmark that evaluates text-to-image models using 2,000 complex prompts and advanced metrics to reveal models' true capabilities and limitations, supported by extensive human scoring and automated evaluations. VentureBeat also wrote a pop piece about it. VALOR-EVAL has a similar goal and addresses hallucination issues in large vision-language models by evaluating objects, attributes and relations.
ALERT is a new comprehensive benchmark for assessing LLMs’ Safety through Red Teaming.
Microsoft releases a dataset containing 10 million real Bing search queries with 60 million user clicks.
The Frontier Safety Framework by Google DeepMind is aimed at identifying and mitigating severe risks from future AI models' "capabilities" (based on four domains: autonomy, biosecurity, cybersecurity and Machine Learning R&D), ensuring they align with human values and societal goals, and is set for full implementation by early 2025.
Databricks' Mosaic Evaluation Gauntlet includes 39 benchmarks across six competencies. The authors rely on scaling laws as a robust ground truth while selecting benchmarks based on their correlation with model scale, which could bias the selection process and undermine the claimed robustness of scaling laws.
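A toy illustration of that selection criterion: keep a benchmark only if its scores track model scale across a model family. All numbers below are invented, and the Gauntlet's actual data and thresholds are not reproduced here:

```python
# Hypothetical example of correlating benchmark scores with model scale.
import math
from statistics import correlation  # Pearson's r, Python 3.10+

params_billions = [0.125, 0.35, 1.3, 2.7, 7.0]     # hypothetical model family
benchmark_scores = [0.31, 0.35, 0.44, 0.49, 0.58]  # hypothetical benchmark accuracy

log_scale = [math.log10(p) for p in params_billions]
r = correlation(log_scale, benchmark_scores)
print(f"correlation with (log) scale: {r:.2f}")    # benchmarks below a chosen cut-off would be dropped
```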
Simple (and less simple) ways of making a benchmark harder from HuggingFace.
…and other cool stuff
A tutorial on LLM evaluation at LREC-COLING 2024
Contributors to this month’s digest: Wout Schellaert, Jose H. Orallo, Nando Martínez-Plumed, Carlos Monserrat, Lorenzo Pacchiardi, Joseph Castellano
News to share? Feel free to reach out to wschell@vrain.upv.es.
Getting the digest: Once a month if you join us: