From Mexico 1968 to Paris 2024: Lessons for AI evaluation?
In a bold arXiv paper, Domínguez-Olmedo et al. claim that if we can't beat contamination in large language model evaluation, we should embrace it by "finetuning each model under comparison on the same task-relevant data before evaluation". This is their Olympic rationale:
“The 1968 Olympics took place in Mexico City at the significant altitude of 2340 meters, higher than Australia’s tallest peak. Runners who had trained at altitude in their home countries were better prepared to compete in Mexico City’s conditions, as it turned out. But the hotly debated results of the Games did not lead the organizers to prohibit training at natural altitude. Instead, they let everyone do it; and athletes came to consider altitude training an excellent way to train. The anecdote holds a lesson for the evaluation of large language models half a century later. Knowledge about the evaluation conditions necessarily influences training practices under competitive pressure. It may be a fool’s errand to prohibit the practice. Instead, we propose to adjust for it by giving every model the same task-specific preparation before evaluation. We work from the assumption that training on the test task, in general, cannot be effectively detected, disallowed, or disincentivized. Detecting what training data a model has seen is a notoriously difficult problem—existing heuristics achieve partial success at best. Researchers routinely acknowledge the futility of fighting data contamination. Moreover, we anticipate that the ways to effectively train on the test task will only grow in scope and adoption.”
This is controversial for a number of reasons. Starting with the rationale, there are other Olympics-themed analogies that could have been chosen, such as doping. Doping, like contamination, is hard to detect, yet effective countermeasures have been put in place over the years, and anti-doping controls are a fixture of the Paris 2024 Olympics and of every major sports competition today. Another issue is that while the sports analogy works well for performance-centric, task-oriented AI evaluation, it may not work well for other kinds of evaluation, such as fairness or safety.

But the main question is whether we want AI evaluation to hand out medals, or to estimate capabilities and other constructs that allow us to predict behaviour in and out of distribution. For the former, what the authors propose may actually be fairer than the current situation of comparing some models that have been contaminated against others that haven't. For the latter, it is more questionable.

Still, the bold claim of the paper may divert attention from some very interesting findings. The main message, as they show in their Figure 1, is that adjusting for differences by fine-tuning all models on the same task-specific data can shed light on scaling laws and emergent-performance phenomena, by placing models on a fairer footing. However, how much data the task-specific fine-tuning should use, and whether this preparation could saturate models in the future, remain open questions.
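To make the proposal concrete, here is a minimal sketch, in Python, of what "giving every model the same task-specific preparation before evaluation" could look like as a protocol. The Model, finetune and evaluate pieces are hypothetical placeholders rather than anything from the paper; they stand in for whatever training and benchmarking pipeline a lab already has.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-ins for a real training and benchmarking pipeline.
# The only point illustrated here is the protocol: every model receives
# the *same* task-relevant preparation before it is scored.

@dataclass
class Model:
    name: str
    accuracy_fn: Callable[[str], float]  # maps a test-set name to a score


def finetune(model: Model, task_data: List[str]) -> Model:
    """Placeholder: return the model after fine-tuning on shared task data."""
    # In practice this would run, e.g., a few epochs of supervised fine-tuning.
    return Model(name=f"{model.name}+task-ft", accuracy_fn=model.accuracy_fn)


def evaluate(model: Model, test_set: str) -> float:
    """Placeholder: run the benchmark harness and return a score."""
    return model.accuracy_fn(test_set)


def adjusted_comparison(models: List[Model],
                        task_data: List[str],
                        test_set: str) -> Dict[str, float]:
    """Prepare every model identically on task_data, then evaluate on test_set."""
    results: Dict[str, float] = {}
    for model in models:
        prepared = finetune(model, task_data)  # same preparation for all models
        results[model.name] = evaluate(prepared, test_set)
    return results
```

The open questions noted above would surface here as concrete design choices: how large task_data should be, and whether the same preparation eventually saturates some models while still lifting others.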
Commentary
This paper argues that current AI evaluation methods are fundamentally flawed, and that cognitive science-inspired reform is needed to ensure the safety and reliability of increasingly capable and common AI systems. Similar arguments are made here.
A new report from the Ada Lovelace Institute examines the evaluation of foundation models, arguing that while evaluations are essential for understanding the risks and impacts of AI, they need significant improvement and standardisation to effectively inform policy and regulation.
This study identifies significant limitations in current AI agent evaluation benchmarks and argues for cost-controlled evaluations, joint accuracy/cost optimisation, and standardised practices to align agent performance with real-world applications.
Anthropic announces funding for third-party evaluations that measure AI model capabilities and safety.
OpenAI has introduced an internal scale to measure the progress of its large language models towards AGI. The scale defines five levels of advancement, with current chatbots like ChatGPT at Level 1, progressing through human-level problem solving (Level 2) to AI that handles tasks independently (Level 3), creates new innovations (Level 4), and performs the work of entire organisations (Level 5).
Benchmarks
MetaBench enables highly efficient and scalable evaluation by distilling six large benchmarks into a concise set that is less than 3% of their combined original size while preserving accuracy.
Researchers evaluate the effectiveness of existing hallucination benchmarks for large vision-language models, identifying reliability and validity issues. They propose a new High-Quality Hallucination Benchmark (HQH) and a Hallucination benchmark Quality Measurement framework (HQM) to improve the assessment of model hallucinations and conduct extensive evaluations using this new framework.
DeepMind introduces the Perception Test, a unique benchmark for evaluating multimodal video models on abilities such as memory, abstraction, physics and semantics across video, audio and text modalities.
Wolfram's LLM benchmarking project evaluates LLMs for their ability to generate accurate Wolfram Language code, highlighting significant differences in syntax correctness and functional accuracy between different models.
Methods
Another kind of Olympics: Large vision-language models struggle with age-appropriate problem solving in children's maths Olympiads and perform significantly worse than children, especially in the lower grades.
According to Paul Simon, there were at least 50 ways to leave your lover. Well, according to this paper, there are at least 43 ways in which machine learning evaluations can be misleading.
Current LVLMs show improved reasoning skills for higher grades, but struggle significantly with basic-level problems designed for younger children, highlighting a fundamental difference in reasoning ability.
This paper identifies two types of anthropocentric bias that have been neglected: overlooking how auxiliary factors can impede LLM performance despite competence (Type I), and dismissing LLM mechanistic strategies that differ from those of humans as not really competent (Type II).
This paper proposes the Evidence-Centered Benchmark Design (ECBD) framework for the systematic design and evaluation of NLP benchmarks, focusing on the formalisation of design decisions and the collection of validity evidence.
LLMs instead of Human Judges? Nope, not yet. LLMs exhibit inconsistent performance across 20 diverse NLP tasks compared to human judgments, indicating they aren't ready to fully replace human evaluators.
But… weaker LLM judges improve the accuracy of oversight when stronger LLMs debate rather than provide consultancy; effectiveness varies by task type, showing great promise for debate protocols in ensuring accurate AI oversight.
Contributors to this month’s digest: Wout Schellaert, Jose H. Orallo, Nando Martínez-Plumed, Joseph Castellano
News to share? Feel free to reach out to wschell@vrain.upv.es.
Getting the digest: Once a month if you join: