April 2024 "AI Evaluation" Digest
Only a few updates in the realm of evaluation this month. Short and sweet.
Highlights
A well-deserved highlight for this month is OSWorld: a benchmark and infrastructure project for running digital agents in real computer environments, with the screen acting as the observation space and keyboard and mouse control as the action space. Example tasks include creating GIFs from video, manipulating PowerPoint, or installing software. We are very excited about the potential of this sort of environment as a platform for testing generality, where various tasks can be unified behind a single interface. One challenge, however, is the difficulty of doing automated evaluation in complex environments with multi-step tasks. In any case, there is a data explorer where you can see some agent recordings, a website, a Discord community... Go have a look!
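To make the screen-as-observation, keyboard-and-mouse-as-action framing concrete, here is a minimal sketch of the kind of interaction loop such an environment implies. Everything in it (the environment class, the action schema, the policy) is a hypothetical stand-in for illustration, not OSWorld's actual API.

```python
from dataclasses import dataclass


# Hypothetical action schema: a single keyboard/mouse command.
@dataclass
class Action:
    kind: str          # e.g. "click", "type", "key"
    x: int = 0         # screen coordinates for mouse actions
    y: int = 0
    text: str = ""     # payload for "type" actions


class StubDesktopEnv:
    """Stand-in for a real desktop environment: screenshots in, keyboard/mouse out."""

    def reset(self, task_instruction: str) -> bytes:
        self.task = task_instruction
        return b"<png bytes of the initial screenshot>"

    def step(self, action: Action) -> tuple[bytes, bool]:
        # A real environment would execute the action in a VM and re-capture
        # the screen; here we just pretend the task finishes immediately.
        return b"<png bytes of the next screenshot>", True


def dummy_policy(screenshot: bytes, instruction: str) -> Action:
    # A real agent would feed the screenshot (and the instruction) to a
    # vision-language model and parse its output into an Action.
    return Action(kind="click", x=100, y=200)


def run_episode(env: StubDesktopEnv, instruction: str, max_steps: int = 15) -> None:
    obs = env.reset(instruction)
    for _ in range(max_steps):
        action = dummy_policy(obs, instruction)
        obs, done = env.step(action)
        if done:
            break


if __name__ == "__main__":
    run_episode(StubDesktopEnv(), "Convert the open video to a GIF and save it to the Desktop.")
```

The hard part the digest item alludes to is the final judgement: deciding automatically, from the resulting machine state, whether a multi-step task was actually completed.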
… and other items:
LLM Evaluators Recognize and Favour Their Own Generations. Having LLMs evaluate other LLMs (as opposed to relying on scripts, humans, or gold labels) has gained traction, but this work shows it introduces a self-preference bias. Caveat: self-detection and self-preference both increase for larger/better models, which makes sense (better models produce better answers), although the authors perform checks (Section 3.3) that rule out this being the sole dynamic at play.
The people behind the widely relied-upon Chatbot Arena, which ranks LLMs with pairwise human evaluations, have released a paper explaining some of the details, including an analysis of rater diversity and of the discriminating ability of their votes (arxiv).
Multiclass ROC is a paper proposing a new multiclass extension of ROC curves, together with a corresponding metric for multiclass AUC. Some new theory is always welcome!
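As background, the conventional way to get an AUC out of a multiclass problem is to average binary AUCs, either one-vs-rest or one-vs-one. The paper proposes a different construction, so the sketch below (plain scikit-learn on toy data) only illustrates the standard averaging approaches it departs from.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy 3-class problem: true labels and mildly informative predicted probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=200)
logits = rng.normal(size=(200, 3)) + np.eye(3)[y_true]
y_prob = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# One-vs-rest: one binary AUC per class, macro-averaged.
auc_ovr = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
# One-vs-one: binary AUCs over all class pairs, macro-averaged (Hand & Till style).
auc_ovo = roc_auc_score(y_true, y_prob, multi_class="ovo", average="macro")

print(f"macro OvR AUC: {auc_ovr:.3f}, macro OvO AUC: {auc_ovo:.3f}")
```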
The AI Mathematical Olympiad is a $10 million prize fund with a public + private benchmark for AI models that can solve International Mathematical Olympiad questions.
The AI Index Report 2024 states: “AI beats humans on some tasks, but not all,” “harder benchmarks are emerging,” and “human evaluations are in” (Chapter 2). It also mentions that “robust and standardised evaluations for LLM responsibility are seriously lacking” (Chapter 3).
Compression Represents Intelligence Linearly finds that, across 30 public LLMs, compression ability (measured in bits per character on raw text corpora) linearly correlates with average score across a range of benchmarks.
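Bits per character is just the model's total negative log2-likelihood of a text, normalised by the number of characters rather than tokens (which keeps it comparable across tokenisers). Below is a minimal sketch of that calculation using Hugging Face transformers and GPT-2; the paper evaluates much larger models on large corpora with sliding windows, so treat this as the core formula only.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def bits_per_character(text: str, model_name: str = "gpt2") -> float:
    """Negative log2-likelihood of `text` under a causal LM, per character of text."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns the mean cross-entropy in nats per predicted token.
        loss_nats_per_token = model(ids, labels=ids).loss.item()

    n_predicted = ids.shape[1] - 1                     # the first token has no prediction
    total_bits = loss_nats_per_token * n_predicted / math.log(2)
    return total_bits / len(text)


print(bits_per_character("Compression and prediction are two sides of the same coin."))
```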
The New York Times published A.I. Has a Measurement Problem (paywall), which is a layman's introduction to many of the issues with AI evaluation.
In the paper Long-form factuality in large language models, a framework is presented in which an LLM agent evaluates the factuality of long-form answers produced by another LLM by looking up references online. The authors report that this auto-grading matches or outperforms human annotators.
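Schematically, the pipeline is decompose-then-verify: split the long-form answer into atomic claims, check each claim against retrieved evidence, and aggregate. The sketch below is a deliberately stubbed illustration of that structure (naive sentence splitting, no real LLM or search calls), not the paper's implementation; their final metric also trades the precision of supported facts off against a recall target rather than reporting a plain fraction.

```python
def split_into_claims(answer: str) -> list[str]:
    # In practice an LLM prompt decomposes the answer into atomic, self-contained
    # factual claims; here we naively split on sentence boundaries.
    return [s.strip() for s in answer.split(".") if s.strip()]


def is_supported(claim: str) -> bool:
    # In practice: issue web-search queries for the claim and let an LLM reason
    # over the retrieved snippets. Stubbed out here.
    return True


def supported_fraction(answer: str) -> float:
    claims = split_into_claims(answer)
    if not claims:
        return 0.0
    return sum(is_supported(c) for c in claims) / len(claims)


print(supported_fraction("The Eiffel Tower is in Paris. It was completed in 1889."))
```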
Lastly: Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions
Contributors to this month’s digest: Wout Schellaert, Jose H. Orallo, Nando Martínez-Plumed, Lorenzo Pacchiardi
News to share? Feel free to reach out to wschell@vrain.upv.es.
Getting the digest: Once a month if you join.