A challenge offering $1M is definitely getting attention, but that doesn’t necessarily mean it should be the priority for AI, or that it rests on a more solid methodology from the AI evaluation perspective. A few years ago, François Chollet, the creator of the Keras deep learning library at Google, proposed the Abstraction and Reasoning Challenge (ARC) as part of his arXiv paper “On the Measure of Intelligence” (discussing it and its scientific lineage would need a special issue). The paper refreshes the old idea that intelligence is the ability to acquire skills and turns it into a benchmark of visual transformations (symmetry, counting, colours, rotation, etc.) that, like any good AI challenge, is easy for humans but hard for state-of-the-art AI. It is no surprise that it resembles some psychometric tests, such as Raven’s Progressive Matrices, or the traditional Bongard problems in machine vision, but with a colourful Tetris-like appearance.
All this doesn’t make the challenge bad, but it’s not better or more original than Lázaro-Gredilla et al.’s Tabletop world (Fig. 1), or the recent Abstract Visual Reasoning challenge. But linking it to “AGI” and offering $1M reminds us of some other big challenges in the past that did little to change the dominant paradigm in AI, especially if the examples are not systematically generated. By this we mean a test that is actually based on a theory of the capabilities the problems are supposed to measure, with a good understanding of the elements that make some instances more demanding than others in terms of these capabilities, and a procedural generator for held-out problems (see the sketch below for the kind of thing we have in mind).
But who cares? For those of you living in the Northern Hemisphere, this is the new AI summer project for 2024 that gives you the opportunity “to get a million bucks” and “solve AGI” at the same time.
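To make the point about systematic generation concrete, here is a minimal sketch (in Python, with names we made up for illustration; none of this is part of the ARC Prize) of a procedural generator for ARC-style grid tasks, where difficulty is explicitly parameterised by grid size, palette size and how many primitive transformations are composed into the hidden rule:

```python
# A toy procedural generator for ARC-style tasks (illustrative names only).
# Difficulty is controlled explicitly: grid size, number of colours, and how
# many primitive transformations are composed to form the hidden rule.
import numpy as np

TRANSFORMS = {
    "rotate_90": lambda g: np.rot90(g),
    "mirror_lr": lambda g: np.fliplr(g),
    "transpose": lambda g: g.T,
}

def generate_instance(size, n_colours, depth, seed=None):
    """Sample one input/output grid pair plus the rule that produced it."""
    rng = np.random.default_rng(seed)
    grid = rng.integers(0, n_colours, size=(size, size))
    rule = [str(name) for name in rng.choice(list(TRANSFORMS), size=depth)]
    out = grid
    for name in rule:
        out = TRANSFORMS[name](out)
    return {"input": grid, "output": out, "rule": rule}

# An easy instance (small grid, few colours, one-step rule) ...
easy = generate_instance(size=3, n_colours=3, depth=1, seed=0)
# ... and a harder one (larger grid, richer palette, composed rule).
hard = generate_instance(size=10, n_colours=8, depth=3, seed=0)
```

With something like this, held-out problems of calibrated difficulty can be sampled indefinitely, and contamination becomes much less of a worry.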
Findings
In a bit of a compare-to-the-baseline experiment, this paper tests whether in-context learning would actually be enough to beat instruction fine-tuning and RLHF (it is not).
This paper claims that LLM classification performance is “overclaimed”, but mainly finds that models don’t avoid the question when the correct answer is not present in a list of multiple-choice options.
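For the curious, the kind of check behind that finding can be approximated in a few lines: withhold the gold answer, offer an explicit escape option, and see whether the model takes it. This is our own sketch, not the paper’s code, and `ask_model` stands in for whatever API you happen to query:

```python
# Our own sketch of the abstention check, not the paper's code.
# `ask_model` is a placeholder: any callable that maps a prompt to a string.
import string

def build_prompt(question, distractors):
    """Multiple-choice prompt with the gold answer withheld and an escape option."""
    options = distractors + ["None of the above"]
    letters = string.ascii_uppercase[: len(options)]
    lines = [question] + [f"{l}. {o}" for l, o in zip(letters, options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines), letters[-1]  # the last letter is the escape option

def abstention_rate(items, ask_model):
    """Fraction of items where the model picks 'None of the above'."""
    hits = 0
    for item in items:
        prompt, escape = build_prompt(item["question"], item["distractors"])
        reply = ask_model(prompt).strip().upper()
        hits += reply.startswith(escape)
    return hits / len(items)
```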
Methods
Is Chatbot Arena the way to evaluate LLMs from a human perspective? A recent workshop on LLM evaluation (from an HCI perspective) suggests it is more complicated than collecting preferences: Human-Centered Evaluation and Auditing of Language Models.
More on Chatbot Arena: MixEval is not a megabenchmark (such as BigBench) but a minibenchmark: it extracts small selections of instances whose aggregate results correlate as closely as possible with Chatbot Arena (a sketch of the idea below).
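The underlying idea can be sketched quickly: greedily add the instance whose inclusion most improves rank correlation between aggregate accuracy and an external score. To be clear, this is not MixEval’s actual recipe, just an illustration; the accuracy matrix and Elo vector are assumed inputs:

```python
# Not MixEval's actual recipe, just the underlying idea: greedily add the
# instance that most improves rank correlation with an external score.
import numpy as np
from scipy.stats import spearmanr

def select_minibenchmark(accuracy, elo, k):
    """accuracy: (n_models, n_instances) 0/1 results; elo: (n_models,) target scores.
    Returns the indices of k instances whose mean accuracy best tracks elo."""
    chosen = []
    for _ in range(k):
        best_j, best_corr = None, -np.inf
        for j in range(accuracy.shape[1]):
            if j in chosen:
                continue
            subset_score = accuracy[:, chosen + [j]].mean(axis=1)
            corr, _ = spearmanr(subset_score, elo)
            corr = -1.0 if np.isnan(corr) else corr  # constant scores give no signal
            if corr > best_corr:
                best_j, best_corr = j, corr
        chosen.append(best_j)
    return chosen
```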
This paper proposes a set of metrics aimed at measuring geographic disparities in the depiction of objects and backgrounds in generated images. It finds, among other things, that objects are depicted with more realism than backgrounds, and that image generators struggle to generate modern vehicles in Africa.
Benchmarks
More on long-context reasoning: https://novelchallenge.github.io/. Not there!
LiveBench is a challenging LLM benchmark that aims to be contamination-free by updating its questions on a monthly basis. Similarly, but with the data kept private, Scale announced their SEAL leaderboard, whose questions will also be updated periodically.
RUPBench is an LLM reasoning benchmark with both syntactic and semantic perturbations, all of which have undergone expert review. Relatedly, this paper introduces phonological, morphological, and lexical distances to study cross-lingual generalisation.
DevBench: A multimodal developmental benchmark for language learning
The popular MMLU benchmark gains an update with MMLU-Pro, which integrates more challenging, reasoning-focused questions and increases the answer choices per question from four to ten.
IrokoBench is a new benchmark for African languages.
NYU releases a new computer vision benchmark (CV-Bench, to keep it simple) together with their vision-centric multimodal LLM Cambrian-1; the paper also contains an interesting analysis of the construct validity of existing multimodal benchmarks.
LMSYS (known for their Chatbot Arena, yet again) releases a challenge on predicting which answers humans would prefer, which is basically what reward models in RLHF do (see the sketch below). This is similar to RewardBench, which we wrote about in March. LMSYS now also has a vision variant of their leaderboard.
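For reference, that prediction task is essentially what the Bradley-Terry objective behind reward models optimises: score both answers and maximise the probability that the human-preferred one comes out on top. A minimal PyTorch sketch, where `reward_model` is our placeholder for anything that returns one scalar per (prompt, answer) pair, not LMSYS’s setup:

```python
# A minimal Bradley-Terry sketch of what a reward model optimises.
# `reward_model` is a placeholder returning one scalar per (prompt, answer) pair.
import torch.nn.functional as F

def preference_loss(reward_model, prompts, chosen, rejected):
    """Negative log-likelihood that the human-preferred answer scores higher."""
    r_chosen = reward_model(prompts, chosen)      # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)  # shape: (batch,)
    # P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```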
Contributors to this month’s digest: Wout Schellaert, Jose H. Orallo, Nando Martínez-Plumed.
News to share? Feel free to reach out to wschell@vrain.upv.es.
Getting the digest: Once a month if you join us.