Evaluating Reliability from Human Expectations
A car that doesn’t fly is not unreliable, because no one expects cars to fly. This is what happened with early large language models (LLMs): humans didn’t expect much from them. But in the past few years, as LLMs became more and more powerful, people started relying on them, perhaps too much, to the point of using them as calculators, dictionaries or atlases. A recent study published in Nature reveals that, as LLM families scaled up (in parameters, data and compute) and shaped up (e.g., refined with human feedback), their models didn’t become more reliable.

To reach this conclusion, the paper introduces some methodological innovations in AI evaluation. First, it analyses the models’ answers using three categories (correct, incorrect and avoidant), showing that recent models have almost totally eliminated the epistemic avoidance category. Recent models are anti-Socratic: they almost never say “I don’t know”. Second, the paper studies this distribution of correct, incorrect and avoidant answers in terms of difficulty expectations taken from a human study, showing that, even for instances that humans consider very difficult and where success is unlikely, the most recent models almost always give an answer. As a result, new models make more errors than old models, and they make them across all levels of difficulty. The paper also presents the first scaling laws for the proportion of incorrect answers over incorrect plus avoidant answers, showing that models have become more ultracrepidarian: they give answers beyond their competence. This is related to the phenomena of bullshit and hallucination in LLMs, but explored through the lens of instance difficulty.

A second human study shows that oversight is insufficient to compensate for these reliability issues. While humans recognise high-difficulty tasks, they often fail to spot LLM mistakes, indicating over-reliance on the models. Overall, there are no safe operating conditions, that is, regions where either model error or human supervision error is negligible, even for low-difficulty tasks.
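As a rough illustration of the kind of analysis behind these findings (a minimal sketch of our own, not the authors’ code; the data and column names are hypothetical), one can bin graded answers by human-perceived difficulty and compute, per bin, the share of correct, incorrect and avoidant answers together with the incorrect over incorrect-plus-avoidant ratio:

import pandas as pd

# Each row is one graded model answer plus a human difficulty score (0-100)
# for the instance. Data and column names are made up for illustration.
answers = pd.DataFrame({
    "category":   ["correct", "incorrect", "avoidant", "incorrect", "correct", "incorrect"],
    "difficulty": [12, 35, 35, 60, 60, 90],
})

# Bin instances by human-perceived difficulty.
answers["bin"] = pd.cut(answers["difficulty"], bins=[0, 25, 50, 75, 100],
                        labels=["easy", "medium", "hard", "very hard"])

# Share of each answer category within each difficulty bin.
shares = (answers.groupby("bin", observed=True)["category"]
                 .value_counts(normalize=True)
                 .unstack(fill_value=0.0))

# Proportion of non-correct answers that are wrong rather than avoidant
# (NaN where a bin has no non-correct answers at all).
incorrect = shares.get("incorrect", 0.0)
avoidant = shares.get("avoidant", 0.0)
shares["incorrect_over_not_correct"] = incorrect / (incorrect + avoidant)

print(shares)

The last column is the quantity whose scaling behaviour the paper tracks across model generations.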
The study found these issues consistent across multiple LLM families (GPT, Llama and BLOOM). The paper doesn’t cover the latest models, but we begged the authors to explore some for this newsletter: anecdotal evidence on very recent models, such as OpenAI’s o1 models, Anthropic’s Claude-3.5-Sonnet and Meta’s Llama-3.1-405B, suggests the reliability problems persist.
More on difficulty
A new paper where assessor models and eXplainable AI (XAI) techniques are used to predict and explain instance hardness.
A PhD dissertation that tries to answer “What makes a task difficult for machine learning models in NLP datasets?”.
What’s harder than MMLU? MMLU-Pro. And what’s harder than MMLU-Pro? MMLU-Pro+: This benchmark focuses on higher-order reasoning and tries to prevent shortcut learning (exploiting superficial patterns to get the correct answer).
You saw this coming… What’s harder than MMLU-Pro+? Humanity's Last Exam is the grandiose name chosen for a recent call looking for contributors to propose extremely difficult questions at the limit of what humans can do, to be compiled for the next generations of LLMs. It is organised by the Center for AI Safety (CAIS) and the startup Scale AI. Reuters coverage.
Contamination
This paper evaluates five contamination detection methods for LLMs and finds that the state of the art still has many limitations.
Syntheval uses LLMs to generate variations of benchmark items, making performance drop from 92.4% to 10.17%. But are the new items of comparable difficulty?
Another case showing that, as the test distribution deviates from the examples used during training, performance drops very significantly, pointing to a lack of generalisation, if not contamination. In this case, the target skill is linguistic reasoning without relying on pre-existing language-specific knowledge.
Psychometrics, psychology and cognition
Item response theory (IRT) applied to bias (arguably funny title), image classification (neutral title) and explainable AI (eXirt, no comment about the acronym).
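For readers less familiar with IRT, here is a minimal sketch of the two-parameter logistic (2PL) model that this line of work builds on (our notation; the papers above may use other IRT variants):

P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + e^{-a_j(\theta_i - b_j)}}

where \theta_i is the ability of respondent i (here, a model or classifier), b_j is the difficulty of item j (the instance hardness being estimated) and a_j its discrimination, all fitted from a matrix of successes and failures per respondent and item.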
A paper presenting Deep Boltzmann Machine models as an alternative to IRT to estimate instance hardness and classifier predictive performance.
Other alternatives to IRT in Psychometrics? Classical Test Theory! Here, applied to dataset evaluation.
Machine psychology, yet again? Table 1 in this paper covers and classifies 25 “machine psychology” studies.
This paper uses the term cognitive development to refer to how capabilities increase as parameter size and the optimisation objective evolve. It refers to Piaget’s cognitive development theory, but we have to clarify that LLMs (at least as studied in this paper) don’t develop; what is being described is the evolution of model families.
Miscellanea
Two different proposals for the evaluation of cooperation and competition in LLM multi-agent systems. One is a benchmark with a martial name: BattleAgentBench, and the other is a contest with a more peaceful name: Concordia.
Eureka! An extensive new benchmark showing uneven “performance” (not capability) of LLMs across different dimensions. Major take-away: radial plots are becoming the standard representation for LLM profiles.
Windows Agent Arena: a reproducible, general environment to assess multimodal agents for Windows.
Holistic Bias-Benchmarking Pipeline, very timely for the US Elections: “Llama3.1 and Gemma2 role-play Trump notably more intensively than Biden and Harris, indicating roleplaying performance bias in these models”. Can we simulate a second debate?
Contributors to this month’s digest: Jose H. Orallo, Nando Martínez-Plumed, Joseph Castellano
News to share? Feel free to reach out to wschell@vrain.upv.es.
Getting the digest: Once a month if you join: