Build AI that does AI evaluation for you, then retire
Can AI evaluation be automated? In a summer (or winter) where everybody is discussing when, and not if, AI research scientists will be automated, we ask a more specific question, or perhaps a more general one (because of its implications for alignment): how much of AI evaluation can be automated? But isn’t AI evaluation already automated once a benchmark is created? If you answer yes to this question, you’re still stuck in the world of multiple-choice benchmarks. In general, we can talk about the generation of benchmarks, the annotation of the examples in those benchmarks and the interpretation of the results, just to start with. In this post, Eugene Yan presents an up-to-date account of LLMs grading the outputs of other LLMs, also known as ‘LLM-as-a-judge’, ‘LLM evaluators’, ‘critics’ or ‘verifiers’. The post brings insights on “how to apply, evaluate, and operate LLM-evaluators”. The takeaway is that there’s a long way to go before they do a good job in general, especially because we keep raising the bar on what we want to evaluate automatically, and because AI is becoming more capable by the day. But it’s clear that automated grading works well in some areas, if done with care and interpreted and supervised by human experts. This diversity of results, depending on the area and the expectations, explains the discrepancy in sentiment about the state of the art of AI evaluation automation between this post and other papers dealing with LLM evaluators, such as https://arxiv.org/abs/2403.02839 or https://arxiv.org/pdf/2406.18403 (which we covered last month in our digest), despite the recent improvement (https://arxiv.org/abs/2408.02666) on one of the benchmarks for evaluating evaluators: RewardBench.
However, AI judges become a necessity for AI evaluation and alignment as models become smarter (scalable evaluation and oversight), with humans no longer able to score them reliably. In fact, the limitations of human evaluation have been known for a long time, with humans easily fooled by minor details and prone to biases (such as preferring a confident attempt over an abstention). This has been a constant in many of the techniques used to align LLMs, such as RLHF. So it is no surprise that this is reflected in leaderboards based on human evaluation. This very insightful blogpost shows the dynamics of ChatArena, given the human preference for (1) models that follow certain stylistic patterns and (2) ultracrepidarian models that tend to comply with user requests (at the risk of hallucination). Perhaps humans don’t set the bar very high.
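For readers who have not yet played with LLM evaluators, here is a minimal sketch of the LLM-as-a-judge pattern discussed above. It is not taken from Eugene Yan’s post: the `complete` callable (any function that sends a prompt to a model and returns its text), the reference-based grading and the 1–5 rubric are assumptions for illustration only.

```python
# Minimal LLM-as-a-judge sketch: one model grades another model's answer
# against a reference. `complete` is any callable that sends a prompt to a
# judge model and returns its text; plug in your own API wrapper.
from typing import Callable

JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Rate the candidate from 1 (wrong) to 5 (fully correct and well supported).
Reply with the number only."""

def judge(question: str, reference: str, candidate: str,
          complete: Callable[[str], str]) -> int:
    """Ask the judge model for a 1-5 score and parse its reply defensively."""
    reply = complete(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    digits = [c for c in reply if c.isdigit()]
    score = int(digits[0]) if digits else 1  # unparseable reply -> lowest score
    return min(max(score, 1), 5)
```

Even a toy judge like this needs its own evaluation against human labels and ongoing supervision, which is precisely the kind of care the post argues for.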
Commentary
A very complete introduction to Foundation Model Evaluation from the Ada Lovelace Institute, compiled from bibliographic research and expert interviews, with an emphasis on safety (reflecting the shift the field of AI evaluation has undergone in the past two years) and a focus on benchmarking and red teaming as the dominant paradigms.
A comprehensive meta-analysis of AI safety benchmarks that, despite the provocative title of “safetywashing”, calls for separating safety evaluation from capability evaluation, a confusion that didn’t exist a few years ago and that is creating many of the problems of AI evaluation today, and of alignment too. It seems that the field of AI “evals” / “dangerous capabilities” is finally realising the mess that has been created. The paper, however, passes on a deeper analysis of the structure of capabilities, going no further than correlations.
A paper adding yet more evidence that psychometric tests designed for humans do not yield calibrated results for AI systems, but that adaptive testing holds promise. Perhaps more informative is their finding that many hallucinations are not triggered by an inability to answer correctly, as the same questions are answered correctly in other situations.
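As a side note, ‘adaptive testing’ here refers to the psychometric practice of choosing each next item according to the current ability estimate. The toy sketch below is ours, not the paper’s: the Rasch model, the update step and the item pool are assumptions, just to show the basic loop.

```python
# Toy adaptive testing loop under a Rasch (one-parameter IRT) model:
# after each response, nudge the ability estimate and pick the item whose
# difficulty is closest to it (where the item is most informative).
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def update_ability(ability: float, difficulty: float, correct: bool,
                   step: float = 0.5) -> float:
    """One gradient step on the log-likelihood of the observed response."""
    return ability + step * ((1.0 if correct else 0.0) - p_correct(ability, difficulty))

def next_item(ability: float, pool: dict[str, float]) -> str:
    """Pick the remaining item whose difficulty is closest to the ability estimate."""
    return min(pool, key=lambda item: abs(pool[item] - ability))
```

Run against an LLM instead of a human test-taker, the same loop concentrates items around the difficulty level where the model’s behaviour is most informative, which is roughly the promise the paper points to.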
As is well known in psychology, test items usually involve several demands that should help explain and predict performance, but some of these demands are unrelated to the construct the test is meant to measure. The specific question of this paper is whether these irrelevant demands have a stronger effect on weaker LLMs than on stronger ones. Guess what: they affect weaker LLMs more. As expected, with strong models we can take for granted that they handle many unrelated demands well, and hence we can focus on the thing being tested. This is similar, in a way, to the prompt-sensitivity problems that earlier models had when we evaluated their capabilities.
METR presents an update on their capability evaluations. Interestingly, they use the number of hours a human requires to complete a task as a way of scaling task difficulty, similar to what Watt did with horsepower. They also compare costs between humans and models.
Ways of determining what makes a software task hard, because “a model achieving a 90% score on a benchmark of predominantly easy tasks is likely less capable than a model achieving a 90% score on a benchmark containing predominantly difficult tasks”. Finally, we see the elephant in the room (answer: it’s item difficulty).
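To make the quoted point concrete, here is a toy calculation of ours (not the paper’s metric): two runs with the same 90% raw accuracy but very different difficulty profiles, scored with a simple difficulty-weighted accuracy. The difficulty values are made up.

```python
# Same 90% raw accuracy, very different difficulty-weighted accuracy.
# Each item is a (solved, difficulty) pair with difficulty in (0, 1].
def weighted_accuracy(results: list[tuple[bool, float]]) -> float:
    """Accuracy where each item counts proportionally to its difficulty."""
    total = sum(d for _, d in results)
    return sum(d for ok, d in results if ok) / total

easy_bench = [(True, 0.2)] * 9 + [(False, 0.9)]  # mostly easy items, the hard one missed
hard_bench = [(True, 0.8)] * 9 + [(False, 0.2)]  # mostly hard items, the easy one missed

print(weighted_accuracy(easy_bench))  # ~0.67
print(weighted_accuracy(hard_bench))  # ~0.97
```

A real treatment would estimate item difficulty empirically (e.g. from human time-to-solve or solve rates), but even this toy version shows why raw accuracy alone hides the elephant.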
A new evaluation framework, StructEval, introduces item variations and connects them structurally using Bloom’s taxonomy, to address contamination, among other things.
Benchmarks
In the June AI evaluation digest we discussed, in a humorous tone, the new challenge offering $1M for solving the ARC-AGI benchmark. Now a sound and insightful analysis of that challenge comes from Melanie Mitchell’s Substack, “AI: A Guide for Thinking Humans”. Totally recommended!
More on tests like Raven’s Progressive Matrices, the ARC-AGI challenge and many others in this paper: Kid-inspired Visual Analogies for Testing Large Multimodal Models. But this time, guess what: AI fails on problems that even a kid could solve.
Cognitive Assessment of Language Models (CALM) is a new benchmark inspired by (neuro)psychology, covering numeric reasoning, visual-spatial reasoning, attention, memory, executive functioning, etc.
More cognitive! CogLM (Cognitive Ability Evaluation for Language Model) is a new benchmark based on Piaget’s Theory of Cognitive Development.
OpenAI validates and significantly refines SWE-bench, a benchmark to evaluate software engineering capabilities, releasing a human-verified subset.
CeSIA (Centre pour la Sécurité de l'IA), a French non-profit organisation dedicated to the safety of AI systems, presents BELLS, a new platform and leaderboard for evaluating the quality of safeguards. A good summary here.
ToolSandbox, a benchmark for evaluating how well LLMs use external tools.
This MMAU benchmark is holistic, and it’s for agents. It includes Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics, organised into “five essential capabilities”: Understanding, Reasoning, Planning, Problem-solving, and Self-correction.
Change a letter to MMAU and you have MMIU! MMIU means Multimodal Multi-image Understanding, a benchmark meant for vision-language models.
MMAU, MMIU, MEOW! No more benchmarks!
CFPs (Events and Grants)
EvalEval workshop at NeurIPS 2024 (Evaluating AI evaluation, focusing on social impact). Deadline: Sep 20.
Large Language Model (LLM) Evaluation Research Grants from Meta. Deadline: Sep 6.
Epilogue:
Has Professor Mitchell passed the Turing Test Tiny Twist Test (T5)? This test is passed when someone (human or machine) writes a paper about the Turing Test and presents a new twist on it that hasn’t been made before. In Science, no less. Judge for yourself! Or ask an LLM to judge it for you.
Contributors to this month’s digest: Jose H. Orallo, Wout Schellaert, Pablo Moreno-Casares
News to share? Feel free to reach out to wschell@vrain.upv.es.
Getting the digest: Once a month if you join: