2023 September “AI Evaluation” Digest
Some media coverage and introductory pieces about the state of AI evaluation:
“AI hype is built on high test scores. Those tests are flawed.” (MIT Technology Review). A review of the state of AI evaluation for the general public, echoing many of the well-known “wrongs” of AI evaluation.
“A test of artificial intelligence” (Nature). The Turing Test, not again, please! Still, the piece offers some interesting insights beyond that.
Administration, policy and tech:
Governor of California Gavin Newsom signed an executive order outlining California’s strategy towards a responsible process for the evaluation and deployment of AI (source).
David Krueger and Yarin Gal were announced as the first research directors of the UK Frontier AI Task Force, together with a host of heavyweight external advisory board members (press release). The task force is working extensively on “AI evals”, which are mostly AI testing and risk evaluation efforts, including red teaming.
Anthropic released its Responsible Scaling Policy, with a focus on evaluations aimed at catching early warning signs (source).
Test and Evaluation (T&E) Methodology from scale.com (source). More use of the term “AI evals”, here mostly referring to testing and red teaming.
Language models:
“AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models” (arxiv). A new benchmark for large language models from BenchCouncil. Despite the unfortunate name, it has a very interesting feature: it is human-referenced, with five difficulty levels per instance.
“Efficient Benchmarking (of Language Models)” (arxiv). Shows that something well known in ML evaluation also happens in HELM: when aggregate metrics are computed over a battery of datasets, the leaderboard can change with the removal of a single dataset.
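A toy illustration of this aggregation sensitivity (not from the paper; the model names and scores below are made up):

```python
# Toy illustration: mean-score aggregation over a battery of datasets can flip
# a leaderboard when a single dataset is removed. Scores are made up.

scores = {  # model -> score per dataset
    "model_A": {"d1": 0.80, "d2": 0.75, "d3": 0.20},
    "model_B": {"d1": 0.70, "d2": 0.72, "d3": 0.60},
}

def leaderboard(scores, exclude=()):
    """Rank models by mean score over the included datasets."""
    means = {
        m: sum(v for d, v in ds.items() if d not in exclude) /
           sum(1 for d in ds if d not in exclude)
        for m, ds in scores.items()
    }
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

print(leaderboard(scores))                  # model_B leads on the full battery
print(leaderboard(scores, exclude={"d3"}))  # dropping d3 puts model_A on top
```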
LLM Reversal Curse: https://owainevans.github.io/reversal_curse.pdf. LLMs trained on sentences of the form “A is B” underperform on questions of the form “B is A”; the authors call this the Reversal Curse and present it as a basic failure of logical deduction in LLMs. For example, a model that can answer “Who is Tom Cruise’s mother?” (Mary Lee Pfeiffer) may fail on “Who is Mary Lee Pfeiffer’s son?”.
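A minimal sketch of how such a bidirectional probe can be set up (an illustration of the idea, not the authors’ evaluation code; the fact used is fictitious, in the spirit of the paper):

```python
# Sketch of the evaluation idea behind the Reversal Curse: take facts of the
# form "A is B" and probe the model in both directions.

facts = [
    ("Daphne Barrington", "the director of 'A Journey Through Time'"),  # fictitious fact
]

def build_probes(facts):
    probes = []
    for a, b in facts:
        # Forward: same order as the training sentence "A is B" -- models tend to do fine.
        probes.append({"question": f"Who is {a}?", "answer": b, "direction": "forward"})
        # Backward: order reversed relative to training -- where the Reversal Curse bites.
        probes.append({"question": f"Who is {b}?", "answer": a, "direction": "backward"})
    return probes

for p in build_probes(facts):
    print(f"[{p['direction']}] {p['question']}  (expected: {p['answer']})")
    # Accuracy would be measured by checking whether the model's reply contains p['answer'].
```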
Foundations and evaluation methodology:
“Inferring Capabilities from Task Performance with Bayesian Triangulation” (arxiv). Introduces the concept of measurement layouts to infer AI capabilities, illustrating them in the Animal AI evaluation environment. Highly recommended! (this comment is biased).
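For readers new to the general idea of inferring latent capabilities from task results, here is a toy Bayesian sketch, far simpler than the measurement layouts in the paper and not taken from it: a single capability is inferred by grid approximation from pass/fail results on tasks of known difficulty.

```python
import math

# Toy capability inference (not the paper's measurement layouts): infer a single
# latent capability from pass/fail results on tasks of known difficulty, using a
# logistic link and a grid approximation of the posterior.

results = [(1.0, 1), (2.0, 1), (3.0, 1), (4.0, 0), (5.0, 0)]  # (difficulty, success), made-up data

def likelihood(capability, difficulty, success):
    p = 1.0 / (1.0 + math.exp(difficulty - capability))  # P(success) grows with capability
    return p if success else 1.0 - p

grid = [c / 10 for c in range(101)]   # candidate capability values in [0, 10]
weights = []
for c in grid:
    w = 1.0                           # flat prior
    for difficulty, success in results:
        w *= likelihood(c, difficulty, success)
    weights.append(w)

total = sum(weights)
posterior = [w / total for w in weights]
print("posterior mean capability ≈", round(sum(c * p for c, p in zip(grid, posterior)), 2))
```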
More specific papers:
“Human Uncertainty in Concept-Based AI Systems” (arxiv). Mostly relates to training, but labels are always relevant to evaluation! It was presented at AIES in August.
“Federated benchmarking of medical AI” (Nat Mach Intell). Introduces MedPerf, an open platform for benchmarking AI models in the medical domain in a federated way.
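A rough sketch of what “federated” means here (a conceptual illustration, not MedPerf’s actual API): each site evaluates the model on its own private data and only aggregate metrics leave the site.

```python
# Conceptual sketch of federated benchmarking: sites report only summary
# metrics; raw records never leave the site.

def evaluate_locally(model, local_data):
    """Runs at the site; returns only summary metrics."""
    correct = sum(1 for x, y in local_data if model(x) == y)
    return {"n": len(local_data), "accuracy": correct / len(local_data)}

def aggregate(site_reports):
    """Runs at the benchmark server; size-weighted average of per-site metrics."""
    total = sum(r["n"] for r in site_reports)
    return sum(r["accuracy"] * r["n"] for r in site_reports) / total

# Made-up example: a trivial threshold "model" and two sites with toy labelled data.
model = lambda x: x >= 0.5
site_a = [(0.9, True), (0.2, False), (0.6, True)]
site_b = [(0.4, True), (0.8, True)]
print(aggregate([evaluate_locally(model, site_a), evaluate_locally(model, site_b)]))  # 0.8
```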
Contributors to this month’s digest: Jose H. Orallo, Wout Schellaert
How to contribute: Feel free to reach out to wschell@vrain.upv.es if you want to get involved, or if you have news to share that you are not comfortable posting as a standalone post.