2023 December “AI Evaluation” Digest
Our last digest of 2023 brings yet more news on AI evaluation!
This year has been full of activity in this increasingly recognised area, with the number of events, papers, initiatives, benchmarks and platforms growing very notably. Following this trend, we expect AI evaluation to become even more hectic in the future, so we are planning some changes to the digest for 2024. Stay tuned!
Cognitive and capability-oriented evaluations are now commonplace!
Have we built machines that think like people? (arxiv) Evaluating cognitive capabilities in multimodal models, covering intuitive physics, causal reasoning, theory of mind, etc.
Running cognitive evaluations on large language models: The dos and the don’ts (arxiv) A good compendium of what everybody knows, or not? At least what everybody should know!
AAAI tutorial on Measurement Layouts for Capability-oriented AI Evaluation (link) will be held at AAAI 2024 in Vancouver.
And relatedly, the Animal AI environment 3.0 is out! (link)
GRASP: Grounding and Situated Physics evaluation of multimodal language models (https://arxiv.org/pdf/2311.09048.pdf)
Of course, more benchmarks, competitions and prizes:
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (arxiv)
AI-MO Prize: Artificial Intelligence Mathematical Olympiad Prize (website): XTX Markets has launched a $10 million AI-MO Prize for AI models that can solve difficult, International Mathematical Olympiad (IMO)-level mathematical problems.
GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arxiv): comes with data on humans answering and validating the questions.
MLCommons competition on benchmarking optimizers (website).
AlignBench: guess what, it’s an Alignment Benchmark (in Chinese) https://arxiv.org/abs/2311.18743
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models (https://arxiv.org/abs/2311.18232)
LLMEval: a benchmark with a large number of scoring annotations, which also makes it useful for prediction. https://arxiv.org/pdf/2312.07398.pdf (questions only in Chinese)
Evaluation and safety are getting more and more intertwined!
News from recent events: EMNLP & NeurIPS
ROBBIE: Robust Bias Evaluation of Large Generative Language Models (arxiv)
More on compositional benchmarks: https://aclanthology.org/2023.conll-1.19/
The Emergent Abilities Mirage paper (see our September 2023 digest) (https://openreview.net/forum?id=ITw9edRDlD) won one of the NeurIPS 2023 Outstanding Main Track awards!
Another edition of the NeurIPS 2023 Datasets and Benchmarks Track (link) took place, with two papers receiving awards.
Miscellanea:
Comparing humans and AI: Comparing the Evaluation and Production of Loophole Behavior in Children and Large Language Models (openreview)
Contamination: a technique to detect and understand contamination (https://arxiv.org/pdf/2311.12337.pdf)
Uncertainty estimation: Polygraph, a Python tool for integrated uncertainty estimation in LLMs (https://arxiv.org/pdf/2311.07383.pdf)
Contributors to this month’s digest: Jose H. Orallo, Wout Schellaert, Nando Martínez-Plumed
How to contribute: Feel free to reach out to wschell@vrain.upv.es if you want to get involved, or if you have news to share that you are not comfortable posting as a standalone post.