January 2024 "AI Evaluation" Digest
Welcome to the new home of the AI Evaluation newsletter — Substack!
We are kicking off 2024 with a fresh look and a better user interface for our monthly digest.
Our intention is still to send a monthly newsletter with our favorite papers on AI evaluation: a signal through the noise in this burgeoning field.
As usual, we will distribute our digest on the last Friday of every month.
We encourage your input for future monthly digests: please reach out if you want to get involved, have news to highlight, or have general feedback to share. You are also welcome to engage in the comments section of the monthly posts.
If you think someone would appreciate this newsletter, please forward it to them so our community can grow!
Thank you,
Jose H. Orallo, Nando Martínez-Plumed, Wout Schellaert, Lexin Zhou, Yael Moros, Joseph Castellano
ChaLearn.org is preparing a collaborative book on AI competitions and benchmarks, and they are looking for contributors.
Beyond that, LLMs and their multi-modal variants keep getting a lot of attention (including ours):
How predictable is language model benchmark performance? finds that performance aggregated over many individual tasks is decently predictable as a function of training compute scale.
State of What Art? A Call for Multi-Prompt LLM Evaluation is exactly what the title says.
Relying on the Unreliable: The Impact of Language Models’ Reluctance to Express Uncertainty studies the reliability of LLMs, finding both that calibration is quite poor and that humans over-rely on model outputs. One possible cause: humans are averse to texts that express uncertainty.
Leveraging Large Language Models for NLG Evaluation: A Survey is a very helpful account of approaches that use LLMs to grade the outputs of other LLMs.
For the inaugural Substack issue, we also have a special collection of benchmarks for LLMs as agents:
In AgentBench: Evaluating LLMs as Agents, models interact with SQL databases, bash shells, and other simulated environments, acting iteratively over multiple turns. The benchmark only considers text-based LLMs.
MLAgentBench (Benchmarking Large Language Models As AI Research Agents) uses LLMs to implement ML models for a given dataset; the agent can read/write files, execute code and inspect output.
GAIA: a benchmark for General AI Assistants proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. The questions are split across three difficulty levels.
WebArena: A Realistic Web Environment for Building Autonomous Agents runs local versions of popular websites (such as Reddit, Wikipedia, and GitLab). The agents need to perform tasks that require navigating the web pages as a human would, observing either the HTML or a screenshot. It is a more comprehensive version of the web browsing and web shopping tasks in AgentBench.
Language Models can Solve Computer Tasks tests LLMs on an older benchmark originally developed for RL agents, in which multiple environments (e.g. a terminal, a website, …) are unified under the same HTML framework. The agents act by specifying keystrokes and mouse clicks, and the entire environment description is translated to text. It is, however, simpler and less realistic than WebArena.
InFoBench: Evaluating Instruction Following Ability in Large Language Models decomposes complex instructions into simpler ones and introduces the Decomposed Requirements Following Ratio metric.
SmartPlay: a benchmark for LLMs as intelligent agents introduces a package that translates six agentic games into textual descriptions to evaluate the performance of LLMs.
Evaluating Language Model Agency through Negotiations introduces a benchmark based on negotiations between LLMs.
With a similar multi-agent viewpoint, MAgIC: Benchmarking Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration evaluates LLMs through classic game theory scenarios and looser games such as Undercover and Chameleon.
Contributors to this month’s digest: Jose H. Orallo, Lorenzo Pachiardi, Wout Schellaert
How to contribute: Feel free to reach out to wschell@vrain.upv.es if you want to get involved, or if you have news to share that you are not comfortable posting as a standalone post.
Getting the digest: delivered once a month if you subscribe.