Welcome to our October newsletter! This month we're excited to share a wealth of new publications on AI evaluation, along with insights from the recent ECAI 2024 conference, which celebrated its 50th anniversary in historic Santiago de Compostela, Galicia. Parabéns e feliz aniversario (congratulations and happy anniversary)!
Developments at ECAI 2024:
The latest from ECAI showcases significant work on AI evaluation, with papers exploring new evaluation metrics for better cross-domain performance in 3D object detection, and Dirichlet logistic Gaussian processes offering improved evaluation for complex systems. Work on UI design and fairness (a best paper award winner) also put evaluation at the forefront while introducing novel methods for improving model effectiveness.
New benchmarks have been presented, such as adversarial robustness in speech emotion recognition; FlowLearn, for evaluating Vision-LLMs on flowchart understanding; PrimeKGQA, a biomedical knowledge graph QA dataset; ECLIPSE, for prolonged engagement assessment in online learning; and EthiX, for argument scheme classification in ethical debates. There are also new papers exploring data contamination with black-box dataset watermarking approaches, and examining how errors from human, AI, and random sources affect the fine-tuning of LLMs.
On the topic of difficulty, LLMs are used to predict the difficulty of linguistic tasks, providing nuanced insights beyond traditional metrics.
A paper on calibration (analysing focal loss and temperature scaling) won another best paper award. It shows a loss decomposition in which one component is a proper scoring loss and the other is a term that tends to compensate for the overestimation of confidence that happens from training to test.
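For readers who want the background, the two ingredients analysed are standard, and the decomposition can be written schematically as follows (our notation, not necessarily the paper's exact result):

```latex
% Focal loss with focusing parameter \gamma (\gamma = 0 recovers cross-entropy):
\[
\mathrm{FL}_\gamma(p, y) = -(1 - p_y)^{\gamma} \log p_y
\]
% Temperature scaling rescales the logits z by a scalar T > 0 fitted on held-out data:
\[
\hat{p} = \mathrm{softmax}(z / T)
\]
% Schematic form of the decomposition discussed in the paper:
\[
L(p, y) \;=\; \underbrace{L_{\mathrm{PS}}(p, y)}_{\text{proper scoring loss}} \;+\; \underbrace{R(p)}_{\text{confidence-compensation term}}
\]
```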
Finally, a panel on the next 50 years of AI research showed strong disagreements in what AI research will look like and who (or what) will do it, but there was consensus that AI evaluation is a major research area that should help us navigate the decades ahead.
Methods:
A lot of new approaches aim to improve the evaluation and performance of (mostly) LLMs, from innovative ranking systems to enhanced efficiency strategies:
An ICANN 2024 paper proposes a Ranking-based Automatic Evaluation Method (REM) that gets multiple LLMs to play judge, jury, and evaluator, ensuring that LLM performance is impartially assessed and ranked.
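As a rough illustration of the multi-judge idea, here is a minimal sketch (our own, with a hypothetical query_model helper standing in for whatever client is used; the actual REM prompting and aggregation will differ):

```python
# Minimal sketch of ranking-based, multi-judge evaluation.
from itertools import combinations
from collections import defaultdict

def query_model(judge: str, prompt: str) -> str:
    """Hypothetical call to an LLM acting as a judge; returns its raw answer."""
    raise NotImplementedError("plug in your model client here")

def rank_candidates(task: str, answers: dict[str, str], judges: list[str]) -> list[str]:
    """Each judge compares every pair of candidate answers; wins are summed
    across judges and candidates are ranked by total wins (Borda-style)."""
    wins = defaultdict(int)
    for a, b in combinations(answers, 2):
        for judge in judges:
            prompt = (f"Task: {task}\n"
                      f"Answer A ({a}): {answers[a]}\n"
                      f"Answer B ({b}): {answers[b]}\n"
                      "Which answer is better? Reply with exactly 'A' or 'B'.")
            verdict = query_model(judge, prompt).strip().upper()
            wins[a if verdict.startswith("A") else b] += 1
    return sorted(answers, key=lambda name: wins[name], reverse=True)
```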
LEVERWORLDS is a framework for creating physics-inspired scenarios to evaluate the learning efficiency of LLMs. The authors find that transformers can perform well on these tasks, but are less sample efficient than traditional methods.
This paper explores the relationship between LLMs and vision-language models (VLMs), showing that as LLMs grow in size, they partially converge towards representations similar to those of VLMs. This has implications for understanding multimodal processing and the development of internal world models in computational systems.
β-calibration refines confidence assessment in generative QA by ensuring calibration across different question-answer groups using new post-hoc techniques such as β-binning (adjusting confidence scores within grouped QA pairs) and scaling-β-binning (adding a scaling step to prevent overfitting).
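For intuition, a group-wise binning calibrator can be sketched as below; this is a generic stand-in, and the paper's β-binning and scaling-β-binning differ in the details (e.g. the added scaling step):

```python
import numpy as np

def group_binning_calibrate(conf, correct, groups, n_bins=10):
    """Generic per-group histogram binning: within each group of QA pairs,
    replace each confidence score by the empirical accuracy of its bin.
    Simplified stand-in for beta-binning, not the paper's exact procedure."""
    conf, correct, groups = map(np.asarray, (conf, correct, groups))
    calibrated = np.empty_like(conf, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for g in np.unique(groups):
        mask = groups == g
        bins = np.clip(np.digitize(conf[mask], edges[1:-1]), 0, n_bins - 1)
        for b in range(n_bins):
            in_bin = bins == b
            if in_bin.any():
                # Empirical accuracy of this bin within this group.
                calibrated[np.where(mask)[0][in_bin]] = correct[mask][in_bin].mean()
    return calibrated
```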
A new method to efficiently evaluate LLMs by selecting key subsets of prompts using an RL approach, reducing evaluation costs while maintaining accuracy.
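The underlying goal (find a small prompt subset whose scores track the full benchmark) can be illustrated with a much simpler greedy selection than the paper's RL formulation; the sketch below is only that illustration:

```python
import numpy as np

def greedy_prompt_subset(score_matrix: np.ndarray, k: int) -> list[int]:
    """Greedily pick k prompt indices whose mean score best correlates, across
    models, with the mean score over the full prompt set. score_matrix has
    shape (n_models, n_prompts) with per-prompt scores from past evaluations."""
    full_mean = score_matrix.mean(axis=1)
    chosen: list[int] = []
    for _ in range(k):
        best_idx, best_corr = None, -np.inf
        for j in range(score_matrix.shape[1]):
            if j in chosen:
                continue
            subset_mean = score_matrix[:, chosen + [j]].mean(axis=1)
            corr = np.nan_to_num(np.corrcoef(subset_mean, full_mean)[0, 1], nan=-1.0)
            if corr > best_corr:
                best_idx, best_corr = j, corr
        chosen.append(best_idx)
    return chosen
```

Models are then evaluated only on the chosen prompts, trading a small loss in fidelity for a large reduction in evaluation cost.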
Researchers propose a framework to identify how LLMs choose between internal and external knowledge sources.
IBM Research presented a confusion-based uncertainty method to improve the reliability of LLM scores when these models are used as judges in a variety of tasks.
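One generic way to turn a judge's confusion matrix into per-judgement uncertainty (not necessarily IBM's exact recipe) is to apply Bayes' rule to error rates estimated on a small labelled calibration set:

```python
import numpy as np

def verdict_posterior(confusion: np.ndarray, prior: np.ndarray, verdict: int) -> np.ndarray:
    """Given a judge's confusion matrix C[true, predicted] and a prior over true
    labels, return P(true label | judge said `verdict`). The posterior's spread
    (e.g. its entropy) can serve as an uncertainty score for that judgement."""
    likelihood = confusion[:, verdict]   # P(verdict | true label)
    unnorm = likelihood * prior          # joint, up to a constant
    return unnorm / unnorm.sum()

# Example: a binary good/bad judge that is right 80% on "good" and 70% on "bad".
C = np.array([[0.8, 0.2],
              [0.3, 0.7]])
posterior = verdict_posterior(C, prior=np.array([0.5, 0.5]), verdict=0)
print(posterior)  # -> roughly [0.73, 0.27]: the "good" verdict carries real uncertainty
```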
Expert Router improves the efficiency and scalability of LLM inference by routing tasks to specialised models, ensuring minimal latency and stable performance, even with many users.
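The routing idea itself is simple; the toy sketch below dispatches a prompt to a hypothetical specialised model via a crude classifier (Expert Router's actual contribution lies in prompt clustering, load balancing and parallel serving at scale):

```python
# Toy sketch of routing: classify an incoming request and send it to a
# specialised model. The experts and the classifier here are hypothetical.
from typing import Callable

EXPERTS: dict[str, Callable[[str], str]] = {
    "code": lambda p: f"[code model answers] {p}",
    "math": lambda p: f"[math model answers] {p}",
    "chat": lambda p: f"[general model answers] {p}",
}

def classify(prompt: str) -> str:
    """Stand-in for the routing model (e.g. a small classifier over prompt clusters)."""
    if "def " in prompt or "import " in prompt:
        return "code"
    if any(tok in prompt for tok in ("integral", "prove", "=")):
        return "math"
    return "chat"

def route(prompt: str) -> str:
    return EXPERTS[classify(prompt)](prompt)

print(route("Compute the integral of x**2"))
```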
Commentary:
Thought-provoking insights on LLMs:
This paper challenges the assumption that stronger reward models always improve the performance of LLMs by showing that moderately accurate reward models can lead to better outcomes, suggesting new strategies for selecting reward models for optimal alignment with human expectations.
A proof of concept shows that even simple "null models" (which output irrelevant or repetitive content) can manipulate top LLM benchmarks such as AlpacaEval 2.0 and Arena-Hard-Auto to achieve high scores. The authors highlight the need for better anti-cheating measures to maintain the reliability of AI evaluations.
Cognition and capability evaluation:
Works examining the potential for misuse of AI, the function modelling capabilities of LLMs, and the challenges LLMs face in rational reasoning compared to human thought processes:
Anthropic's latest paper introduces a set of evaluations to assess AI models' potential capabilities for sabotage, focusing on human decision manipulation, code sabotage, sandbagging, and undermining oversight.
This paper presents a Bayesian evaluation framework to assess the function modelling capabilities of LLMs, finding that while these models struggle with pattern recognition in raw data, they excel at using domain knowledge to accurately approximate functions.
A new study explores whether LLMs exhibit rational reasoning (through cognitive psychology tasks) and finds that they often make inconsistent and unpredictable errors that differ a lot from typical human errors.
This study explores 10 LLMs on the analytical writing assessment of the Graduate Record Examination (GRE), with their essays scored both by an automated scoring system (e-rater) and by human raters.
Benchmarking:
Several new benchmarks, with names that are not particularly funny this month, but targeting key areas in bias, adaptability, reasoning, and efficiency:
More about reward models: RM-Bench is a benchmark to evaluate reward models with sensitivity to subtle content differences and resistance to style biases. The SAGED pipeline and VHELM focus on addressing bias in LLMs, with VHELM also evaluating visual perception and multilingualism. The Easy2Hard-Bench looks at how well AI can handle tasks of varying difficulty, to improve adaptability.
For reasoning and procedural skills, Michelangelo examines the ability of models to work with complex, long contexts, while ProcBench tests the ability of LLMs to apply new rules and follow multi-step instructions. KOR-Bench also explores reasoning. In an interesting twist on the future use of agentic LLMs for research, CORE-Bench (not to be confused with KOR-Bench above) focuses on reproducibility, i.e., whether an agent can reproduce the results of a computational experimental study (e.g., an ML paper) using the provided code and data.
AgentHarm assesses AI's susceptibility to harmful tasks, shedding light on security issues. GSM-Symbolic reveals the limitations of LLMs in genuine mathematical reasoning, showing performance drops with small changes in question details.
MixEval-X and MLPerf Power provide standardised evaluations of AI performance: MixEval-X covers multiple task formats in real-world scenarios, and MLPerf Power measures energy efficiency, highlighting sustainable AI operations. Finally, NaturalBench is designed for vision-language models and includes “natural” adversarial examples, namely 10,000 human-verified visual question-answering samples.
Upcoming events:
The First Workshop of Evaluation of Multi-Modal Generation at COLING 2025 aims to advance research efforts in evaluation methods for multimodal AI by addressing underexplored areas such as the coherence, relevance, and contribution of modalities, the integration of information and modalities, faithfulness, and fairness.
The Datasets and Evaluators of AI Safety workshop aims to tackle AI safety issues by focusing on datasets and benchmarks to evaluate LLMs, addressing concerns like fairness, robustness, and reliability in order to prevent societal harms and misuses of AI technologies.
Positions:
Postdoc position on Uncertainty in Machine Learning at the University of Tartu.
Postdoc position on AI Evaluation at the Technical University of Valencia.
Contributors to this month’s digest: Nando Martínez-Plumed, Jose H. Orallo, Wout Schellaert, Joe Castellano, Pat Kyllonen
News to share? Feel free to reach out to wschell@vrain.upv.es.
Getting the digest: Once a month if you join.