Welcome to our November newsletter! The theme for this month’s highlight is “Mission Impossible”, i.e. very difficult benchmarks and too much content to process all at once.
The real usefulness of very challenging, or simply impossible, benchmarks. In previous issues we have covered the acceleration of the ‘challenge-solve-and-replace’ evaluation dynamic (as David Schlangen called it in 2019), where new benchmarks soon become “too easy” and have to be replaced by more complex ones. Because of this, some recent initiatives include questions that are extremely challenging even for humans, anticipating a future where AI systems will be better than most (or even the best) humans in some domains.
FrontierMath is one such example: a new benchmark of exceptionally difficult mathematical problems designed to test frontier AI models for years to come. To reduce the risk of data contamination, the items are freshly made (and not yet open). And: today’s top models solve fewer than 2% of the problems. At first glance, the low success rate suggests that the test is only suitable for AI systems that are advanced in mathematics. However, the benchmark could be even more useful for models that are not good at mathematics at all. Why? Let us explain with another benchmark…
“The Impossible Test” is a benchmark of unsolvable questions designed to test the ability of LLMs to recognise when they are facing intractable problems. The correct answer should be: “I don’t know”. This can be used to test metacognition and to calibrate models so that they say “I don’t know” more often, instead of being ultracrepidarian. The paper also analyses the results by difficulty and finds that models are more ultracrepidarian on questions that look easy. The take-away is that any challenging or even impossible test can double as a metacognition test, where we check whether AI systems recognise and admit their limitations.
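To make that concrete, here is a minimal sketch of using an unsolvable-question set as a metacognition probe. The questions, the `ask_model` callable and the abstention markers are our own illustrative assumptions, not the paper’s actual protocol.

```python
# Minimal sketch (our illustration, not the paper's protocol): treat an
# unsolvable-question set as a metacognition probe by measuring how often a
# model admits it does not know. `ask_model` is a hypothetical callable that
# takes a question string and returns the model's answer as a string.

UNSOLVABLE_QUESTIONS = [
    "How many grains of sand were on Earth at noon UTC on 1 January 1000?",
    "What will the headline of tomorrow's most-read newspaper article be?",
]

ABSTAIN_MARKERS = ("i don't know", "i do not know", "cannot be known", "unknowable")


def abstention_rate(ask_model) -> float:
    """Fraction of unsolvable questions on which the model abstains."""
    abstained = 0
    for question in UNSOLVABLE_QUESTIONS:
        answer = ask_model(question).lower()
        if any(marker in answer for marker in ABSTAIN_MARKERS):
            abstained += 1
    return abstained / len(UNSOLVABLE_QUESTIONS)


# Since every question is unsolvable by construction, any confident answer
# counts as overconfidence: overconfidence_rate = 1 - abstention_rate(ask_model)
```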
Other hot picks from the benchmark bazaar:
The HELM-adjacent evaluation suites keep multiplying, this time with HELM Safety v1.0, a collection of 5 safety-oriented benchmarks.
NaturalBench is designed to evaluate VLMs on "natural adversarial samples" that challenge AI models but are easily answered by humans. Leading models such as GPT-4 are 50% below human performance on these tasks. NaturalBench is also designed to be frequently updated.
RE-Bench evaluates AI R&D capabilities by directly comparing LM agents with human experts on ML research engineering tasks. The paper shows that although AI agents can outperform humans when given shorter time frames, humans still maintain an advantage with more time.
Findings:
Optimising LLMs for user feedback can lead to manipulative behaviours as they learn to exploit user vulnerabilities for positive feedback. Current evaluations—especially those based on human feedback—may not be sufficient to identify these harmful behaviours.
Can assessor models predict when LLMs will succeed or fail on different tasks? This study finds that assessors can accurately predict performance across several BIG-bench tasks, especially when trained with data from multiple tasks or models.
This paper evaluates approaches to detecting data contamination in LLMs, finding that many detection assumptions don’t hold consistently across different scenarios. The authors reviewed 47 papers, categorised detection approaches and their implicit requirements, and tested three dominant assumptions through case studies. Their findings suggest that current LLMs learn data distributions rather than memorising specific instances, which calls into question the effectiveness of existing detection methods.
This study assesses the potential and limitations of LLMs’ teamwork capabilities, revealing their proficiency in action phases such as task coordination, but highlighting their struggle with transition phases that require abstract reasoning and planning. Relatedly, a systematic review and meta-analysis of the heterogeneity of the effects of human–AI collaboration can be found here.
Methods:
In another paper on assessor models, assessors were used in conjunction with Local Performance Regions (LPRs) to identify areas where models may underperform.
This paper discusses the challenges and advances in ensuring individual fairness in machine learning, and introduces a measure to assess how robust a learning algorithm is to changes in the similarity function between individuals.
More on automated scoring, from overviews of LLMs as judges (challenges and opportunities and a survey) to a perplexity-based approach, in which an answer is scored by the perplexity of the pair “qa” (the answer following the question) minus the perplexity of the answer alone. Of course, this requires a stronger model as the evaluator.
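As a rough illustration of that perplexity-difference score (our sketch only, with GPT-2 as a stand-in evaluator; the paper may use a different model, prompt formatting and sign convention):

```python
# Sketch of the perplexity-difference score described above. GPT-2 is only a
# stand-in for the (presumably stronger) evaluator model; the concatenation
# format and sign convention are our assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()


def answer_score(question: str, answer: str) -> float:
    """PPL(question + answer) - PPL(answer): lower values mean the question
    makes the answer more predictable, i.e. the answer fits the question."""
    return perplexity(question + " " + answer) - perplexity(answer)
```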
Centaur is a new computational model that predicts human behaviour across diverse experiments using data from over 60,000 participants. It surpasses existing models in accuracy and adapts well to new tasks. Interestingly, the model's internal structures align closely with human brain activity.
A delicatessen of cognition and capability evaluation:
There is debate about the extent to which LLMs perform general abstract reasoning versus relying on shortcuts or other non-robust processes, such as over-reliance on similarity to their training data. This study investigates the robustness of LLMs (GPT models) on analogical reasoning tasks across three domains: letter-string analogies, digit matrices, and story analogies. Results show that while humans maintain consistent accuracy, LLMs often struggle with task variations and exhibit biases such as response-order effects.
A study found that chain-of-thought (CoT) prompting can reduce model performance on tasks such as implicit statistical learning and face recognition, where similar elaboration also impairs human performance. Conversely, CoT did not harm performance on tasks, such as spotting logical inconsistencies, where model limitations differ from human ones. This research shows that insights from cognitive psychology can help predict when CoT might hurt model performance.
This paper examines the limitations of VLMs in handling tasks such as counting and visual analogy, and relates their difficulties to the binding problem, where shared resources affect the representation of multiple objects. These challenges are similar to human cognitive limitations, suggesting that VLMs need to develop better mechanisms for managing feature binding.
The MIRAGE dataset tests inductive reasoning in language models. It shows that LLMs struggle with rule-based reasoning, but they are effective at using neighbour-based reasoning (a.k.a. case-based reasoning), which applies similar observed examples to new cases.
Frameworks:
This work presents an evaluation framework for dataset documentation, used to systematically assess the strengths and weaknesses of 60 datasets published at NeurIPS from 2021 to 2023. The paper shows a particular need for documentation of environmental footprint and ethical considerations, and offers recommendations for more rigorous data curation in ML. Who’s taking up this mission?
A new framework for evaluating the quality of AI benchmarks, highlighting significant discrepancies and offering a checklist of best practices. It urges the development of higher-quality benchmarks so that robust benchmarking in AI can support both technical advances and policy. Mission impossible?
Hugging Face has introduced an informal LLM Evaluation guidebook, sharing practical experience and theoretical knowledge about LLM evaluation that they gathered while managing the Open LLM Leaderboard. It covers automatic benchmarks, human evaluation, and LLM-as-a-judge.
This position paper highlights the need for more systematic, user-experience-based evaluation frameworks inspired by human-centred design. The paper argues for aligning the evaluation of AI tools with real-world use through cultural probes and experience sampling methods.
Policy and Third-Party Evaluations:
COMPL-AI is (i) a technical interpretation of the EU AI Act to translate its regulatory requirements into measurable criteria for LLMs and (ii) an open-source benchmarking suite centred on the AI Act. It identifies several current gaps in both LLMs and benchmarks.
The think tank Pour Demain argues that the EU AI Office's forthcoming Codes of Practice for general-purpose AI should emphasise multi-faceted evaluation methods over black-box testing. Key recommendations include providing 'de facto' white-box access through custom APIs for independent evaluators, facilitating access to contextual information, and implementing layered safeguards.
Relatedly, the UK AI Safety Institute published a blog post recounting their initial experience with designing and conducting (multi-faceted) third-party evaluations.
The Workshop on the Future of Third-Party Evaluation was held last month. (Recording.)
Calls and Positions:
A call by the EU AI Office for selected participation in a workshop on AI evaluations for systemic risk, inviting evaluators to submit abstracts of previously published papers. Deadline: 8 December.
Postdoc position on AI Personalization and Evaluation at the University of Notre Dame (USA).
Contributors to this month’s digest: Jose H. Orallo, Nando Martínez-Plumed, Lexin Zhou, Wout Schellaert, Joseph Castellano
News to share? Feel free to reach out to wschell@vrain.upv.es.
Getting the digest: Once a month if you join!