2023 October “AI Evaluation” Digest
The JMLR family of journals launches a new venue: the Journal of Data-centric Machine Learning Research (DMLR), with an emphasis on, among other data-related topics, benchmark tooling & methods, data quality evaluation, metrics, and the methodology of empirical evaluations (blogpost).
DeepMind publishes “Evaluating social and ethical risks from generative AI” (blog, arxiv).
China announces its own Global AI Governance Initiative at the 3rd BRI Forum, two weeks before the UK's AI Safety Summit. The document contains new talking points on AI safety, model evals, and national sovereignty (Twitter thread in English).
The European Lighthouse on Secure and Safe AI (ELSA) announces the ELSA Benchmarks platform.
Language, Common Sense, and the Winograd Schema Challenge (link), or why the Winograd Schema Challenge was, and is, not a sufficient test of intelligence (essentially an elaboration of this earlier paper).
A good piece on LLM evaluation and reasoning abilities (link), based on specialised and relatively new benchmarks, with a related Twitter thread by the author (link). It ties in well with Ida Momennejad et al.’s cognitive-science-inspired evaluation of planning capabilities and cognitive maps in language models (arxiv), and with Cohn’s spatial reasoning evaluation (arxiv).
Anthropic’s ‘Challenges in Evaluating AI Systems’ (link) provides useful insights into the evaluation difficulties perceived in industry labs, including challenges with common benchmarks such as BIG-Bench and HELM.
According to the popular State of AI Report (link, p. 32), AI evaluation for LLMs is apparently so unreliable that people just follow the “vibes” of anecdotal evaluation.
A few other technical papers:
And we close with a bit of humour: Pretraining on the training set is all you need!
Contributors to this month’s digest: Jose H. Orallo, Wout Schellaert
How to contribute: Feel free to reach out to wschell@vrain.upv.es if you want to get involved, or if you have news to share that you are not comfortable posting as a standalone post.