2023 August “AI Evaluation” Digest
AI evaluation is getting more attention from psychology and other fields, as shown by this comment paper in Nature Reviews Psychology: “Baby steps in evaluating the capacities of large language models”.
Anthropic, Google, Microsoft and OpenAI have established the Frontier Model Forum, an industry body committed to the safe and responsible development of advanced AI systems. One objective is to enable independent and standardised evaluation of capabilities and safety. There will be a strong focus initially on developing and sharing a public library of technical evaluations and benchmarks for frontier AI models. (blog)
This paper (arxiv) on statistically inferring skills in language models combines factor analysis, scaling laws and partition graphs.
Evaluating human-AI systems will become increasingly common. This paper in AI Magazine discusses the “minimum necessary rigor” in empirical human-AI evaluation. (paper)
MosaicML introduces a new multi-metric, multi-benchmark LLM evaluation leaderboard. (webpage)
Benchmarks are getting solved faster, according to a blog post by Contextual AI.
A survey, taxonomy, and discussion on the relevance of instance-level difficulty (ACM).
A selection of evaluation-related work at ICML 2023:
RankMe: Assessing the Downstream Performance of Pretrained Self-Supervised Representations by Their Rank (PMLR)
In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation (arxiv)
Distributional Offline Policy Evaluation with Predictive Error Guarantees (arxiv)
How many perturbations break this model? Evaluating robustness beyond adversarial accuracy (arxiv)
A selection of evaluation-related work at UAI 2023:
Composing Efficient, Robust Tests for Policy Selection (openreview)
TCE: A Test-Based Approach to Measuring Calibration Error (openreview)
Validation of Composite Systems by Discrepancy Propagation (openreview)
Contributors to this month’s digest: Jose Hernandez-Orallo, Wout Schellaert, Lexin Zhou.
How to contribute: Feel free to reach out to wschell@vrain.upv.es if you want to get involved, or if you have news to share that you are not comfortable posting as a standalone post.