What's this about?
In a recent blogpost titled “We Need a Science of Evals”, the AI alignment-focused research organisation Apollo Research advocates the establishment of a "science of evals". While we applaud the initiative, and precisely because we stand behind the overall message, we have some comments to add on culture, terminology, and reinventing the wheel.
About the post
The Apollo blogpost encourages scientific rigour and reliability in the evaluation of the “maximal capabilities” of AI models, highlighting the importance of such evaluations for high-stakes decisions related to AI safety and policy making. According to the post, current evals practices lack the maturity and standardisation found in other scientific fields, making them unsuitable for such decisions.
The motivational examples they provide are both quite simple and limited to language models, e.g. model “evals” often being affected by minor changes in test prompts, resulting in unreliable assessments of the true (“maximal”) capabilities of these models. The post cites examples from very recent research that illustrate the variability in model performance under different prompting techniques, and points to the need for a more sophisticated approach to accurately measure the capabilities of AI models.
By drawing parallels with established testing regimes in industries such as aviation (a discipline that has served as an analogy for AI for decades, see, e.g., Ford and Hayes, 1998), and highlighting the ongoing challenges in evaluating language models, Apollo Research emphasises the need for a systematic approach to reducing uncertainties around AI capabilities. The post also calls for a collaborative effort between academia, industry and regulators to develop a robust framework for AI evaluation, aiming to move the field from an art to a science, and proposes initial steps and open research questions towards this goal.
Is it about AI “evals” or AI evaluation more generally?
This is a confusing element of the post, but it is not their fault; it is a general trend. In the past two years we have seen the introduction of completely new terminology in AI evaluation, such as “AI evals”, “emergent capabilities”, “capability elicitation” and “dangerous capabilities”. It is important to clarify that “evals” here means the estimation of an “upper bound of capabilities”, or “maximal capabilities”, of an AI system, motivated by an emphasis on safety (the worst-case situation). The rationale is that if a system can become very smart (seen as a “best-case” scenario in capabilities) then it may be dangerous as a result (“worst-case” in terms of harm). This is very different from how the term capability is used in other sciences concerned with the measurement of (natural or artificial) cognition, and in the end it is quite difficult to grasp for people who are not in the “evals” or “red-teaming” communities. This “more is worse” perspective is the opposite of traditional evaluation in machine learning, artificial intelligence and other disciplines, where more is better.
Actually, a capability in other sciences is not a best-case or worst-case indicator, not even an average-case indicator over a distribution; it is a (latent) property of a system that has predictive and explanatory value about the systematic behaviour of that system. In lay terms, if an athlete can pole-vault 4m with a given probability (let’s say 50%, so that after three attempts the probability of passing to the next round is 87.5%), we would say that 4m is her capability, while the worst-case value would clearly be 0m (whenever she falls or aborts) and the best-case value is her personal record (e.g., 4.56m). The average case depends on the distribution of jumps the athlete has made during a competition or her whole career. So the first thing to do when talking about capabilities is to clarify the term. Then, perhaps by “emergent capabilities” they mean “potential capabilities”, or by “capability elicitation” they refer to options such as giving the athlete new shoes, a new pole or training her in a new technique. And by “dangerous capabilities”, maybe they mean that the athlete can rob a bank where she has to jump through a window that is 4m high (because if the window is at 4.56m, her personal record, the police will almost certainly arrive before she can make it). Depending on the attack, an “upper bound of capabilities” may be the wrong thing to estimate.
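To make the distinction concrete, here is a minimal sketch in Python using the hypothetical numbers from the athlete example above (the probabilities and heights are illustrative, not measured data), showing how a capability threshold differs from worst-case, best-case and average-case summaries:

```python
# Minimal sketch with the hypothetical pole-vault numbers from the text.
# A "capability" here is a height the athlete clears with a given probability,
# not her best or worst single outcome.

p_clear_4m = 0.5          # assumed probability of clearing 4m on one attempt
attempts = 3              # attempts allowed per height in a competition

# Probability of clearing 4m at least once in three attempts:
p_pass_round = 1 - (1 - p_clear_4m) ** attempts
print(p_pass_round)       # 0.875, i.e. 87.5%

worst_case = 0.0          # any failed or aborted jump
best_case = 4.56          # her personal record (a single exceptional outcome)
capability = 4.0          # the height she clears reliably (50% per attempt)
```

The point of the sketch is that the capability (4m) is defined by a reliability criterion over repeated behaviour, while the best- and worst-case values are single outcomes with little predictive value.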
What's good and what’s missing?
More science & collaboration: Emphasising the importance of developing a 'science of evals' to ensure that evaluations of AI models are rigorous, replicable and scientifically sound, the Apollo Research initiative addresses a critical need for greater reliability in AI evaluations. This initiative promises to make AI development safer through evidence-based scaling, to improve policy-making through scientifically robust evaluations, and to foster collaboration across sectors to standardise and refine AI evaluation methodologies.
While the call to establish a "science of evals" (not the same as “a science of evaluation”) is laudable in its intention to improve the reliability and rigour of AI model “evals”, a critical oversight is the apparent neglect of foundational work in measurement theory that spans several decades. The focus on references primarily from recent years overlooks the wealth of knowledge and methods developed in psychometrics, quantitative psychology and the broader social sciences, which have rigorously addressed similar challenges in the measurement, assessment and validation of complex constructs. It also misses important work on AI evaluation performed in AI itself, especially machine learning, and in areas closely related to their goals, such as software testing and cybersecurity.
Measurement theory, with its origins in the early 20th century, provides a structured framework for ensuring the validity, reliability and accuracy of measurement. Classic works such as Stevens (1946) or Krantz et al. (1971) provide insights into the complexities of measuring abstract constructs that could be highly relevant to the evaluation of AI systems. In addition, the field of psychometrics has developed sophisticated methods for test construction, validation and interpretation, as seen in classic texts such as Lord and Novick (1968) and contemporary applications in computerised adaptive testing (Meijer & Nering, 1999) or Item Response Theory (Embretson & Reise, 2013). These resources highlight essential considerations for the design of reliable and valid measures, many of which could inform the development of a more scientific approach to AI evaluation (Hernández-Orallo, 2017), now that AI is becoming more cognitive and powerful. For instance, the two main research questions they identify in their note are: “Are we measuring the right quantity?”, elsewhere referred to as the “validity” of a measurement, with different notions such as construct, face, internal, external and ecological validity; and “Are our results trustworthy?”, elsewhere referred to as the “reliability” of a measurement, with several associated metrics of its own. The problem of sensitivity to different elicitation methods (e.g., prompts, chain-of-thought, etc.) relates to what developmental psychology and theories of learning recurrently discuss as Vygotsky’s zone of proximal development, or simply scaffolding. In general, a cognitive perspective on evaluation (Ivanova, 2023) can shed light on many of the questions raised in the post.
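As one illustration of what these fields already offer, here is a minimal sketch of a Rasch (one-parameter logistic) item response model, the simplest case of the Item Response Theory mentioned above. All ability and difficulty values are hypothetical and the code is not taken from the post; it is only meant to show how a latent parameter can predict systematic behaviour across items of varying difficulty, which is the sense of “capability” discussed earlier.

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) model: probability of a correct response as a function
    of a latent ability and an item difficulty on the same logit scale."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical items of increasing difficulty and a hypothetical latent ability.
difficulties = [-1.0, 0.0, 1.0, 2.0]
ability = 0.5

for b in difficulties:
    print(f"difficulty={b:+.1f}  P(correct)={p_correct(ability, b):.2f}")

# The fitted ability parameter is the latent property with predictive value:
# it tells us which items the system should pass reliably, rather than
# reporting a single best-case or worst-case score.
```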
Then, the gaps in the coverage of the machine learning evaluation literature are also notable, especially when discussing statistical analysis and confidence (and their abuse, Drummond and Japkowicz, 2010) or many other problems of traditional and modern AI (see, e.g., Mitchell, 2023). They ask for support from organisations such as NIST, but they do not mention the decades-long AI evaluation efforts at NIST (with prominent series of workshops trying to create a science of AI measurement, e.g., Meystel, 2000; Messina et al., 2001), other organisations and many academics highlighting the issues of AI evaluation (e.g., Marcus et al., 2016), with relevant discussions around the AI effect (is it the developers or the system showing the capability? McCorduck, 2004), Moravec’s paradox (are the easy things hard, and vice versa? Moravec, 1988) or the debate about what generality is, going from McCarthy (1987) to its reinterpretation for foundation models (Schellaert et al., 2023). The problem of task representativeness has been discussed many times (Cohen and Howe, 1988), as has the distinction between measuring performance and intelligence/capabilities (Japkowicz and Shah, 2011; Meystel, 2000; Flach, 2019; Burden et al., 2023). And if best-case or worst-case situations are what matters, then the AI literature is full of discussions about training to the test, benchmark overfitting and the Clever Hans phenomenon. The fields of software testing and, more generally, safety engineering should be fully examined when the angle of “evals” is more about testing than about evaluation proper. Indeed, aviation is just a particular case of safety engineering where testing and safety have grown together, but this happens in all engineering disciplines, including computer and software engineering.
Let’s do just science
In a nutshell, by focusing almost exclusively on recent AI and machine learning literature on emergent, dangerous or maximal capabilities of language models, the discourse on creating a science of evals risks reinventing the wheel, narrowing the field of AI evaluation, or both. It misses opportunities to learn from and apply established principles from measurement theory and from previous efforts in AI itself that could accelerate the development of rigorous, scientifically sound evaluation methods for AI models.
References
Burden, J., Voudouris, K., Burnell, R., Rutar, D., Cheke, L., & Hernández-Orallo, J. (2023). Inferring Capabilities from Task Performance with Bayesian Triangulation. arXiv preprint arXiv:2309.11975.
Cohen, P. R. and Howe, A. E. (1988). How evaluation guides AI research: The message still counts more than the medium. AI Magazine, 9(4):35.
Drummond, C. and Japkowicz, N. (2010). Warning: Statistical benchmarking is addictive. Kicking the habit in machine learning. Journal of Experimental and Theoretical Artificial Intelligence, 22(1):67–80.
Embretson, S. E., & Reise, S. P. (2013). Item response theory. Psychology Press.
Flach, P. (2019, July). Performance evaluation in machine learning: the good, the bad, the ugly, and the way forward. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, No. 01, pp. 9808-9814).
Ford, K. M. and Hayes, P. J. (1998). On computational wings: Rethinking the goals of artificial intelligence. Scientific American Presents, 9(Winter):78–83.
Hernández-Orallo, J. (2017). The measure of all minds: evaluating natural and artificial intelligence. Cambridge University Press.
Ivanova, A. (2023). Running cognitive evaluations on large language models: The do's and the don'ts.
Japkowicz, N. and Shah, M. (2011). Evaluating learning algorithms. Cambridge University Press.
Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. (1971-1990). Foundations of Measurement (Vols. I-III). Academic Press.
Lord, F. M., & Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Addison-Wesley.
Marcus, G., Rossi, F., and Veloso, M., eds. (2016). Beyond the Turing test (special issue). AI Magazine, 37(1):3–101.
McCarthy, J. (1987). Generality in artificial intelligence. Communications of the ACM, 30(12):1030–1035.
McCorduck, P. (2004). Machines who think. A K Peters/CRC Press.
Meijer, R. R., & Nering, M. L. (1999). Computerized adaptive testing: Overview and introduction. Applied psychological measurement, 23(3), 187-194.
Meystel, A. (2000). PerMIS 2000 white paper: Measuring performance and intelligence of systems with autonomy. In Meystel, A. M. and Messina, E. R., editors, Measuring the performance and intelligence of systems: Proceedings of the 2000 PerMIS Workshop, pages 1–34. NIST Special Publication 970. NIST.
Messina, E., Meystel, A., and Reeker, L. (2001). PerMIS 2001, white paper. In Meystel, A. M. and Messina, E. R., editors, Measuring the performance and intelligence of systems: Proceedings of the 2001 PerMIS Workshop, pages 3–15. NIST Special Publication 982. NIST.
Mitchell, M. (2023). How do we know how smart AI systems are?. Science, 381(6654), eadj5957.
Moravec, H. P. (1988). Mind children: The future of robot and human intelligence. Harvard University Press.
Schellaert, W., Martínez-Plumed, F., Vold, K., Burden, J., Casares, P. A., Loe, B. S., ... & Hernández-Orallo, J. (2023). Your Prompt is My Command: On Assessing the Human-Centred Generality of Multimodal Models. Journal of Artificial Intelligence Research, 77, 377-394.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677-680. DOI: 10.1126/science.103.2684.677.