Is the Definition of AGI a Percentage?
Zachary Tidler, Marko Tešić, Lorenzo Pacchiardi, John Burden, Lexin Zhou, Manuel Cebrián, Fernando Martínez-Plumed, Jose Hernandez-Orallo
A preprint making the rounds in October 2025 (Hendrycks et al. 2025) aims to end the long-running indefiniteness of the term Artificial General Intelligence (AGI), introduced by Mark Gubrud in 1997 and later popularised by Shane Legg and Ben Goertzel. Originally, the G emphasised “the generality that AI systems [did not] yet have”, according to Legg (Heaven, 2020). The ambition has older roots: John McCarthy’s Turing Award lecture, back in 1971, was about generality. Since the advent of GPT-3 in this decade, however, AI has shown genuine breadth: today’s large language models (LLMs) are general-purpose systems. It is not generality that they lack, but other things.
Over time, the concept of AGI has been adapting to this shift in AI. The term is now used at the will or whim of corporations and politicians alike: from justifying huge rounds of funding to scaremongering about the risk that China reaches “it” before the US. For some companies, such as OpenAI, the definition of AGI is even baked into their charter: “If a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be ‘a better-than-even chance of success in the next two years’”. There’s a lot at stake.
Definitions matter. A judge might someday rely on one such definition to decide whether OpenAI’s charter should be triggered. Even if some definitions are not perfect, they allow us to talk about the same thing. The informal definition Hendrycks et al. give is not very different from many versions that have circulated in the past few years: “AGI is an AI that can match or exceed the cognitive versatility and proficiency of a well-educated adult”. This is closer to “human-level machine intelligence”, another popular term, but it only refers to generality implicitly, through the word “versatility”. The paper argues that such an informal definition can still be interpreted in many different ways. Hence, Hendrycks et al. attempt to formalise AGI as a score from 0 to 100, with 100 representing full AGI, to be determined objectively from the results of a set of tests. This is in line with the goals of AI evaluation as a discipline. It looks like a step forward!
Unfortunately, this is where the excitement ends. It is not because many of us think that defining AGI with human intelligence as a special reference point may be misleading and shortsighted (systems with capability profiles that differ from humans can be even more transformative, and more dangerous, than human-like AI). In this blog post we do not judge the goal of the paper or the very notion of AGI. Rather, we point out various flaws in the methodology, one by one, and explain why recognising them is important.
CHC is not an appropriate framework for AI
The Cattell-Horn-Carroll (CHC) model is likely the most widely accepted human cognitive-ability theory. It is the result of decades of efforts by differential psychologists and psychometricians to hierarchically reconcile Spearman’s general factor “g” with multifactor views of human intelligence variability. One of the authors of the paper, Kevin McGrew, has been instrumental in the development of the CHC model over the years; his name on the paper should be a guarantee that the model is properly understood.
However, CHC brings a methodological apparatus that undermines its use for evaluating machine intelligence: it only reflects dimensions of ability that show variance in the reference human population. Factor analysis, which underpins CHC, identifies patterns of covariation across individuals. When everyone in your sample performs similarly on tests of some ability, that ability becomes invisible. For example, if you test jellyfish on quantitative reasoning, you are unlikely to devise a test sensitive enough to distinguish the performance of individual jellyfish; by factor-analytic logic, this would suggest that no such ability exists. When we test humans, by contrast, we obviously see a salient dimension of interindividual differences in quantitative reasoning. Who knows what abilities could emerge (or vanish) if the data source were something other than human interindividual differences. Treating CHC’s broad factors as the right axes for machine cognition assumes an equivalence that has not yet been demonstrated (not even for LLM populations; Burnell et al., 2023). At minimum, the burden is on the adopters to show that CHC’s structure is predictive and explanatory for machine systems, rather than merely familiar.
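To make this concrete, here is a minimal simulation, using invented tests and abilities rather than any real CHC battery, of how an ability that everyone in the sample possesses to the same degree simply disappears from a factor-analytic model (a sketch assuming numpy and scikit-learn are available):

```python
# A minimal simulation (invented tests and abilities, not CHC itself) of why
# factor analysis only "sees" abilities that vary across the sampled population.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 2000  # simulated adult test-takers

# Two latent abilities that differ across people...
verbal = rng.normal(size=n)
quant = rng.normal(size=n)
# ...and one that essentially every adult has to the same degree
# (think object permanence): constant across the sample.
universal = np.full(n, 1.0)

# Six observed test scores, each tapping one latent ability plus noise.
scores = np.column_stack([
    verbal + rng.normal(scale=0.5, size=n),
    verbal + rng.normal(scale=0.5, size=n),
    quant + rng.normal(scale=0.5, size=n),
    quant + rng.normal(scale=0.5, size=n),
    universal + rng.normal(scale=0.05, size=n),  # almost no spread across people
    universal + rng.normal(scale=0.05, size=n),
])

fa = FactorAnalysis(n_components=3).fit(scores)
print(np.round(fa.components_, 2))
# Two factors emerge, loaded on the verbal and quantitative tests. The tests of
# the "universal" ability load on nothing: with no inter-individual variation,
# that ability is invisible to the factor model, however important it may be.
```

The same mechanism underlies the confounding discussed next: two abilities that are almost perfectly correlated in the sample would collapse onto a single factor.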
One common defence of using CHC is: since the paper defines AGI relative to humans, the taxonomy should also be human-based. This is fallacious for exactly the same reason as before: CHC is built from differences between humans. Two distinct capabilities that are strongly correlated in humans provide too little independent variance to be separated by factor analysis, so the model confounds them into a single factor. Others are not detectable because they are present in almost all adults: they neither predict nor explain results in human adult populations. Indeed, some of the core capabilities that develop in human children (e.g., object permanence) are not directly represented in CHC, either because almost all adult humans have them to similar levels or because they are highly correlated with other capabilities. CHC is therefore well suited to describing variation among typical adult humans, but much less so for children, neurodivergent people, other species, or AI systems. This matters because the CHC-inspired definition could award “100% AGI” to an AI system that still lacks core human capabilities, simply because those capabilities show little inter-individual variation among educated human adults.
CHC is not well represented by the benchmarks and testing conditions
CHC is hierarchical, with a general factor at the top and more specific factors further down. Hendrycks et al. select an intermediate level with 10 capabilities, but do not use the hierarchy to determine how scores are aggregated or what demands the benchmarks should pose. The paper simply stitches together several benchmarks per category, mixing versions of traditional psychometric tests (e.g., what they describe as “private sets” of Raven’s Progressive Matrices items) with a variety of AI benchmarks. We know there are many issues with AI benchmarks, but repurposing human psychometric tests for AI systems has repeatedly proven flawed (see, for example, a paper by one of the authors of ‘A Definition of AGI’: Sühr et al., 2025).
Human psychometric tests are extremely sensitive to administration conditions, even when used only with humans. A meta-analysis (Tatel et al., 2022) demonstrates that even slight changes (e.g., imposing a time limit) can alter what construct is being measured. There is little reason to expect validity to survive the leap from humans to LLMs, and the way these tests are applied here represents a radical change in administration conditions. For instance, the authors report that their private version of Raven’s Progressive Matrices, a famous test of abstract fluid reasoning, includes detailed verbal representations of the traditionally visual items; such verbalisations have been used in the past to devise simple AI systems that score almost perfectly on these tests (Hernández-Orallo et al., 2016). The opposite also happens: in some of the tests the AI systems are asked to “circle” the right answer, which may be a very unnatural response format for them. There is, therefore, ample reason to be skeptical about whether a given test is still measuring the CHC construct it was designed to measure.
Consequently, an LLM scoring 90% on a Raven’s Progressive Matrices test is not comparable to a human scoring 90%. By using human tests, we not only miss core capabilities that these tests do not measure; we also measure the remaining ones in formats designed around human administration. Any adaptation of test administration across groups has to be validated beforehand, to ensure it does not disadvantage the new group. This matters because AI systems could fall short of “100% AGI” despite already having all the capabilities of an educated human adult.
Arbitrary thresholds aggregated and framed as a misleading percentage
For each of the 10 dimensions and their associated benchmarks, the authors invoke thresholds that determine whether a system earns “points” for that dimension: 85% here, 70% there; surpass the threshold and a system earns one point, sometimes two, with some dimensions using multiple tiers. These thresholds seem to be “human-level” cutoffs (e.g., 85% being the average human performance on that test), yet the paper provides no details about the human samples used to establish them. We are not sure whether human results exist for each of these benchmarks, whether the reference populations are comparable, or whether some of the cutoffs are simply arbitrary. In psychometrics, a cutoff is meaningful only relative to a defined population and purpose; it can be norm-referenced (e.g., based on population percentiles) or criterion-referenced (based on an external performance standard). Without a defensible reference population across all tests, 85% (or any other value) risks being a round number with sharp consequences.
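To see why the reference population matters, here is a toy illustration with invented score distributions (not actual human data for any of the paper’s benchmarks): the same fixed 85% cutoff lands at very different percentiles depending on who was sampled, whereas a norm-referenced cutoff would be derived from the reference sample itself.

```python
# A toy illustration (invented distributions) of why a fixed cutoff such as 85%
# is only meaningful relative to a defined reference population.
import numpy as np

rng = np.random.default_rng(1)
cutoff = 0.85

# Two hypothetical reference samples for the same benchmark.
samples = {
    "university students": np.clip(rng.normal(0.80, 0.08, size=500), 0, 1),
    "crowd workers": np.clip(rng.normal(0.65, 0.15, size=500), 0, 1),
}

for name, scores in samples.items():
    percentile = 100 * (scores < cutoff).mean()
    print(f"{name}: a score of {cutoff:.0%} is at percentile {percentile:.0f}")

# A norm-referenced alternative derives the cutoff from the population itself,
# e.g. the median of a well-defined reference sample:
for name, scores in samples.items():
    print(f"{name}: median performance = {np.median(scores):.2f}")
```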
The appendix explains how to aggregate all these points, up to a maximum of 100. Because the score does not allow for compensation, a deficit in one dimension cannot be offset by strength in another, so it is quite hard for AI systems to reach 100: diminishing returns set in by design. We wonder how many points most educated adult humans would score; it is unlikely to be 100. And then, why is this framed as a percentage? Saying that 90 out of 100 is 90% is true by definition but conceptually misleading in this case. It is like claiming you are 90% of the way through assembling a rare philatelic collection because you have 90 of the 100 stamps, when the remaining ten are progressively harder to acquire than the first ninety. The authors mention “bottleneck capabilities”, but that lead is buried underneath the headline of an AGI percentage.
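The sketch below illustrates the general shape of such a threshold-and-sum construction. The dimensions, cutoffs and point values are hypothetical (we assume 10 dimensions worth 10 points each, totalling 100) and are not the paper’s actual rubric; the point is only to show why the resulting “percentage” is saturated and non-compensatory.

```python
# A minimal sketch of a threshold-and-sum scoring scheme of the kind described
# above. Dimensions, tiers and point values are invented for illustration.
from typing import Dict, List, Tuple

Tiers = List[Tuple[float, int]]  # (benchmark threshold, points awarded)

rubric: Dict[str, Tiers] = {
    "reading_writing":  [(0.70, 5), (0.85, 5)],
    "math":             [(0.70, 5), (0.85, 5)],
    "long_term_memory": [(0.85, 10)],
    # ... the remaining dimensions would follow the same pattern
}

def dimension_points(score: float, tiers: Tiers) -> int:
    # Saturated and non-compensatory: a near-miss (0.84 against an 0.85 cutoff)
    # earns nothing, and surplus here cannot offset a deficit elsewhere.
    return sum(points for threshold, points in tiers if score >= threshold)

def agi_score(results: Dict[str, float], rubric: Dict[str, Tiers],
              n_dimensions: int = 10) -> float:
    total = sum(dimension_points(results.get(dim, 0.0), tiers)
                for dim, tiers in rubric.items())
    return 100 * total / (10 * n_dimensions)  # read off as a "percentage of AGI"

results = {"reading_writing": 0.92, "math": 0.84, "long_term_memory": 0.10}
print(agi_score(results, rubric))  # 15.0 under this toy rubric
# The remaining points sit behind the hardest thresholds, so "X%" does not mean
# the system is X% of the way there.
```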
Using percentages is not only confusing for the general public; it leads Dan Hendrycks’ own “AI Frontiers” highlight to fall into its own trap and claim that 57% performance (as achieved by GPT-5) means that “we’re already halfway to AGI”. They add that “the rest of the way will mostly require business-as-usual research and engineering”! This matters because these scores create hugely miscalibrated expectations, both about what the thresholds actually represent for educated human adults and about how much time and effort progress towards AGI will require.
The scores do not align with general perception
If GPT-4 is really at 27% of AGI, then GPT-5 more than doubled that to reach 57%. Even granting that they evaluate an early version of GPT-4, there does not seem to be general agreement that this jump reflects the progress people actually observe between GPT-4 and GPT-5 on the intellectual tasks these systems can perform relative to an educated human adult.
Looking closely at some dimensions yields surprising results. Figure 1 reports speed at 3/10. Even if comparing LLM “speed” to human speed made sense (dedicated hardware can accelerate LLM inference by orders of magnitude), this seems low relative to everyday experience: models read long documents in seconds and generate poems almost immediately. How can they be considered slower than humans? Memory storage gets 0 points, which seems consistent with transformer technology operating with frozen weights. But is this the right way to evaluate AI models today, ignoring the many mechanisms that have been developed to give them better memory?
The paper refers to the familiar idea that some results come from imitations of an ability rather than the real thing, now relabelled “capability contortions”, but the concept is developed too briefly and informally to deserve further attention. Overall, this matters because if the results do not match common perceptions of popular AI systems, the credibility of the score is undermined.
Incomplete coverage or understanding of related work
It is standard practice these days for preprints to have insufficient coverage of related work, because no reviewer is there to point out the missing references. But this is not a standard research paper. The authors are not claiming any big innovation that would tempt them to omit closely related work so that the paper looks more novel. It is true that they do not claim their approach is better than anything else, so there is no obvious need to cite alternatives. However, the implicit goal seems to be to convene a large and diverse group of authors and to set a standard metric for measuring progress towards AGI. The incomplete coverage of related work is therefore likely due to other reasons, such as the rush to get the paper out quickly. Perhaps there were competing definitions of AGI in the making from other big players in the “race for AGI”.
Some foundational work is cited in unconventional ways. For instance, the paper says that “Turing (1950) argues that the Turing Test can indicate general ability”, yet Turing’s paper does not explicitly discuss “general ability” or even “general intelligence” in those terms. For a paper arguing that precise definitions are needed, the use of terms is fuzzy, and occasionally vacuous. For instance, the authors write that “drawing together all the capabilities detailed in our framework, we can think of general intelligence as a cognitive “engine” that transforms inputs into outputs”. Many cognitive engines transform inputs into outputs, yet we would not consider them generally intelligent.
Hendrycks et al. mention some “levels of AGI” papers, and they even cover “related” definitions such as “pandemic AI”, but the paper does not mention the efforts to characterise general-purpose AI (GPAI), especially those initiated in 2024 by the European Commission through its Joint Research Centre and the EU AI Office, and published earlier in October 2025 as a “Collection of External Scientific Studies on General-Purpose AI Models under the EU AI Act” (European Commission JRC, 2025). These follow different approaches: compute-based (FLOPS) characterisations, rubric-based scales for a catalogue of capabilities, and populational factors extracted from benchmarks. The question of how several capabilities should be balanced to decide that a system has enough capability and generality (in their words, “cognitive versatility and proficiency”) has also been analysed before, including a warning, in October 2024, that an additive, saturated score built on thresholds was not the best way forward (Hernández-Orallo, 2024). Another related approach is the OECD’s AI Capability Indicators, an expert-based categorisation of AI progress along several dimensions, published in June 2025 (OECD, 2025). Finally, the 18 dimensions and scales presented in March 2025 by Zhou et al. use a taxonomy of cognitive capabilities previously applied to the analysis of AI and the future of work (Tolan et al., 2021), alongside various non-cognitive dimensions (including knowledge), to define scales that are commensurate. This methodology was used in a paper to characterise GPAI (Pacchiardi et al., 2025), leaving policy-makers the choice of what thresholds to use to define GPAI for regulatory purposes, without constraining them to a single monolithic percentage scale. This matters because if papers that try to build consensus on definitions do not build on or compare against previous work, they risk alienating part of the research community and reinventing a wheel of poorer quality than the previous one.
The risks of adopting this score
The Hendrycks et al. paper represents an ambitious attempt to standardise AGI as a score, but its methodological flaws undermine that goal. By adopting CHC, a framework designed to model interindividual differences in human cognitive abilities, the paper inherits blind spots that make it poorly suited to evaluating machine intelligence. The resulting percentage score obscures critical capability gaps, misleads about progress (claiming we are “halfway to AGI”), and overlooks substantial prior work on the characterisation of general-purpose AI. While we appreciate the desire for clear metrics in an area plagued by definitional confusion, a flawed standard may be worse than no standard at all.
The score has only been calculated for two AI systems so far. In fact, the paper only evaluates OpenAI’s models, perhaps because, if another company’s model were evaluated and scored above GPT-5, OpenAI might have to consider applying its charter and assisting the competition in creating AGI. But it may be the case that nobody is taking this AGI definition seriously. Or, even worse, some policy-makers may misinterpret a value such as 57% AGI for a commercial AI system like GPT-5 as evidence that AI poses no immediate risk, ignoring the possibility that other, less human-like profiles of AI could soon become very unsafe. This is likely the opposite of what the authors want to achieve.
What matters most is not that this particular definition is imperfect (all definitions are), but that it risks becoming influential despite these problems. When corporate charters, regulatory decisions, and public understanding hinge on AGI thresholds, we cannot afford metrics that conflate “57% of human-like performance on selected tests” with “halfway to AGI.” The AI evaluation community has developed more nuanced approaches that capture capability profiles rather than flattening them into single scores. If we are to define AGI in ways that inform policy and guide development responsibly, we must build on this work rather than repeat its mistakes under the banner of consensus. The stakes are too high for shortcuts.
References
Burnell, R., Hao, H., Conway, A. R., & Hernández-Orallo, J. (2023). Revealing the structure of language model capabilities. arXiv preprint arXiv:2306.10062. https://arxiv.org/abs/2306.10062
European Commission Joint Research Centre (JRC). (2025). New JRC collection of external scientific reports to inform the implementation of the EU AI Act on general-purpose AI models. European Commission. https://ai-watch.ec.europa.eu/news/new-jrc-collection-external-scientific-reports-inform-implementation-eu-ai-act-general-purpose-ai-2025-10-14_en
Gubrud, M. A. (1997, November). Nanotechnology and international security. Paper presented at the Fifth Foresight Conference on Molecular Nanotechnology, Palo Alto, CA. Retrieved from https://web.archive.org/web/20070205153112/http://www.foresight.org/Conferences/MNT05/Papers/Gubrud/index.html
Heaven, W. D. (2020, October 15). Artificial general intelligence: Are we close, and does it even make sense to try? MIT Technology Review. https://www.technologyreview.com/2020/10/15/1010461/artificial-general-intelligence-robots-ai-agi-deepmind-google-openai/
Hendrycks, D., Song, D., Szegedy, C., Lee, H., Gal, Y., Brynjolfsson, E., Li, S., Zou, A., Levine, L., Han, B., Fu, J., Liu, Z., Shin, J., Lee, K., Mazeika, M., Phan, L., Ingebretsen, G., Khoja, A., Xie, C., … Bengio, Y. (2025). A definition of AGI. arXiv preprint arXiv:2510.18212. https://arxiv.org/abs/2510.18212
Hernández-Orallo, J. (2024). Caveats and solutions for characterising general-purpose AI. In ECAI 2024 (pp. 2–9). Amsterdam: IOS Press. https://ebooks.iospress.nl/doi/10.3233/FAIA240459
Hernández-Orallo, J., Martínez-Plumed, F., Schmid, U., Siebers, M., & Dowe, D. L. (2016). Computer models solving intelligence test problems: Progress and implications. Artificial Intelligence, 230, 74-107. https://doi.org/10.1016/j.artint.2015.09.011
McCarthy, J. (1987). Generality in artificial intelligence. Communications of the ACM, 30(12), 1030–1035. https://doi.org/10.1145/33447.33448
McGrew, K. S. (2005). The Cattell–Horn–Carroll theory of cognitive abilities: Past, present, and future. In D. P. Flanagan & P. L. Harrison (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (pp. 136–181). New York, NY: The Guilford Press.
OECD. (2025). Introducing the OECD AI capability indicators. Paris: OECD Publishing. https://doi.org/10.1787/be745f04-en
OpenAI. (2018). OpenAI Charter. OpenAI Policy Document. https://openai.com/charter/
Pacchiardi, L., Burden, J., Martínez-Plumed, F., Hernández-Orallo, J., Gómez, E., & Fernández-Llorca, D. (2025). A framework for the categorisation of general-purpose AI models under the EU AI Act. In NeurIPS 2025 Workshop on Regulatable Machine Learning.
Penrose, L. S., & Raven, J. C. (1936). A new series of perceptual tests: Preliminary communication. British Journal of Medical Psychology, 16(2), 97–104.
Sühr, T., Dorner, F. E., Salaudeen, O., Kelava, A., & Samadi, S. (2025). Stop evaluating AI with human tests, develop principled, AI-specific tests instead. arXiv preprint arXiv:2507.23009. https://arxiv.org/abs/2507.23009
Tatel, C. E., Tidler, Z. R., & Ackerman, P. L. (2022). Process differences as a function of test modifications: Construct validity of Raven’s Advanced Progressive Matrices under standard, abbreviated, and speeded conditions – A meta-analysis. Intelligence, 90, 101604. https://doi.org/10.1016/j.intell.2021.101604
Tolan, S., Pesole, A., Martínez-Plumed, F., Fernández-Macías, E., Hernández-Orallo, J., & Gómez, E. (2021). Measuring the occupational impact of AI: Tasks, cognitive abilities and AI benchmarks. Journal of Artificial Intelligence Research, 71, 191–236. https://www.econstor.eu/handle/10419/231334
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236), 433–460. https://doi.org/10.1093/mind/LIX.236.433
Zhou, L., Pacchiardi, L., Martínez-Plumed, F., Collins, K. M., Moros-Dávalos, Y., Zhang, S., … Hernández-Orallo, J. (2025). General scales unlock AI evaluation with explanatory and predictive power. arXiv preprint arXiv:2503.06378. https://arxiv.org/abs/2503.06378








