I agree the term AGI is rough and might be more misleading than it’s worth in some cases. But I do quite strongly disagree that current models are ‘AGI’ in the sense most people intend.
Examples of very important areas where ‘average humans’ plausibly do way better than current transformers:
Most humans succeed in making money autonomously. Even if they might not come up with a great idea to quickly 10x $100 through entrepreneurship, they are able to find and execute jobs that people are willing to pay a lot of money for. And many of these jobs are digital and could in theory be done just as well by AIs. Certainly there is a ton of infrastructure built up around humans that helps them accomplish this, which doesn’t really exist for AI systems yet, but if this situation were somehow equalized I would very strongly bet on the average human doing better than the average GPT-4-based agent. It seems clear to me that humans are just way more resourceful, agentic, and able to learn and adapt than current transformers are in key ways.
Many humans currently do drastically better on the METR task suite (https://github.com/METR/public-tasks) than any AI agents, and I think this captures some important missing capabilities that I would expect an ‘AGI’ system to possess. This is complicated somewhat by the human subjects not being ‘average’ in many ways, e.g. we’ve mostly tried this with US tech professionals and the tasks include a lot of SWE, so most people would likely fail due to lack of coding experience.
Take enough randomly sampled humans and set them up with the right incentives and they will form societies, invent incredible technologies, build productive companies, etc., whereas I don’t think you’ll get anything close to this with a bunch of GPT-4 copies at the moment.
I think AGI for most people evokes something that would do as well as humans on real-world things like the above, not just something that does as well as humans on standardized tests.
Current AIs suck at agency skills. Put a bunch of them in AutoGPT scaffolds, give them each their own computer, access to the internet, and contact info for each other, and let them run autonomously for weeks and… well, I’m curious to find out what will happen; I expect it to be entertaining but not impressive or useful. Whereas, as you say, randomly sampled humans would form societies and find jobs, etc.
This is the common thread behind all your examples, Hjalmar. Once we teach our AIs agency (i.e. once they have lots of training experience operating autonomously in pursuit of goals in sufficiently diverse/challenging environments that they generalize rather than overfit to their environment), then they’ll be AGI imo. And also takeoff will begin, takeover will become a real possibility, etc. Off to the races.
Yeah, I agree that lack of agency skills is an important part of the remaining human<>AI gap, and that it’s possible that this won’t be too difficult to solve (and that this could then lead to rapid further recursive improvements). I was just pointing toward evidence that there is a gap at the moment, and that current systems are poorly described as AGI.
Yeah I wasn’t disagreeing with you to be clear. Just adding.
With respect to METR, yeah, this feels like it falls under my argument against comparing performance against human experts when assessing whether AI is “human-level”. This is not to deny the claim that these tasks may shine a light on fundamentally missing capabilities; as I said, I am not claiming that modern AI is within human range on all human capabilities, only enough that I think “human level” is a sensible label to apply.
However, the point about autonomously making money feels more hard-hitting, and has been repeated by a few other commenters. I can at least concede that this is a very sensible definition of AGI, which pretty clearly has not yet been satisfied. Possibly I should reconsider my position further.
The point about forming societies seems less clear. Productive labor in the current economy is in some ways much more complex and harder to navigate than it would be in a new society built from scratch. The Generative Agents paper gives some evidence in favor of LLM-based agents coordinating social events.
I think humans doing METR’s tasks are more like “expert-level” than average/“human-level”. But current LLM agents are also far below human performance on tasks that don’t require any special expertise.
From GAIA:

“GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. [Note: The latest highest AI agent score is now 39%.] This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA’s philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system’s capability to exhibit similar robustness as the average human does on such questions.”
And LLMs and VLLMs seriously underperform humans in VisualWebArena, which tests for simple web-browsing capabilities.
I don’t know if being able to autonomously make money should be a necessary condition to qualify as AGI. But I would feel uncomfortable calling a system AGI if it can’t match human performance at simple agent tasks.
I think METR is aiming for expert-level tasks, but I think their current task set is closer in difficulty to GAIA and VisualWebArena than what I would consider human expert-level difficulty. It’s tricky to decide, though, since LLMs circa 2024 seem really good at some stuff that is quite hard for humans, and bad at a set of stuff that is easy for humans. If the stuff they are currently bad at gets brought up to human level, without a decrease in skill at the stuff LLMs are above-human at, the result would be a system well into the superhuman range. So where we draw the line for human level necessarily involves a tricky value-weighting problem over the various skills involved.
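To make that value-weighting problem concrete, here is a toy sketch; every skill name, ratio, and weight below is hypothetical, chosen only to show that the same system can land above or below “human level” depending on how the skills are weighted.

```python
# Toy illustration of the value-weighting problem: the aggregate verdict on
# "human level" depends on how much weight each skill gets. All numbers are
# made-up placeholders, not measurements.

skill_ratio = {  # assumed AI-performance / average-human-performance ratios
    "breadth_of_knowledge": 3.0,   # well above the average human
    "standardized_tests": 1.5,
    "simple_web_tasks": 0.2,       # well below human (GAIA/VisualWebArena-style)
    "autonomous_earning": 0.1,
}

def aggregate(weights):
    """Weighted average of the AI/human ratios under a given value weighting."""
    total = sum(weights.values())
    return sum(weights[s] * skill_ratio[s] for s in skill_ratio) / total

knowledge_heavy = {"breadth_of_knowledge": 4, "standardized_tests": 3,
                   "simple_web_tasks": 1, "autonomous_earning": 1}
agency_heavy = {"breadth_of_knowledge": 1, "standardized_tests": 1,
                "simple_web_tasks": 3, "autonomous_earning": 4}

print(aggregate(knowledge_heavy))  # ~1.9: looks at or above human level
print(aggregate(agency_heavy))     # ~0.6: looks clearly below human level
```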
This is what jumped out at me when I read your post. A transformer LLM can be described as a “disabled human who is blind to motion and needs seconds to see a still image, is paralyzed, costs expensive resources to live, cannot learn, and has no long-term memory.” Oh, and they finished high school and some college across all majors.
“What job can they do, and how much will you pay?” “Can they support themselves financially?”
And you end up with “well, for most of human history, a human with those disabilities would be a net drain on their tribe. Sometimes they were abandoned to die as a consequence.”
And it implies something like “can perform robot manipulation and wash dishes,” or the “make a cup of coffee in a stranger’s house” test, done reliably enough to be paid minimum wage, or at least some money under the table, for a task like this.
We really could be 3-5 years from that, if all you need for AGI is “video perception, online learning, long-term memory, and 5th-25th percentile human-like robotics control.” Three of the four elements exist in someone’s lab right now; the robotics control maybe not.
This “economic viability test” has an interesting follow-up question. It’s possible for a human to remain alive, living in a car or a tent under a bridge, on a few dollars an hour. This is the “minimum income to survive” for a human. But a robotic system may blow a $10,000 part every 1,000 hours, or need $100 an hour of rented B200 compute to think with.
So the minimum hourly rate could be higher. I think maybe we should use the human dollar figures for this “can survive” level of AGI capabilities test, since robotic and compute costs are so easy and fast to optimize.
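To spell out the arithmetic, here is a minimal sketch using the figures above; the human “few dollars an hour” is taken as $3 purely as a stand-in.

```python
# Minimal sketch (hypothetical figures from the comment above) comparing the
# human "minimum income to survive" with a robot's hourly operating cost,
# which sets the robot's break-even wage.

human_survival_per_hour = 3.0      # stand-in for "a few dollars an hour"

part_cost = 10_000                 # "$10,000 part every 1000 hours"
part_lifetime_hours = 1_000
compute_per_hour = 100.0           # "$100 an hour of rented B200 compute"

robot_cost_per_hour = part_cost / part_lifetime_hours + compute_per_hour

print(f"Human survival floor:  ${human_survival_per_hour:.2f}/hr")
print(f"Robot operating floor: ${robot_cost_per_hour:.2f}/hr")
# The robot's break-even wage (~$110/hr here) sits far above the human floor,
# which is why the comment suggests using the human figures for the test.
```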
Summary:
AGI: when AI systems can completely do a variety of general tasks you would pay a human employee to do, even a low-end one.
Transformative AGI (one of many thresholds): when the AI system can do a task and be paid more than the hourly cost of compute plus the hourly robotics cost.
Note that “transformation” is reached when the lowest threshold is reached. I’ve noticed that error all over: lots of people like Daniel and Richard have thresholds where AI will definitely be transformational, such as “can autonomously perform AI research,” but don’t seem to think “can wash dishes or sort garbage and produce more value than operating cost” is transformational.
Those events could be decades apart.
The replace-human-labor test gets quite interesting and complex when we start to time-index it. Specifically, two time-indexes are needed: a ‘baseline’ time (when humans are doing all the relevant work) and a comparison time (where we check how much of the baseline economy has been automated).
Without looking anything up, I guess we could say that machines have already automated 90% of the economy, if we choose our baseline from somewhere before industrial farming equipment, and our comparison time somewhere after. But this is obviously not AGI.
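Here is a minimal sketch of that time-indexed test; the task list and automation dates are entirely made up, and the only point is that the measured fraction swings wildly with the choice of baseline year.

```python
# Sketch of the time-indexed replace-human-labor test. Tasks and dates are
# invented for illustration; the takeaway is that "fraction automated"
# depends on which baseline year you pick.

tasks = {
    # task: year it was (hypothetically) automated, or None if still human-done
    "plowing fields": 1920,
    "routine bookkeeping": 1980,
    "drafting boilerplate text": 2023,
    "plumbing repair": None,
    "novel scientific research": None,
}

def fraction_automated(baseline_year, comparison_year):
    """Share of tasks humans still did at baseline that machines do by comparison."""
    at_baseline = [t for t, y in tasks.items() if y is None or y > baseline_year]
    automated = [t for t in at_baseline
                 if tasks[t] is not None and tasks[t] <= comparison_year]
    return len(automated) / len(at_baseline)

print(fraction_automated(1900, 2024))  # 0.6: high against a pre-tractor baseline
print(fraction_automated(2015, 2024))  # ~0.33: much lower against a recent baseline
```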
A human who can do exactly what GPT-4 can do is not economically viable in 2024, but might have been economically viable in 2020.
Yes, I agree. Whenever I think about things like this, I focus on the idea that what matters for “when will AGI be transformational” is criticality.
I have written on it earlier, but the simple idea is that our human world changes rapidly when AI capabilities in some way lead to more AI capabilities at a fast rate.
Like, this whole “is this AGI” thing is totally irrelevant; all that matters is criticality. You can imagine subhuman AGI systems reaching criticality, and you can imagine superhuman systems being needed. (Note that ordinary humans do have criticality, albeit with a doubling time of about 20 years.)
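As a small aside on that parenthetical, the 20-year doubling time converts to a modest annual growth factor; a quick check using the figure quoted above:

```python
import math

# Convert the ~20-year human doubling time quoted above into an annual
# growth factor; "criticality" here just means the factor stays above 1.
doubling_time_years = 20
annual_growth = math.exp(math.log(2) / doubling_time_years)
print(f"Growth factor per year: {annual_growth:.3f}")  # ~1.035, i.e. ~3.5%/year
```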
There are many forms of criticality, and the first one unlocked that won’t quench easily starts the singularity.
Examples:
Investment criticality: each AI demo leads to more investment than the total cost of producing the demo, including failures at other companies. Quenches if investors run out of money or find a better investment sector.
Financial criticality: services delivered by AI bring in more revenue than they cost, and each reinvestment effectively has a greater than 10 percent ROI. This quenches once further reinvestments in AI don’t pay for themselves (a rough sketch of this loop appears after these examples).
Partial self-replication criticality: robots can build most of the parts used in themselves (I use post-2020 automation for the rest). This quenches at the new equilibrium determined by the percentage of automation.
I.e., 90 percent automation makes each remaining human worker 10 times as productive, so we quench at 10x the number of robots that would be possible if every worker on earth were building robots.
Full self-replication criticality: this quenches when matter mineable in the solar system is all consumed and made into either more robots or waste piles.
AI research criticality: AI systems research and develop better AI systems. Quenches when you find the most powerful AI the underlying compute and data can support.
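To make the quench condition concrete, here is a minimal sketch of the financial-criticality loop; the diminishing-returns curve and all dollar figures are hypothetical, chosen only to show reinvestment continuing until the marginal ROI drops below the 10 percent threshold.

```python
# Sketch of the financial-criticality loop: reinvest AI revenue while each
# round returns more than a threshold ROI; the loop "quenches" when marginal
# ROI falls below that threshold. All numbers are hypothetical.

def run_reinvestment_loop(capital, roi_schedule, roi_threshold=0.10):
    """Reinvest returns each round until the marginal ROI falls below threshold."""
    rounds = 0
    while True:
        roi = roi_schedule(capital)
        if roi <= roi_threshold:       # quench condition from the comment
            return capital, rounds
        capital *= 1.0 + roi           # returns are reinvested
        rounds += 1

# Hypothetical diminishing-returns curve: ROI starts near 50% and shrinks as
# more capital chases the same opportunities.
def diminishing(capital):
    return 0.50 * (1e9 / (1e9 + capital))

final_capital, rounds = run_reinvestment_loop(capital=1e8, roi_schedule=diminishing)
print(f"Quenched after {rounds} rounds with ~${final_capital:,.0f} deployed")
```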
You may notice two of these are already satisfied: one at the end of 2022, one later in 2023. So in that sense the Singularity began and will accelerate until it quenches, and it may very well quench on “all usable matter consumed.”
Ironically, this makes your central point correct. LLMs are a revolution.