With respect to METR, yeah, this feels like it falls under my argument against comparing performance against human experts when assessing whether AI is “human-level”. This is not to deny the claim that these tasks may shine a light on fundamentally missing capabilities; as I said, I am not claiming that modern AI is within human range on all human capabilities, only enough that I think “human level” is a sensible label to apply.
However, the point about autonomously making money feels more hard-hitting, and has been repeated by a few other commenters. I can at least concede that this is a very sensible definition of AGI, which pretty clearly has not yet been satisfied. Possibly I should reconsider my position further.
The point about forming societies seems less clear. Productive labor in the current economy is in some ways much more complex and harder to navigate than it would be in a new society built from scratch. The Generative Agents paper gives some evidence in favor of LLM-based agents coordinating social events.
I think humans doing METR’s tasks are more “expert-level” than average “human-level”. But current LLM agents are also far below human performance on tasks that don’t require any special expertise.
From GAIA:

GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. [Note: The latest highest AI agent score is now 39%.] This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA’s philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system’s capability to exhibit similar robustness as the average human does on such questions.
And LLMs and VLLMs seriously underperform humans in VisualWebArena, which tests for simple web-browsing capabilities:
I don’t know if being able to autonomously make money should be a necessary condition to qualify as AGI. But I would feel uncomfortable calling a system AGI if it can’t match human performance at simple agent tasks.
I think METR is aiming for expert-level tasks, but their current task set is closer in difficulty to GAIA and VisualWebArena than what I would consider human expert-level difficulty. It’s tricky to decide, though, since LLMs circa 2024 seem really good at some stuff that is quite hard for humans, and bad at a set of stuff that is easy for humans. If the stuff they are currently bad at gets brought up to human level, without a decrease in skill at the stuff LLMs are above-human at, the result would be a system well into the superhuman range. So where we draw the line for human level necessarily involves a tricky value-weighting problem over the various skills involved.
However, the point about autonomously making money feels more hard-hitting, and has been repeated by a few other commenters. I can at least concede that this is a very sensible definition of AGI, which pretty clearly has not yet been satisfied. Possibly I should reconsider my position further.
This is what jumped out at me when I read your post. A transformer LLM can be described as a “disabled human who is blind to motion, needs seconds to see a still image, is paralyzed, costs expensive resources to keep alive, cannot learn, and has no long-term memory”. Oh, and they finished high school and some college across all majors.
So you ask: “What job can they do, and how much will you pay them?” “Can they support themselves financially?”
And you end up with “well, for most of human history, a human with those disabilities would be a net drain on their tribe. Sometimes they were abandoned to die as a consequence.”
And it implies something like a “can perform robot manipulation and wash dishes” test, or the “make a cup of coffee in a stranger’s house” test. And reliably enough to be paid minimum wage, or at least some money under the table, to do a task like this.
We really could be 3–5 years from that, if all you need for AGI is “video perception, online learning, long-term memory, and 5th–25th percentile human-like robotics control”. Three of the four elements exist in someone’s lab right now; the robotics control maybe not.
This “economic viability test” has an interesting follow-up question. It’s possible for a human to remain alive, living in a car or a tent under a bridge, on a few dollars an hour. This is the “minimum income to survive” for a human. But a robotic system may blow a $10,000 part every 1000 hours, or need $100 an hour of rented B200 compute to think with.
So the minimum hourly rate could be higher. I think maybe we should use the human dollar figures for this “can survive” level of AGI capabilities test, since robotic and compute costs are so easy and fast to optimize.
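To make the break-even point concrete, here is a minimal sketch of the arithmetic, using the hypothetical figures from the comment above (a $10,000 part every 1000 hours, $100/hour of rented compute, a few dollars an hour for human subsistence); none of these are real prices:

```python
# Rough break-even arithmetic for the "economic viability test".
# All figures are illustrative assumptions from the discussion above,
# not real hardware or cloud prices.

def robot_hourly_cost(part_cost=10_000, part_life_hours=1_000, compute_per_hour=100):
    """Minimum hourly revenue a robotic system needs to cover its own costs."""
    amortized_parts = part_cost / part_life_hours  # $10/hour of wear
    return amortized_parts + compute_per_hour

human_subsistence = 3  # $/hour: the "alive in a car or tent" level

print(robot_hourly_cost())                       # 110.0 $/hour for this hypothetical robot
print(robot_hourly_cost() / human_subsistence)   # roughly 37x the human survival floor
```

On these made-up numbers the robot’s survival floor is dozens of times the human’s, which is the reason to prefer the human dollar figure for the test itself.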
Summary:

AGI: when AI systems can completely do a variety of general tasks you would pay a human employee to do, even a low-end one.

Transformative AGI (one of many thresholds): when the AI system can do a task and be paid more than the hourly cost of compute plus the hourly cost of robotics.
Note that “transformation” is reached when the lowest such threshold is crossed. I notice that error all over: lots of people like Daniel and Richard have thresholds where AI will definitely be transformational, such as “can autonomously perform AI research”, but don’t seem to think “can wash dishes or sort garbage and produce more value than operating cost” is transformational. Those events could be decades apart.
The replace-human-labor test gets quite interesting and complex when we start to time-index it. Specifically, two time-indexes are needed: a ‘baseline’ time (when humans are doing all the relevant work) and a comparison time (where we check how much of the baseline economy has been automated).
Without looking anything up, I guess we could say that machines have already automated 90% of the economy, if we choose our baseline from somewhere before industrial farming equipment, and our comparison time somewhere after. But this is obviously not AGI.
A human who can do exactly what GPT-4 can do is not economically viable in 2024, but might have been economically viable in 2020.
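The two time-indexes can be made explicit with a toy calculation; the task categories and labor shares below are invented purely for illustration:

```python
# Toy version of the time-indexed replace-human-labor test.
# Shares are percentages of labor hours at the *baseline* time;
# all numbers here are made up for illustration.

baseline_labor_shares = {"farming": 70, "crafts": 20, "services": 10}  # pre-industrial-ish baseline
automated_by_comparison_time = {"farming", "crafts"}  # what machines handle at the comparison time

def automated_fraction(shares, automated):
    """Percent of baseline-era labor that machines perform by the comparison time."""
    return sum(share for task, share in shares.items() if task in automated)

print(automated_fraction(baseline_labor_shares, automated_by_comparison_time))  # 90
```

With a pre-industrial baseline the fraction comes out around 90 percent without any AGI involved, which is why the test is only meaningful once both indexes are stated.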
Yes, I agree. Whenever I think of things like this, I focus on criticality as the thing that actually matters for “when will AGI be transformational”.
I have written on it earlier, but the simple idea is that our human world changes rapidly when AI capabilities in some way lead to more AI capabilities at a fast rate.
This whole “is this AGI” question is mostly beside the point; all that matters is criticality. You can imagine subhuman systems reaching criticality, or superhuman systems being needed to reach it. (Note that ordinary humans do have criticality, albeit with a doubling time of about 20 years.)
There are many forms of criticality, and the first one unlocked that won’t quench easily starts the singularity.
Examples:
Investment criticality: each AI demo leads to more investment than the total cost (including failures at other companies) of producing the demo. Quenches if investors run out of money or find a better investment sector.
Financial criticality: AI services bring in more revenue than they cost to deliver, and each reinvestment effectively has a greater than 10 percent ROI. This quenches once further reinvestments in AI don’t pay for themselves.
Partial self-replication criticality: robots can build most of the parts used in themselves (I assume post-2020 automation). This quenches at the new equilibrium determined by the percent of automation.
Aka, 90 percent automation makes each remaining human worker 10 times as productive, so we quench at 10x the number of robots that would be possible if every worker on earth were building robots.
Full self-replication criticality: this quenches when all matter mineable in the solar system has been consumed and made into either more robots or waste piles.
AI research criticality: AI systems research and develop better AI systems. Quenches when you find the most powerful AI the underlying compute and data can support.
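One way to read the list above: each form of criticality is compounding growth with an effective multiplier above 1, which continues until some quench condition binds. A minimal sketch, with the growth rate and resource cap as invented placeholders:

```python
# Minimal model of a "criticality": capability compounds each cycle
# until a quench condition (here, a resource cap) binds.
# The multiplier and cap are illustrative placeholders, not estimates.

def run_criticality(initial=1.0, multiplier=1.5, cap=1000.0):
    """Compound until the cap quenches growth; return (final level, cycles run)."""
    level, cycles = initial, 0
    while level * multiplier <= cap:
        level *= multiplier
        cycles += 1
    return level, cycles

level, cycles = run_criticality()
print(level, cycles)  # grows ~1.5x per cycle until the cap binds
```

The interesting question for each real criticality is then not the multiplier but which quench condition binds first.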
You may notice two of these are satisfied, one at the end of 2022 and one in late 2023. So in that sense the Singularity began, and it will accelerate until it quenches; it may very well quench on “all usable matter consumed”.
Ironically, this makes your central point correct: LLMs are a revolution.