Modern Transformers are AGI, and Human-Level

This is my personal opinion, and in particular, does not represent anything like a MIRI consensus; I’ve gotten push-back from almost everyone I’ve spoken with about this, although in most cases I believe I eventually convinced them of the narrow terminological point I’m making.
In the AI x-risk community, I think there is a tendency to ask people to estimate “time to AGI” when what is meant is really something more like “time to doom” (or, better, point-of-no-return). For about a year, I’ve been answering this question “zero” when asked.
This strikes some people as absurd or at best misleading. I disagree.
The term “Artificial General Intelligence” (AGI) was coined in the early 00s to contrast with the then-prevalent paradigm of Narrow AI. I was getting my undergraduate computer science education in the 00s, and my professors held a deep conviction that the correct response to any talk of “intelligence” was “intelligence for what task?”—pursuing intelligence in any kind of generality was unscientific, whereas trying to play chess really well or automatically detect cancer in medical scans was OK.
I think this was a reaction to the AI winter of the 1990s. The grand ambitions of the AI field, to create intelligent machines, had been discredited. Automating narrow tasks still seemed promising. “AGI” was a fringe movement.
As such, I do not think it is legitimate for the AI risk community to use the term AGI to mean ‘the scary thing’—the term AGI belongs to the AGI community, who use it specifically to contrast with narrow AI.
Modern Transformers[1] are definitely not narrow AI.
Calling them narrow AI may still have been plausible in, say, 2019. You might then have argued: “Language models are only language models! They’re OK at writing, but you can’t use them for anything else.” It had been argued for many years that language was an AI-complete task; if you can solve natural-language processing (NLP) sufficiently well, you can solve anything. However, in 2019 it might still have been possible to dismiss this claim. Basically every narrow-AI subfield had people who would argue that their specific subfield was the best route to AGI, or the best benchmark for AGI.
The NLP people turned out to be correct. Modern NLP systems can do most things you would want an AI to do, at some basic level of competence. Critically, if you come up with a new task[2], one which the model has never been trained on, then odds are still good that it will display at least middling competence. What more could you reasonably ask for, to demonstrate ‘general intelligence’ rather than ‘narrow’?
Generative pre-training is AGI technology: it creates a model with mediocre competence at basically everything.
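To make the breadth claim concrete, here is a minimal sketch (my own illustration, not something from the original discussion) of probing a single pre-trained chat model on tasks from unrelated domains via the OpenAI Python client; the task list and model name are arbitrary stand-ins:

```python
# Illustrative sketch: one pre-trained model, several unrelated tasks.
# Assumes the `openai` package (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

tasks = [
    "Summarize the rules of chess in three sentences.",
    "Write a haiku about filing taxes.",
    "Explain, step by step, how to debug a segmentation fault in C.",
    "Translate 'the cat sat on the mat' into French and Japanese.",
]

for task in tasks:
    # The same model handles every task; no per-task fine-tuning.
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; any modern chat model works here
        messages=[{"role": "user", "content": task}],
    )
    print(f"--- {task}\n{response.choices[0].message.content}\n")
```

None of these answers will be expert-level, which is exactly the point: mediocre competence, across basically everything.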
Furthermore, when we measure that competence, it usually falls somewhere within the human range of performance. As a result, it seems sensible to call them human-level as well. It seems to me like people who protest this conclusion are engaging in goalpost-moving.
More specifically, it seems to me like complaints that modern AI systems are “dumb as rocks” are comparing AI-generated responses to human experts. A quote from the dumb-as-rocks essay:
GenAI also can’t tell you how to make money. One man asked GPT-4 what to do with $100 to maximize his earnings in the shortest time possible. The program had him buy a domain name, build a niche affiliate website, feature some sustainable products, and optimize for social media and search engines. Two months later, our entrepreneur had a moribund website with one comment and no sales. So genAI is bad at business.
That’s a bit of a weak-man argument (I specifically searched for “generative ai is dumb as rocks what are we doing”). But it does demonstrate a pattern I’ve encountered. Often, the alternative to asking an AI is to ask an expert; so it becomes natural to get in the habit of comparing AI answers to expert answers. This becomes what we think about when we judge whether modern AI is “any good”—but this is not the relevant comparison we should be using when judging whether it is “human level”.
I’m certainly not claiming that modern transformers are roughly equivalent to humans in all respects. Memory works very differently for them, for example, although that has been significantly improving over the past year. One year ago I would have compared an LLM to a human with a learning disability and memory problems, but who has read the entire internet and absorbed a lot through sheer repetition. Now, those memory problems are drastically reduced.
Edited to add:
There have been many interesting comments. Two clusters of replies stick out to me:
1. One clear notion of “human-level” which these machines have not yet satisfied is the competence to hold down a human job.
2. There’s a notion of “AGI” where the emphasis is on the ability to gain capability, rather than the breadth of capability; this is lacking in modern AI.
Hjalmar Wijk would strongly bet that even if there were more infrastructure in place to help LLMs autonomously get jobs, they would be worse at this than humans. Matthew Barnett points out that economically-minded people have defined AGI in terms of what percentage of human labor the machine is able to replace. I particularly appreciated Kaj Sotala’s in-the-trenches description of trying to get GPT4 to do a job.
Kaj says GPT4 is “stupid in some very frustrating ways that a human wouldn’t be”—giving the example of GPT4 claiming that an appointment has been rescheduled, when in fact it does not even have the calendar access required to do that.
Commenters point out that this is not an unusual customer service experience.
I do want to concede that AIs like GPT4 are quantitatively more “disconnected from reality” than humans, in an important way, which will lead them to “lie” like this more often. I also agree that GPT4 lacks the overall skills which would be required for it to make its way through the world autonomously (it would fail if it had to apply for jobs, build working relationships with humans over a long time period, rent its own server space, etc).
However, in many of these respects, it still feels comparable to the low end of human performance, rather than entirely sub-human. Autonomously making one’s way through the world feels very “conjunctive”—it requires the ability to do a lot of things right.
I never meant to claim that GPT4 is within human range on every single performance dimension; only lots and lots of them. For example, it cannot do realtime vision + motor control at anything approaching human competence (although my perspective leads me to think that this will be possible with comparable technology in the near future).
In his comment, Matthew Barnett quotes Tobias Baumann:
The framing suggests that there will be a point in time when machine intelligence can meaningfully be called “human-level”. But I expect artificial intelligence to differ radically from human intelligence in many ways. In particular, the distribution of strengths and weaknesses over different domains or different types of reasoning is and will likely be different – just as machines are currently superhuman at chess and Go, but tend to lack “common sense”.
I think we find ourselves in a somewhat surprising future where machine intelligence actually turns out to be meaningfully “human-level” across many dimensions at once, although not all.
Anyway, the second cluster of responses I mentioned is perhaps even more interesting. Steven Byrnes has explicitly endorsed “moving the goalposts” for AGI. I do think it can sometimes be sensible to move goalposts; the concept of goalpost-moving is usually invoked in a negative light, but there are times when it must be done. I wish it could be facilitated by a new term, rather than a redefinition of “AGI”; but I am not sure what to suggest.
I think there is a lot to say about Steven’s notion of AGI as the-ability-to-gain-capabilities rather than breadth-of-capability. I’ll leave most of it to the comment section. To briefly respond: I agree that there is something interesting and important here. I currently think AIs like GPT4 have ‘very little’ of this rather than none. I also think individual humans have very little of this. In the anthropological record, it looks like humans were not very culturally innovative for more than a hundred thousand years, until the “creative explosion” which resulted in a wide variety of tools and artistic expression. I find it plausible that this required a large population of humans to get going. Individual humans are rarely truly innovative; more often, we can only introduce basic variations on existing concepts.
[1] I’m saying “transformers” every time I am tempted to write “LLMs” because many modern LLMs also do image processing, so the term “LLM” is not quite right.
[2] Obviously, this claim relies on some background assumption about how you come up with new tasks. Some people are skilled at critiquing modern AI by coming up with specific things which it utterly fails at. I am certainly not claiming that modern AI is literally competent at everything.
However, it does seem true to me that if you generate and grade test questions in roughly the way a teacher might, the best modern Transformers will usually fall comfortably within human range, if not better.