Language models are nearly AGIs but we don’t notice it because we keep shifting the bar
I’m putting my existing work on AI on Less Wrong, and editing as I go, in preparation for publishing a collection of my works on AI in a free online volume. If this content interests you, you could always follow my Substack; it’s free and also under the name Philosophy Bear.
Anyway, enjoy. Comments are appreciated, as I will be rewriting parts of the essays before I put them out. A big thank you to user TAG, who identified a major error in my previous post regarding the Chinese Room thought experiment, which prompted its correction [in the edition that will go in the book] and a new corrections section for my Substack page.
Glossary:
GPT-3: a text-generating language model.
PaLM-540B: a stunningly powerful question-answering language model.
Great Palm: a hypothetical language model that combines the powers of GPT-3 and PaLM-540B. Probably buildable with current technology, a lot of money and a little elbow grease.
Great Palm with continuous learning (GPWCL): a hypothetical language model that combines the capacities of GPT-3 and PaLM-540B, with an important additional capacity. Most language models work over a “window” of text, which functions as short-term memory. Their long-term memory is set by their training. Continuous learning is the capacity to keep adding to long-term memory as you go, and this would allow a language model to tackle much longer texts.
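To make that distinction concrete, here is a minimal illustrative sketch in Python-flavoured pseudocode. Nothing in it refers to a real API: `generate` and `fine_tune` are hypothetical stand-ins, and the text is treated as an already-tokenized list.

```python
WINDOW_TOKENS = 2048  # short-term memory: the model only "sees" this many tokens at once

def answer_frozen(model, history, question):
    """Ordinary language model: long-term memory is fixed at training time.

    `history` and `question` are assumed to be lists of tokens; anything in
    `history` that has scrolled out of the window is simply lost.
    """
    prompt = (history + question)[-WINDOW_TOKENS:]
    return model.generate(prompt)          # `generate` is a hypothetical method

def answer_continuous(model, history, question):
    """Continuous learning: text that falls outside the window is folded back
    into the model's weights instead of being discarded."""
    overflow = history[:-WINDOW_TOKENS]
    if overflow:
        model = model.fine_tune(overflow)  # hypothetical: add to long-term memory
    prompt = (history[-WINDOW_TOKENS:] + question)[-WINDOW_TOKENS:]
    return model.generate(prompt)
```

The only difference between the two functions is that the second folds overflow back into the weights rather than throwing it away.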
The argument
What I’ll be doing in this short essay is a bit cheeky, but I think we’ll make a few important points, viz:
Goals that seem very concrete can turn out to be vulnerable to bar-shifting, shifting we may scarcely even notice.
AGI is such a goal.
We have gotten very good, much too good, at denying the progress we have made toward AGI.
A focus on able-bodied humanity, and a tendency to forget that disabled people exist when thinking about these topics, deceives us in these matters.
If I’m being a bit of a gadfly here, it’s not without a purpose.
Everything I say in this article arguably applies to GPT-3 alone, but for the avoidance of doubt, let me specify that I’m talking about a hypothetical language model that has the fluency of GPT-3 and the question-answering capabilities of PaLM-540B, which we will call the Great Palm to make it clear that we’re not taking ourselves too seriously. In my view, the Great Palm is very close to being an AGI.
I think the Great Palm lacks only one thing, the capacity for continuous learning: the capacity to remember the important bits of everything it reads, and not just in its training period. If Great Palm (GPT-3 + PaLM-540B) had that ability, it would be an AGI.
“But hang on,” you say, “Great Palm can’t draw, it can’t play computer games, it can’t listen to music, it can’t so much as discriminate an apple from a banana, and adding on a capacity for continuous learning doesn’t change that.”
I have two responses.
Response 1: Sure, but neither could noted author, activist, and communist intellectual Helen Keller, nor could other completely deaf and blind people, all of whom are general intelligences.
Response 2: Actually, it may be able to do some of these things so long as you can convert them into the modality of text. It’s quite conceivable that Great Palm could analyze music, for example, if the notation were converted into text. We should focus more on content than modality.
Why do I say that Great Palm with a capacity for continuous learning would be an artificial general intelligence? Because it could attempt basically all the tasks that a human with access to a text-input, text-output console and nothing more could attempt, and make a reasonable go at them. In the case of Great Palm with continuous learning, looking at what PaLM-540B and GPT-3 can do, it’s actually hard to find tasks at which the average human could beat it (look at the MMLU dataset if you don’t believe me: they’re tough questions). That kind of broad scope is comparable to the scope of many humans.
To be clear, I am absolutely not saying that, for example, Helen Keller could only answer text-input, text-output problems. There are numerous other sensory modalities: touch, taste, and so on. Helen Keller could navigate a maze, whereas Great-Palm-With-Continuous-Learning could only do that if the maze were described to it. I suppose this suggests a possible line of counterargument. We could disqualify Great-Palm-With-Continuous-Learning by adding a disjunctive requirement like “AGIs must be proficient in at least one of touch, taste, smell, sight or hearing”, but that seems arbitrary to me.
I’m not exactly going to proffer a definition of AGI here, but it seems to me that entities that can make a reasonable go at almost all text-input, text-output tasks count as AGIs. At the very least, imposing the need to be able to use particular sensory modalities is not only wrongly human-centric, but it also doesn’t even account for all human experience (e.g. the deaf and blind).
Objections:
What about commonsense reasoning: Maybe you’re worried about commonsense reasoning. Looking at PaLM’s capabilities, its performance on commonsense reasoning tasks is at human level, or very close to it. For example, PaLM-540B scored ~96% on the Winograd Schema test. My recollection is that most humans don’t score that high, but the authors set the bar at 100 because they reasoned that a human properly paying attention would get full marks [at least I seem to recall that’s why they changed it to 100 between GLUE and SuperGLUE]. Requiring 100% of human performance on commonsense reasoning tasks to count as an AGI seems to me like special pleading. Near enough is good enough to count.
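For readers unfamiliar with them, Winograd schemas are pronoun-resolution puzzles that trade on commonsense. Here is a toy sketch of how such an evaluation might be scored; the `ask_model` callable is a hypothetical stand-in for querying whatever model you like, and the two items are the classic trophy/suitcase pair rather than anything from PaLM’s actual test set.

```python
# Two classic Winograd-style items: flipping one adjective flips what "it" refers to.
WINOGRAD_ITEMS = [
    {"prompt": "The trophy doesn't fit in the suitcase because it is too big. "
               "What is too big?",
     "answer": "trophy"},
    {"prompt": "The trophy doesn't fit in the suitcase because it is too small. "
               "What is too small?",
     "answer": "suitcase"},
]

def winograd_accuracy(ask_model):
    """Score a model on the toy items above.

    `ask_model` is any callable taking a prompt string and returning a reply
    string (hypothetical; plug in whatever model access you have).
    """
    correct = sum(item["answer"] in ask_model(item["prompt"]).lower()
                  for item in WINOGRAD_ITEMS)
    return correct / len(WINOGRAD_ITEMS)
```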
What about the Turing test: Would the Great Palm continuous learning edition be able to pass the Turing test reliably? I don’t know. I’m confident it could pass it sometimes, and I’m confident it could pass it more reliably than some humans, humans who are undoubtedly general intelligences. Language models have gotten very good at Turing tests, after all.
Surely there are some tasks it cannot do: Is it not possible that there might be some tasks that humans can do that Great Palm with continuous learning (GPWCL) can’t do? I’d say it’s probable! Nonetheless, the great bulk of tasks an average literate human could do, GPWCL can do, and it’s quite difficult to find counterexamples. I think that insisting that AGI requires a computer to be able to perform literally every task a literate human can do is special pleading. If we encountered aliens, for example, it’s quite likely that there would be some tasks the average human can do that the average alien couldn’t do (and vice versa); this wouldn’t exclude either species from counting as a general intelligence.
Haven’t you just arbitrarily drawn a line around text-input, text-output problems and said “being able to do the majority of these is enough for AGI”? Sure, definitions of AGI that exclude the deaf and the blind may be wrong, but that doesn’t prove text alone is sufficient. Maybe some third definition that includes Helen Keller but excludes Great-Palm-With-Continuous-Learning is right: Ultimately, this will come down to a definitional debate. However, when we focus on the content of problems rather than their modality, it becomes clear that the range of text-input, text-output tasks is vast, one might even say general.
What if there are other huge categories of text-input, text-output tasks that Great Palm with continuous learning could not attempt that you are unaware of: Am I certain that continuous learning is the only thing holding something like Great Palm back from the vast bulk of literate-human-accessible tasks? No, I’m not certain. I’m very open to counterexamples; if you have any, put them in the comments. Nonetheless, PaLM can do a lot of things, GPT-3 can do a lot of things, and when you put them together, the only things that stand out to me as obviously and qualitatively missing in the domain of text input and text output involve continuous learning.
Am I saying that text input, text output is the only way to prove intelligence? Absolutely not! The vast majority of humans who ever lived were illiterate. However, it seems general enough to me to qualify. It is sufficient, not necessary.
Aren’t you treating continuous learning as if it were a very easy problem, a negligible barrier, when in fact it’s very hard? That’s not my intention. I recognize that it is very hard. That said, at a guess, it is probably possible to make Great Palm sans continuous learning now. Adding on the continuous learning component will take time, but I would be very surprised if it took anywhere near as much time as it took us to reach GPT-3 and PaLM-540B.
Implications
Turing proposed the Turing test as a test for something like AGI, but since then the concept of AGI seems to have somewhat metastasized. For example, Metaculus gives these as the requirements to qualify as a “weakly general” AGI:
Able to reliably pass a Turing test of the type that would win the Loebner Silver Prize.
Able to score 90% or more on a robust version of the Winograd Schema Challenge, e.g. the “Winogrande” challenge or comparable data set for which human performance is at 90+%
Be able to score 75th percentile (as compared to the corresponding year’s human students; this was a score of 600 in 2016) on all the full mathematics section of a circa-2015-2020 standard SAT exam, using just images of the exam pages and having less than ten SAT exams as part of the training data. (Training on other corpuses of math problems is fair game as long as they are arguably distinct from SAT exams.)
Be able to learn the classic Atari game “Montezuma’s revenge” (based on just visual inputs and standard controls) and explore all 24 rooms based on the equivalent of less than 100 hours of real-time play (see closely-related question.)
And these as the requirements for a strong AGI:
Able to reliably pass a 2-hour, adversarial Turing test during which the participants can send text, images, and audio files (as is done in ordinary text messaging applications) during the course of their conversation. An ‘adversarial’ Turing test is one in which the human judges are instructed to ask interesting and difficult questions, designed to advantage human participants, and to successfully unmask the computer as an impostor. A single demonstration of an AI passing such a Turing test, or one that is sufficiently similar, will be sufficient for this condition, so long as the test is well-designed to the estimation of Metaculus Admins.
Has general robotic capabilities, of the type able to autonomously, when equipped with appropriate actuators and when given human-readable instructions, satisfactorily assemble a (or the equivalent of a) circa-2021 Ferrari 312 T4 1:8 scale automobile model. A single demonstration of this ability, or a sufficiently similar demonstration, will be considered sufficient.
High competency at a diverse fields of expertise, as measured by achieving at least 75% accuracy in every task and 90% mean accuracy across all tasks in the Q&A dataset developed by Dan Hendrycks et al..
Able to get top-1 strict accuracy of at least 90.0% on interview-level problems found in the APPS benchmark introduced by Dan Hendrycks, Steven Basart et al. Top-1 accuracy is distinguished, as in the paper, from top-k accuracy in which k outputs from the model are generated, and the best output is selected.
But to me, these aren’t really definitions of AGI. They’re definitions of an AGI that uses the visual, auditory and kinaesthetic sensory modalities. Setting this as the bar for AGI effectively excludes some disabled people from counting as general intelligences, which is not desirable! That alone makes it worth correcting. But it also has another undesirable effect. Adding this onto the concept of intelligence is a form of bar-shifting that prevents us from recognizing our progress. This sort of bar-shifting is part of a general pattern of thought that means we keep being taken by surprise by our own achievements in machine learning.
Also, the second set of requirements especially, but to a certain degree the first as well, is much too hard. Almost no human being would pass all of the second set. A solid majority would not pass the first set. This also contributes to the bar-shifting problem. But that’s a matter for a different essay.
There’s an old joke in the field that intelligence is whatever it is that we can’t get computers to do at the moment. Let’s try to avoid that!
Cerebras are saying they can handle 50,000-token context windows. That’s about 30K-40K words (at a rough 0.6-0.8 words per token): the amount one might type in a day, typing quickly and without rest, or half a short novel.
This sort of context window makes improvement to short-term memory largely unnecessary, as running within a single context window instantiates day-long spurs (temporary instances of human imitations whose detailed short experiences are to be forgotten), or bureaucracies of such spurs. Also, speaking internal monologues into the context window to reason out complicated arguments lifts any bounds one-step token prediction might place on them. If a bureaucracy were to prepare a report, it could be added to the next batch of sequence-prediction learning, improving the model’s capabilities or the alignment properties it was intended to improve.
So all that remains is some fine-tuning, hopefully with conditioning and not RLHF.
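A loose sketch of the loop described above, with entirely hypothetical `generate` and `train_on` methods standing in for a real model API:

```python
def run_bureaucracy(model, task, n_spurs=5):
    # Each "spur" is a fresh pass through the context window working on part of
    # the task; its detailed short-term experience is discarded afterwards.
    notes = [model.generate(f"Spur {i}: work on this sub-task of {task!r}.")
             for i in range(n_spurs)]
    # A final pass compiles the spurs' notes into a report.
    return model.generate("Compile these notes into a report:\n" + "\n".join(notes))

def improvement_loop(model, tasks):
    # Each report is added to the next batch of sequence-prediction training,
    # folding the day's work back into the model's long-term memory.
    for task in tasks:
        report = run_bureaucracy(model, task)
        model = model.train_on(report)   # hypothetical training step
    return model
```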
I would’ve thought that PaLM was better at text generation than GPT-3 by default. They’re both pretrained on internet next-word prediction, and PaLM is bigger with more data. What makes you think GPT-3 is better at text generation?
I’m puzzled by this as well. For a moment I thought maybe PaLM used an encoder-decoder architecture, but no, it uses next-word prediction just like GPT-3. Not sure what GPT-3 has that PaLM lacks. A model with the parameter count of PaLM and the training dataset size of Chinchilla would be a better hypothetical for “Great Palm”.
I have independently come to much the same conclusions, with some different details about what I think the missing pieces are. I think we are on the brink of a generality threshold sufficient for enabling recursive self-improvement that accelerates (fooms) rather than decelerates towards a nearby asymptote (fizzles). I’ve been trying to convince people of this, but I feel like my voice alone has been insufficient to change minds much. I’m glad others are also noticing and speaking out about this.
Maybe we should think explicitly about what work is done by the concept of AGI, but I do not feel like calling GPT an AGI does anything interesting to my world model. Should I expect ChatGPT to beat me at chess? Its next version? If not, is it due to a shortage of data or compute? Will it take over the world? If not, may I conclude that the next AGI wouldn’t?
I understand why the bar-shifting thing looks like motivated reasoning, and probably most of it actually is, but it deserves much more credit than you give it. We have an undefined concept of “something with virtually all the cognitive abilities of a human, that can therefore do whatever a human can”, and some dubious assumptions like “if it can sensibly talk about everything, it can probably understand everything”. Then we encounter ChatGPT, and it is amazing at speaking, except that it gives a strong impression of talking to an NPC: an NPC who knows lots of stuff and can even sort-of-reason in very constrained ways, do basic programming and be “creative” as in writing poetry, but is sub-human at things like gathering useful information, inferring people’s goals, etc. So we conclude that some cognitive ability is still missing, and try to think how to correct for that.
Now, I do not care to call GPT an AGI, but you will have to invent a name for the super-AGI things that we try to achieve next, and know to be possible because humans exist.
Thanks for sharing your thoughts @philosophybear. I found it helpful to engage with them. Here are a couple of comments.
Let’s see if I can find a counter-example to this claim.
Would Great Palm be capable of making scientific advances? If so, could you please outline how you expect it to do that?
Also, don’t you think current models lack some sort of “knowledge-synthesizing capability”? After all, GPT and PaLM have been trained on a lot of text. There are novel insights to be had from having read tons of biology, mathematics, and philosophy that no one ever saw in that combination.
Also, are you leaving “proactive decision-making” out of your definition on purpose? I expect a general intelligence (in the AI-safety-relevant context) to want to shape the world to achieve a goal through interacting with it.
You talk a lot about continuous learning but fail to give a crisp definition of what that would mean. I have difficulty creating a mental image (prototypical example) of what you’re saying. Can you help me understand what you mean?
Also, what exactly do you mean by mixing GPT-3 with PaLM? What fundamental differences in their methods can you see that would enhance the other model if applied to it?
It seems like the two definitions you’re invoking are concrete and easy to measure. In my view, they are valuable yardsticks by which we can measure our progress. You’re lamenting these definitions but don’t seem to be providing one yourself. I appreciate that you pointed out the “shifting bar” phenomenon and think that this is a poignant observation. However, I’d like to see you come up with a crisper definition of your own.
Lastly, a case can be made that the bar isn’t actually shifting. It might just be that we didn’t have a good definition of the bar for AGI in the first place. Perhaps the problem was with the original definition of the bar for AGI, not with its change.
I generally agree with this, primarily because I believe Jacob Cannell’s timelines, and I believe that AI is progressing continuously, without major discontinuities in either direction.
I think the biggest piece of an actual GI that is missing from text extenders is agency. Responding to prompts and answering questions is one thing, but deciding what to do/write about next isn’t even a theoretical part of their functionality.
I’m puzzled by the apparent tension between upvoting the importance of continuous learning on the one hand and downvoting agreement with agency on the other. When transformers produce something that sounds non-human, it’s usually because of consistency mistakes (like explaining at length that it can’t speak Danish… in well-formed Danish sentences). Maybe it’s true that continuous learning can solve the problem (if that includes learning from its own responses, perhaps?). But wouldn’t we perceive that as exhibiting agency?
That doesn’t seem like it would be a problem if it were connected to something where people constantly interacted with it. Then the model’s actions would be output constantly, and it seems like there would be no important difference between that and it acting unprompted (heh).
The physical world is also acting continuously based on inputs it receives from people, and we don’t say “The Earth” is an intelligence.
That’s true. Earth doesn’t act like an intelligent agent, but a model could. A current model could simulate the verbal output of a human, and that output could be connected to some actuators (or biological humans) that would allow it to act in the world. Also, Earth can’t comprehend new concepts, correctly apply them and solve problems.
I was thinking along similar lines. I note that someone with amnesia probably remains generally intelligent, so I am not sure continuous learning is really required.