Thanks for the thought-provoking post!

Evolving language (and other forms of social learning) poses at least a bit of a chicken-and-egg problem—you need speakers putting rich conceptual information into the sounds going out, and listeners trying to match the sounds coming in to rich conceptual information. Likewise, you need certain mental capabilities to create a technological society, but if you’re not already in that technological society, there isn’t necessarily an evolutionary pressure to have those capabilities. I suspect (not having thought about it very much) that these kinds of chicken-and-egg problems are why it took evolution so long to create human-like intelligence.
AGI wouldn’t have those chicken-and-egg problems. I think GPT-3 shows that just putting an AI in an environment with human language, and flagging the language as an important target for self-supervised learning, is already enough to coax the system to develop a wide array of human-like concepts. Now, GPT-3 is not an AGI, but I think it’s held back by having the wrong architecture, not the wrong environment. (OK, well, giving it video input couldn’t hurt.)
I’m also a bit confused about your reference to “Rich Sutton’s bitter lesson”. Do you agree that Transformers learn more / better in the same environment than MLPs? That LSTMs learn more / better in the same environment than simpler RNNs? If so, why not suppose that a future yet-to-be-discovered architecture, in the same training environment, will wind up more AGI-ish? (For what little it’s worth, I have my own theory along these lines: that we’re going to wind up with systems closer to today’s probabilistic programming and PGMs than to today’s DNNs.)

I’m not very confident about any of this. :-)
I like and agree with this point, and have made a small edit to the original post to reflect that. However, while I don’t dispute that GPT-3 has some human-like concepts, I’m less sure about its reasoning abilities, and it’s pretty plausible to me that self-supervised training on language alone plateaus before we get to a GPT-N that can reason. I’m also fairly uncertain about this, but these types of environmental difficulties are worth considering.
> I’m also a bit confused about your reference to “Rich Sutton’s bitter lesson”. Do you agree that Transformers learn more / better in the same environment than MLPs? That LSTMs learn more / better in the same environment than simpler RNNs?
Yes, but my point is that the *content* comes from the environment, not the architecture. We haven’t tried to leverage our knowledge of language by, say, using a different transformer for each part of speech. I (and I assume Sutton) agree that we’ll have increasingly powerful models, but they’ll also be increasingly general—and therefore the question of whether a model with the capacity to become an AGI does so or not will depend to a significant extent on the environment.
Thanks! I’m still trying to zero in on where you’re coming from in the Rich Sutton thing, and your response only makes me more confused. Let me try something, and then you can correct me...
My (caricatured) opinion is: “Transformers-trained-by-SGD can’t reason. We’ll eventually invent a different architecture-and-learning-algorithm that is suited to reasoning, and when we run that algorithm on the same text prediction task used for GPT-3, it will become an AGI, even though GPT-3 didn’t.”
Your (caricatured) opinion is (maybe?): “We shouldn’t think of reasoning as a property of the architecture-and-learning-algorithm. Instead, it’s a property of the learned model. Therefore, if Transformers-trained-by-SGD-on-text-prediction can’t reason, that is evidence that the text-prediction task is simply not one that calls for reasoning. That in turn suggests that if we keep the same text prediction task, but substitute some unknown future architecture and unknown future learning algorithm, it also won’t be able to reason.”
Is that anywhere close to where you’re coming from? Thanks for bearing with me.
Not Richard, but I basically endorse that description as a description of my own view. (Note however that we don’t yet know that Transformers-trained-by-SGD-on-text-prediction can’t reason; I for one am not willing to claim that scaling even further will not result in reasoning.)
It’s not a certainty—it’s plausible that text prediction is enough, if you just improved the architecture and learning algorithm a little bit—but I doubt it, except in some degenerate sense that you could put a ton of information / inductive bias into the architecture and make it an AGI that way.
I endorse Steve’s description as a caricature of my view, and also Rohin’s comment. To flesh out my view a little more: I think that GPT-3 doing so well on language without (arguably) being able to reason is the same type of evidence as Deep Blue or AlphaGo doing well at board games without being able to reason (although significantly weaker). In both cases it suggests that just optimising for this task is not sufficient to create general intelligence. While it now seems pretty unreasonable to think that a superhuman chess AI would by default be generally intelligent, that seems not too far off from what people used to think.
Now, it might be the case that the task doesn’t matter very much for AGI if you “put a ton of information / inductive bias into the architecture”, as Rohin puts it. But I interpret Sutton to be arguing against our ability to do so.
> We’ll eventually invent a different architecture-and-learning-algorithm that is suited to reasoning
There are two possible interpretations of this, one of which I agree with and one of which I don’t. I could interpret you as saying that we’ll eventually develop an architecture/learning algorithm biased towards reasoning ability—I disagree with this.

Or you could be saying that future architectures will be capable of reasoning in ways that transformers aren’t, by virtue of just being generally more powerful. Which seems totally plausible to me.
Got it!

Yeah, I think that reasoning, along with various other AGI prerequisites, requires an algorithm that does probabilistic programming / analysis-by-synthesis during deployment. And I think that trained Transformer models don’t do that, no matter what their size and parameters are. I guess I should write a post about why I think that—it’s a bit of a hazy tangle of ideas in my mind right now. :-)
(I’m more-or-less saying the interpretation you disagree with in your second-to-last paragraph.)
Thanks again for explaining!