Sutskever’s response to Dwarkesh in their interview was a convincing refutation of this argument for me:
Dwarkesh Patel
So you could argue that next-token prediction can only help us match human performance and maybe not surpass it? What would it take to surpass human performance?
Ilya Sutskever
I challenge the claim that next-token prediction cannot surpass human performance. On the surface, it looks like it cannot. It looks like if you just learn to imitate, to predict what people do, it means that you can only copy people. But here is a counter argument for why it might not be quite so. If your base neural net is smart enough, you just ask it — What would a person with great insight, wisdom, and capability do? Maybe such a person doesn’t exist, but there’s a pretty good chance that the neural net will be able to extrapolate how such a person would behave. Do you see what I mean?
Dwarkesh Patel
Yes, although where would it get that sort of insight about what that person would do? If not from…
Ilya Sutskever
From the data of regular people. Because if you think about it, what does it mean to predict the next token well enough? It’s actually a much deeper question than it seems. Predicting the next token well means that you understand the underlying reality that led to the creation of that token. It’s not statistics. Like it is statistics but what is statistics? In order to understand those statistics to compress them, you need to understand what is it about the world that creates this set of statistics? And so then you say — Well, I have all those people. What is it about people that creates their behaviors? Well they have thoughts and their feelings, and they have ideas, and they do things in certain ways. All of those could be deduced from next-token prediction. And I’d argue that this should make it possible, not indefinitely but to a pretty decent degree to say — Well, can you guess what you’d do if you took a person with this characteristic and that characteristic? Like such a person doesn’t exist but because you’re so good at predicting the next token, you should still be able to guess what that person who would do. This hypothetical, imaginary person with far greater mental ability than the rest of us
I respect Sutskever a lot, but if he believed that he could get an equivalent world model by spending an equivalent amount of compute on next-token prediction over any other set of real-world data samples, why would he and his colleagues go to such lengths to obtain human-generated text specifically for training? They might as well make lots of random recordings (e.g., video, audio, radio signals) and pump them all into the model. In principle that could probably work, but it would be very inefficient.
Human language is a very high-density encoding of world models, so by training on human language, models get much of their world model “for free”: humanity has already done a lot of the pre-work by sampling reality in a wide variety of ways and compressing it into the structure of language. However, our use of language still doesn’t capture all of reality exactly, and I would argue it’s not even close. (Saying otherwise is equivalent to saying we’ve already discovered almost all possible capabilities, which would entail that AI actually has a hard cap at roughly human ability.)
In order to expand its world model beyond human ability, AI has to sample reality itself, which is much less sample-efficient than sampling human behavior, hence the “soft cap”.