When I prompt GPT-5, it’s already out of distribution, because the training data mostly isn’t GPT prompts, and none of it is GPT-5 prompts. If I prompt with “this is a rap battle between Dath Ilan and Earthsea”, that’s not a high-likelihood sentence in the training data. And then the response is also out of distribution, because the training data mostly isn’t GPT responses, and none of it is GPT-5 responses.
So why do we think that the responses are further out of distribution than the prompts?
Possible answer: because we try to select prompts that work well, with human ingenuity and trial and error, so they will tend to work better and be effectively closer to the distribution, whereas the responses are not filtered in the same way.
But the responses are optimized only to be in distribution, whereas the prompts are also optimized for achieving some human objective, like generating a funny rap battle. So once the optimizer achieves some threshold of reliability, the error rate should go down as text is generated, not up.
“Being out of distribution” is not a yes-or-no matter but a continuum. I agree that all prompts given to GPT are slightly out of distribution simply by virtue of being prompts to a language model, but the length of a prompt is generally not large enough for GPT to really be sure of that. If I give you 3 sentences of a made-up physics book introduction, you might guess that no textbook actually starts with those 3 sentences… but that’s just not enough information to be sure. However, if I give you 5 pages, you have enough information to tell whether this really is a physics textbook or not.
The point is that sequence length matters: the internet is probably large enough to populate the space of 200-token (number pulled out of my ass) text sequences densely enough that GPT can extrapolate to most other sequences of that length, but things gradually change as the sequences get longer. And certainly by the time you get to book length or longer, any sequence that GPT could generate will be so far out of distribution that it will be complete gibberish.
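One crude way to put a number on this, if anyone wants to try: look at a model’s own per-token log-likelihoods as a function of position in a long generated continuation and see whether they drift downward. This is only a sketch of the idea, using GPT-2 via the transformers library as a stand-in (we obviously can’t read GPT-5’s likelihoods, and the 50-token window size is as arbitrary as my 200 above):

```python
# Sketch: does a model's own continuation drift toward lower-likelihood text as it grows?
# Assumptions: transformers + torch installed; GPT-2 as a stand-in for the real model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

prompt = "This is a rap battle between Dath Ilan and Earthsea.\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Generate a long continuation, then score every token of it under the same model.
    generated = model.generate(
        **inputs, max_new_tokens=400, do_sample=True, top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    logits = model(generated).logits

# Log-probability of each realized token given everything before it.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_lp = log_probs.gather(2, generated[:, 1:].unsqueeze(-1)).squeeze(-1)[0]

# Mean log-likelihood over consecutive 50-token windows: if generation drifts
# out of distribution, later windows should look systematically less likely.
window = 50
for start in range(0, token_lp.numel() - window + 1, window):
    chunk = token_lp[start:start + window]
    print(f"tokens {start:4d}-{start + window:4d}: mean logprob {chunk.mean().item():.3f}")
```

Of course a model tends to assign decent likelihood to its own samples, so this is at best a proxy; a cleaner version would score the same continuation under a different, larger model.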
Could we agree on a testable prediction of this theory? Take the chess degradation example: I think your argument predicts that if we play several games of chess against ChatGPT in a row, its performance will keep going down in later games, in terms of both quality and legality, potentially to the point where the last attempt is complete gibberish. Would that be a good test?
Certainly I would agree with that. In fact, right now I can’t even get ChatGPT to play a single game of chess (against Stockfish) from start to finish without it at some point outputting an illegal move. I expect that future versions of GPT will be coherent for longer, but I don’t expect GPT to suddenly “get it” and be able to play legal and coherent chess for sequences of arbitrary length. (Google tells me that a typical chess game runs about 40 moves, so maybe Go would be a better choice, with a typical game running around 150 moves.) And certainly I don’t expect GPT to be able to play chess AND also write coherent chess commentary between each move, since that would greatly increase the timescale of required coherence.
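If anyone wants to actually run that test, here is roughly the harness I have in mind. To be clear, this is only a sketch: python-chess does the legality checking, query_model() is a hypothetical placeholder for however you reach ChatGPT, and I’ve left out the Stockfish opponent and just let the model play both sides, which is enough for measuring how long it stays legal:

```python
# Sketch of the proposed test: several games in a row inside one growing transcript,
# recording for each game how many legal plies the model manages before its first
# illegal move. Assumptions: python-chess installed; query_model() is a hypothetical
# placeholder that returns the model's next move in SAN (e.g. "Nf3").
import chess

def query_model(transcript: str) -> str:
    """Placeholder: send the conversation so far, return the model's next move."""
    raise NotImplementedError("wire this up to your ChatGPT access")

def play_games_in_one_conversation(n_games: int = 5, max_plies: int = 200) -> list[int]:
    transcript = "We will play several chess games in a row. Reply with one SAN move at a time.\n"
    plies_before_breakdown = []
    for game in range(n_games):
        board = chess.Board()
        transcript += f"\nGame {game + 1}:\n"
        for ply in range(max_plies):
            move = query_model(transcript).strip()
            try:
                board.push_san(move)  # raises ValueError on illegal or unparseable moves
            except ValueError:
                plies_before_breakdown.append(ply)  # first illegal move: record and move on
                break
            transcript += move + " "
            if board.is_game_over():
                plies_before_breakdown.append(ply + 1)
                break
        else:
            plies_before_breakdown.append(max_plies)
    return plies_before_breakdown

# The prediction under test: this list should trend downward, i.e. later games
# break down (first illegal move) earlier than the games before them.
print(play_games_in_one_conversation())
```

For the quality half of the prediction (not just legality), you would want to feed the resulting positions to Stockfish for evaluation, but legality alone already gives a clean per-move signal.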
Did you mean GPT-4 here? (Or are you from the future :-)
Just a confusing writing choice, sorry. Either it’s the timeless present tense or it’s a grammar error, take your pick.
GPT-4 was privately available within OpenAI long before it was publicly released. It’s not necessary to be from the future to be able to interact with GPT-5 before it’s publicly released.
Okay, but I’m still wondering if Randall is claiming he has private access, or is it just a typo?
Edit: looks like it was a typo?
At MIT, Altman said the letter was “missing most technical nuance about where we need the pause” and noted that an earlier version claimed that OpenAI is currently training GPT-5. “We are not and won’t for some time,” said Altman. “So in that sense it was sort of silly.”
https://www.theverge.com/2023/4/14/23683084/openai-gpt-5-rumors-training-sam-altman