“Being out of distribution” is not a yes-or-no property but a continuum. I agree that every prompt given to GPT is slightly out of distribution simply by virtue of being a prompt to a language model, but a prompt is generally too short for GPT to be sure of that. If I give you 3 sentences of a made-up physics book introduction, you might guess that no textbook actually starts with those 3 sentences… but that’s just not enough information to be sure. If I give you 5 pages, however, you have enough information to tell whether this is really a physics textbook or not.
The point is that sequence length matters. The internet is probably large enough to populate the space of 200-token (number pulled out of my ass) text sequences densely enough that GPT can extrapolate to most other sequences of that length, but things gradually change as the sequences get longer. And certainly by the time you get to book length or longer, any sequence that GPT could generate will be so far out of distribution that it will be complete gibberish.
Could we agree on a testable prediction of this theory? Take the chess degradation example. I think your argument predicts that if we play several games of chess against ChatGPT in a row, its performance will keep going down in later games, in terms of both quality and legality, potentially to the point that the last attempt is complete gibberish. Would that be a good test?
Certainly I would agree with that. In fact, right now I can’t even get ChatGPT to play a single game of chess (against Stockfish) from start to finish without it outputting an illegal move at some point. I expect that future versions of GPT will be coherent for longer, but I don’t expect GPT to suddenly “get it” and be able to play legal and coherent chess for sequences of arbitrary length. (Google tells me that a typical chess game is about 40 moves long, so maybe Go would be a better choice, with a typical game running to around 150 moves.) And certainly I don’t expect GPT to be able to play chess AND also write coherent chess commentary between each move, since that would greatly increase the timescale of required coherence.
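For what it's worth, here is a rough sketch of how the legality half of that test could be scored. It assumes each game has already been logged as a list of booleans (True for a legal move, False for an illegal one) by some rules checker; that logging step is not shown. The names `legal_streak` and `trend_slope` are made up for illustration. If the degradation prediction holds, the slope of "moves before the first illegal move" across consecutive games should come out negative.

```python
from statistics import mean


def legal_streak(flags):
    """Count the legal moves played before the first illegal one.

    `flags` is a hypothetical per-game log: True = legal, False = illegal.
    """
    for i, ok in enumerate(flags):
        if not ok:
            return i
    return len(flags)


def trend_slope(values):
    """Least-squares slope of `values` against game index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two games to estimate a trend")
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den
```

To use it, compute `legal_streak` for each game in the order the games were played and pass the resulting list to `trend_slope`. A handful of games won't say much on their own, so this is only a first-pass summary, not a significance test.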