I said that if, for any full context window that contains 2048 tokens of enwik8, it fails to predict the next token, that means it fails at the task of “output the whole string of enwik8 in one go”.
It was not a very interesting statement on what a language model cannot do—all it means is “GPT has not literally memorized enwik8 to the point where it can regurgitate the whole thing perfectly”.
> It’s just that you would need a different adversarial prompt to program for the next token after that, and then the next, and so on, in what is presumably some sort of sliding window over all of Wikipedia.
I’m not entirely sure I understand what you’re saying here.
Let’s say we have the text “In 2004, the 34-year-old Agassi won the [[Cincinnati Masters]] to bring his career total to 59 top-level singles titles”, which is a sequence of tokens that occurs within enwik8.
Would the requirement of “a different adversarial prompt to program for the next token after that” be fulfilled by prompting with “InInIn[...]InInIn”, then “ 2004 2004 2004[...] 2004 2004 2004”, then “,,,[...],,,”, and so on? If so, I do concur that that is a thing you can do. There (infamously) exist tokens which GPT-3 can’t repeat back, but I am pretty sure none of them occur within enwik8.
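For concreteness, here’s a minimal sketch of what that per-token repetition scheme could look like, assuming GPT-3’s r50k_base encoding via the tiktoken library; the `repetition_prompt` helper is hypothetical, just one way of realizing “a different adversarial prompt per token”:

```python
# Hypothetical sketch: build one repetition prompt per token of the target
# string, assuming GPT-3's r50k_base encoding (via tiktoken).
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # GPT-3's 50,257-token vocabulary

text = ("In 2004, the 34-year-old Agassi won the [[Cincinnati Masters]] "
        "to bring his career total to 59 top-level singles titles")
tokens = enc.encode(text)

def repetition_prompt(token_id: int, reps: int = 2047) -> str:
    """Fill the context window with copies of a single token, so that the
    most likely continuation is (hopefully) that same token again."""
    return enc.decode([token_id] * reps)

# One adversarial prompt per token -- a sliding sequence of single-token
# "programs" rather than one fixed prompt.
prompts = [repetition_prompt(t) for t in tokens]
print(prompts[0][:30])  # "InInInIn..." for the first token
```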
When you say
> while to be interesting in any substantive sense you’d need it to be a single fixed prompt which elicits accurate predictions of every token
I am interpreting that as a claim that
> There exists some single template T (of length k tokens), such that if you take the (n-(2048-k))th through (n-1)th tokens of enwik8, populate T with those tokens, and then feed the resulting string into GPT-3, the predicted next token will be token n of enwik8.
If that’s what you’re saying, I agree that that’s a more interesting question, and probably closer to the intent of OP.
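To pin down what “populate T with those tokens” means, here’s a sketch in the same spirit; modeling the template as a fixed prefix and suffix, and the `predict()` stand-in for a greedy next-token query, are both my own assumptions rather than anything from the thread:

```python
# Hypothetical formalization of the single-fixed-template claim. The template
# T is modeled as a fixed prefix and suffix surrounding a slot for the
# enwik8 window; predict() is a stand-in for a greedy next-token call.
def fill_template(T_prefix: list[int], T_suffix: list[int],
                  enwik8_tokens: list[int], n: int,
                  ctx: int = 2048) -> list[int]:
    """Populate T with the (n-(ctx-k))th through (n-1)th tokens of enwik8."""
    k = len(T_prefix) + len(T_suffix)
    return T_prefix + enwik8_tokens[n - (ctx - k):n] + T_suffix

# The claim is then: there exists some (T_prefix, T_suffix) such that,
# for every position n,
#     predict(fill_template(T_prefix, T_suffix, enwik8_tokens, n))
#         == enwik8_tokens[n]
```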
Although even with that question I am still quite certain that the answer is “no”: even the best predictor tested (text-davinci-003) still takes 0.56 bits per token, and enwik8 is a bit over 29M tokens, so I’d expect it to take somewhere in the ballpark of 16M bits to encode the entirety of enwik8. Meanwhile, there are only 50258^2048 possible template patterns (50257 tokens plus “empty space” for each position in the context), which means you only get about 32k bits of information with which to influence the model into correctly making 16M bits’ worth of decisions.
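The arithmetic behind those two numbers is straightforward to check:

```python
# Back-of-the-envelope check of the two figures above.
import math

bits_per_token = 0.56      # text-davinci-003's loss on enwik8
num_tokens = 29e6          # enwik8 is a bit over 29M tokens
print(bits_per_token * num_tokens)   # ~1.6e7, i.e. ~16M bits to encode

# Each of the 2048 context positions holds one of 50257 tokens or is empty,
# giving 50258^2048 possible templates:
print(2048 * math.log2(50258))       # ~32,000 bits of template information
```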