if any full context in the middle fails to generate the next part, we can conclude that it won’t generate the full thing on its own for any initial prompt. Which is not too surprising.
Strictly speaking, it would be surprising, because it would mean that there are no adversarial examples or possible prompt-tuning which can produce specified arbitrary text, even text that is of a very high starting likelihood like a WP entry. And this despite being able to work with ~50k^2048 possible inputs.
(It would be like saying that a face-generating GAN can’t produce a specific face that it was trained on, no matter how hard you optimized the z or how much larger the dimensionality of the z is than the true face manifold. If you showed me a StyleGAN face generator which had been trained on Obama, and it completed half of Obama’s face with a different face, I would not be surprised; I would be very surprised if you told me you had somehow proven that not only did it not infill Obama on that specific example (fine, unsurprising), but that there was no infill or possible z whatsoever with which it could generate Obama’s face (extremely surprising).)
Strictly speaking, it would be surprising, because it would mean that there are no adversarial examples or possible prompt-tuning which can produce specified arbitrary text, even text that is of a very high starting likelihood like a WP entry.
I don’t think it’s surprising that there are sequences of tokens such that there is no amount of prompt tuning you can do to generate such a sequence of tokens, as long as you allow the sequence to be longer than the context window of the model, and in fact I think it would be surprising if this was a thing you could reliably do.
For example, let’s build a toy model where the tokens are words, and the context length is 10. This model has been trained on the following string.
On Monday I went to work and then I returned home. On Tuesday I went to work and then I returned home. On Wednesday I went to work and then I returned home.
If you start with the string
On Monday I went to work and then I returned
the model will complete with “home.”, at which point the context is
Monday I went to work and then I returned home.
and the model completes with “On”, changing the context to
I went to work and then I returned home. On
If the model completes with “Tuesday” it will end up stuck in a loop of
On Tuesday I went to work and then I returned home. On Tuesday I went to work and then I returned home. On Tuesday I went to work and then I returned home.
which does not succeed at the task of “complete the input”. The outcome is similar if it chooses to complete with “Wednesday” instead.
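The toy model above can be sketched in a few lines (a minimal illustration, not GPT: “training” here just memorizes which word follows each context seen, and ties are broken by taking the first continuation found in the training string, which stands in for the model’s arbitrary choice of “Tuesday” over “Wednesday”):

```python
# Toy word-level model with a 10-word context window, "trained" on the
# three-day string by memorizing continuations. Greedy, first-match
# decoding stands in for an arbitrary tie-break.
training = ("On Monday I went to work and then I returned home. "
            "On Tuesday I went to work and then I returned home. "
            "On Wednesday I went to work and then I returned home.").split()

CTX = 10

def next_word(context):
    # Find the longest suffix of the context seen in training and
    # return the word that followed its first occurrence there.
    for k in range(min(CTX, len(context)), 0, -1):
        suffix = context[-k:]
        for i in range(len(training) - k):
            if training[i:i + k] == suffix:
                return training[i + k]
    return None

def complete(prompt, n_steps):
    context = prompt.split()
    for _ in range(n_steps):
        w = next_word(context[-CTX:])
        if w is None:
            break
        context.append(w)
    return " ".join(context)

out = complete("On Monday I went to work and then I returned", 30)
print(out)
```

Running this, the model emits “home. On Tuesday …” and then repeats the Tuesday sentence forever; the Wednesday sentence is never reached, exactly as in the walkthrough above.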
This happens because the task is specifically “output the entirety of enwik8”, where “the entirety of enwik8” is a string that is much longer than the context window. No matter what your initial prompt was, once the prompt has succeeded at the task of “output the next 2048 tokens of enwik8”, your prompt no longer has any causal impact on the completion—“feed the model a prompt that causes it to output the first 2048 tokens of enwik8 and ask it for completions” and “feed the model the first 2048 tokens of enwik8 and ask it for completions” are operations which yield the same result.
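The screening-off point follows from nothing more than truncation: a fixed-context model is a function of only its last 2048 tokens, so any two inputs that agree on those tokens yield identical next-token distributions. A minimal sketch (the token lists here are illustrative stand-ins, not real tokenizations):

```python
# Whatever happens inside the network, this truncation happens first,
# so inputs agreeing on their last CONTEXT_LEN tokens are
# indistinguishable to the model.
CONTEXT_LEN = 2048

def model_input(tokens):
    return tuple(tokens[-CONTEXT_LEN:])

enwik8_prefix = list(range(CONTEXT_LEN))   # stand-in for enwik8's first 2048 tokens
adversarial_prompt = [-1] * 500            # stand-in for any clever prompt

# Once the window is full of enwik8, the prompt has been pushed out of
# the window and can no longer affect the completion.
assert model_input(adversarial_prompt + enwik8_prefix) == model_input(enwik8_prefix)
```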
If we had a GPT-X which somehow had an unbounded context window, I suspect it would be possible to get it to output the full text of enwik8 (I expect that one way to do this would be “fill its context window with enwik8 repeated a bunch of times in a row”).
I don’t think it’s surprising that there are sequences of tokens such that there is no amount of prompt tuning you can do to generate such a sequence of tokens, as long as you allow the sequence to be longer than the context window of the model, and in fact I think it would be surprising if this was a thing you could reliably do.
But that’s a different question from trying to generate a WP article’s next token, which is what you defined as failure. You said that even with the full context, it couldn’t predict the next token. I’m pointing out that you almost certainly can find an adversarial prompt (plus shortened prefix) which programs it to predict the right next token. It’s just that you would need a different adversarial prompt to program for the next token after that, and then the next, and so on, in what is presumably some sort of sliding window over all of Wikipedia, while to be interesting in any substantive sense you’d need it to be a single fixed prompt which elicits accurate predictions of every token. The question is whether there is one fixed prompt, not whether there is any prompt which would’ve correctly predicted the next token for your specific article example. The latter is almost certainly true (demonstrating a weakness in your counterexample), and the former almost certainly false.
I said that, if there is any full context window containing 2048 tokens of enwik8 on which it fails to predict the next token, that means that it fails at the task of “output the whole string of enwik8 in one go”.
It was not a very interesting statement on what a language model cannot do—all it means is “GPT has not literally memorized enwik8 to the point where it can regurgitate the whole thing perfectly”.
It’s just that you would need a different adversarial prompt to program for the next token after that, and then the next, and so on, in what is presumably some sort of sliding window over all of Wikipedia
I’m not entirely sure I understand what you’re saying here.
Let’s say we have the text “In 2004, the 34-year-old Agassi won the [[Cincinnati Masters]] to bring his career total to 59 top-level singles titles”, which is a sequence of tokens that occurs within enwik8.
Would the requirement of “a different adversarial prompt to program for the next token after that” be fulfilled by prompting with “InInIn[...]InInIn”, then “ 2004 2004 2004[...] 2004 2004 2004”, then “,,,[...],,,”, and so on? If so, I do concur that that is a thing you can do. There (infamously) exist tokens which GPT-3 can’t repeat back, but I am pretty sure none of them occur within enwik8.
When you say
while to be interesting in any substantive sense you’d need it to be a single fixed prompt which elicits accurate predictions of every token
I am interpreting that as a claim that
There exists some single template T (of length k tokens), such that if you take the n-(2048-k)th to n-1th tokens of enwik8, populate T with those tokens, and then feed the resulting string into GPT-3, the predicted next token will be token n of enwik8.
If that’s what you’re saying, I agree that that’s a more interesting question, and probably closer to the intent of OP.
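For concreteness, that claim could be checked mechanically with a loop like the following (a sketch only: `predict_next` is a hypothetical stand-in for a model call, not a real API, and `T_fixed` is the candidate template’s fixed tokens):

```python
# Hypothetical check of the "single fixed template" claim. T_fixed holds
# the template's k fixed tokens; the remaining (ctx - k) slots are filled
# with the preceding tokens of the target string, and the model must then
# predict the next token.
def template_elicits(T_fixed, target_tokens, predict_next, ctx=2048):
    window = ctx - len(T_fixed)
    for n in range(window, len(target_tokens)):
        prompt = T_fixed + target_tokens[n - window:n]
        if predict_next(prompt) != target_tokens[n]:
            return False  # one wrong prediction falsifies this template
    return True
```

On a toy “count upward” target, a predictor that returns the last token plus one passes this check and any other predictor fails it, which is all the check is meant to illustrate.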
Although even with that question I am still quite certain that the answer is “no” because even the best predictor tested (text-davinci-003) still takes 0.56 bits per token and enwik8 is a bit over 29M tokens, so I’d expect it to take somewhere in the ballpark of 16M bits to encode the entirety of enwik8. Meanwhile there are only 50258^2048 possible template patterns (50257 tokens plus “empty space” for each position in the context), which means you only get about 32k bits of information with which to influence the model to correctly make 16M bits worth of decisions.
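The arithmetic behind that counting argument, as a quick check (all figures taken from the paragraph above):

```python
import math

# ~0.56 bits/token for text-davinci-003 on enwik8 (a bit over 29M
# tokens), versus 50,257 vocabulary tokens plus an "empty" option for
# each of the 2048 context positions.
bits_needed = 0.56 * 29e6                   # ~16M bits to pin down every token choice
bits_in_prompt = 2048 * math.log2(50258)    # ~32k bits of freedom in one fixed template

print(f"needed: {bits_needed/1e6:.1f}M bits, available: {bits_in_prompt/1e3:.1f}k bits")
```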