There are a few things that resemble what I think you might be asking for.
The first thing it sounds like you may be asking is as follows:
There exists a file enwik8, which contains 10^8 bytes of English Wikipedia. The language models everyone has been playing with have almost certainly been fed all or most of that content. Does there exist some prompt which will cause those models to emit the entirety of enwik8 with high probability?
The answer is a flat “no” for the entirety of enwik8 for any of the main OpenAI models, which can be verified by passing in a random full-context-worth (2048 tokens) of enwik8 and verifying that the completion is not “the rest of enwik8”. The models cannot look back before their context, so if any full context in the middle fails to generate the next part, we can conclude that it won’t generate the full thing on its own for any initial prompt. Which is not too surprising.
The second thing, and the thing I expect you mean, is
Can we compute how well these language models encode enwik8?
to which the answer is “yes”. It is possible to get a measure of bits per character for a given model to generate a given sequence of tokens, which equally means it is possible to generate that exact text using the model and the corresponding number of bits. Some examples:
A model that said all bytes are equally likely to occur would cost 8 bits per byte, or ~8 bits per char (since it’s mostly ASCII).
Using the character frequency observed in enwik8, and nothing else, would cost 5.17 bits per char, with a model that weighs a few kB (markov_chars_0 below).
Tokenizing using the same tokenizer GPT-2 uses, and going by token frequency, gets you down to 3.18 bits per char, at the cost of tens to hundreds of kB of model (markov_toks_0 below).
Building a simple Markov model that looks at the last 3 chars to get your next char gets you down to 2.48 bits per char. But storing a 4-char sequence + a 3-byte int for each 4-char sequence that occurs now weighs almost 10 MB (markov_chars_3 below).
Building a Markov model of tokens (again using the GPT-2 tokenizer) can drop you down to 0.66 bits per char, below Shannon’s lower estimate. Except that now your model is recording counts for 16 million distinct token sequences, which will weigh in at something on the order of 180 MB (markov_toks_3 below).
Likewise, using OpenAI’s text-davinci-003 can get you down to 0.56 bits per token. But the model itself weighs several GB (openai:text-davinci-003 below).
Anyway, I threw together a quick script to get approximate bits-per-char to reproduce enwik8, excluding model size. Here are the results, high-to-low (std deviation across 10 trials in parens). Keep in mind that “store the entirety of enwik8 uncompressed” costs 100 MB.
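For concreteness, here is a minimal sketch of how the markov_chars_0 and markov_chars_3 style numbers can be computed. It is not the actual script behind the results above, it uses raw in-sample counts with no smoothing, and it assumes enwik8 sits uncompressed in the working directory, so expect the same ballpark rather than the exact figures:

    import math
    from collections import Counter, defaultdict

    with open("enwik8", "rb") as f:
        data = f.read()  # 10^8 bytes of English Wikipedia
    total = len(data)

    # Order-0 model (markov_chars_0): code each byte with -log2 of its frequency.
    freq = Counter(data)
    bits0 = sum(c * -math.log2(c / total) for c in freq.values())
    print(f"markov_chars_0: {bits0 / total:.2f} bits per char")  # reported above: 5.17

    # Order-3 model (markov_chars_3): condition each byte on the previous 3 bytes.
    ctx_counts = defaultdict(Counter)
    for i in range(3, total):
        ctx_counts[data[i - 3:i]][data[i]] += 1
    bits3 = 0.0
    for nxt in ctx_counts.values():
        n = sum(nxt.values())
        bits3 += sum(c * -math.log2(c / n) for c in nxt.values())
    print(f"markov_chars_3: {bits3 / total:.2f} bits per char")  # reported above: 2.48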
If this was what you had in mind, you would probably also be interested in the post Chinchilla’s wild implications, which goes a lot deeper into this sort of stuff.
if any full context in the middle fails to generate the next part, we can conclude that it won’t generate the full thing on its own for any initial prompt. Which is not too surprising.
Strictly speaking, it would be surprising, because it would mean that there are no adversarial examples or possible prompt-tuning which can produce specified arbitrary text, even text that is of a very high starting likelihood like a WP entry. And this despite being able to work with ~50k^2048 possible inputs.
(It would be like saying that a face-generating GAN can’t produce a specific face that it was trained on, no matter how hard you optimized the z or how vastly larger the dimensionality of the z is compared to the true face manifold. If you showed me a StyleGAN face-generator which had been trained on Obama, and showed that it completed half of Obama’s face with a different face, I would not be surprised; I would be very surprised if you told me you had somehow proven that not only did it not infill Obama on that specific example (fine, unsurprising), but that there was no infill or possible z whatsoever that could generate Obama’s face (extremely surprising).)
Strictly speaking, it would be surprising, because it would mean that there are no adversarial examples or possible prompt-tuning which can produce specified arbitrary text, even text that is of a very high starting likelihood like a WP entry
I don’t think it’s surprising that there are sequences of tokens such that there is no amount of prompt tuning you can do to generate such a sequence of tokens, as long as you allow the sequence to be longer than the context window of the model, and in fact I think it would be surprising if this was a thing you could reliably do.
For example, let’s build a toy model where the tokens are words, and the context length is 10. This model has been trained on the following string.
On Monday I went to work and then I returned home. On Tuesday I went to work and then I returned home. On Wednesday I went to work and then I returned home.
If you start with the string
On Monday I went to work and then I returned
the model will complete with “home.”, at which point the context is
Monday I went to work and then I returned home.
and the model completes with “On”, changing the context to
I went to work and then I returned home. On
If the model completes with “Tuesday” it will end up stuck in a loop of
On Tuesday I went to work and then I returned home. On Tuesday I went to work and then I returned home. On Tuesday I went to work and then I returned home.
which does not succeed at the task of “complete the input”. The outcome is similar if it chooses to complete with “Wednesday” instead.
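To make that concrete, here is a hypothetical little simulation of the toy model: word-level tokens, a 10-word context, and greedy completion based on which word most often followed that context in the training string. Once the context has slid past “Monday”, the model can only loop:

    from collections import Counter, defaultdict

    training = ("On Monday I went to work and then I returned home. "
                "On Tuesday I went to work and then I returned home. "
                "On Wednesday I went to work and then I returned home.").split()

    CONTEXT = 10
    counts = defaultdict(Counter)
    for i in range(len(training) - CONTEXT):
        counts[tuple(training[i:i + CONTEXT])][training[i + CONTEXT]] += 1

    def complete(prompt, steps=30):
        words = prompt.split()
        for _ in range(steps):
            ctx = tuple(words[-CONTEXT:])
            if ctx not in counts:
                break
            # Greedily pick whichever word most often followed this 10-word context.
            words.append(counts[ctx].most_common(1)[0][0])
        return " ".join(words)

    print(complete("On Monday I went to work and then I returned"))
    # ...returned home. On Tuesday I went to work and then I returned home. On
    # Tuesday I went to work and then I returned home. On Tuesday ... (it loops on
    # whichever weekday it picks first; "Monday" never comes back)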
This happens because the task is specifically “output the entirety of enwik8”, where “the entirety of enwik8” is a string that is much longer than the context window. No matter what your initial prompt was, once the prompt has succeeded at the task of “output the next 2048 tokens of enwik8”, your prompt no longer has any causal impact on the completion—“feed the model a prompt that causes it to output the first 2048 tokens of enwik8 and ask it for completions” and “feed the model the first 2048 tokens of enwik8 and ask it for completions” are operations which yield the same result.
If we had a GPT-X which somehow had a non-bounded context window, I suspect it would be possible to get it to output the full text of enwik8 (I expect that one way to do this would be “fill its context window with enwik8 repeated a bunch of times in a row”).
I don’t think it’s surprising that there are sequences of tokens such that there is no amount of prompt tuning you can do to generate such a sequence of tokens, as long as you allow the sequence to be longer than the context window of the model, and in fact I think it would be surprising if this was a thing you could reliably do.
But that’s a different question from trying to generate a WP article’s next token, which is what you defined as failure. You said even with the full context, it couldn’t predict the next token. I’m pointing out that you almost certainly can find an adversarial prompt (plus shortened prefix) which programs it to predict the right next token. It’s just that you would need a different adversarial prompt to program for the next token after that, and then the next, and so on, in what is presumably some sort of sliding window over all of Wikipedia, while to be interesting in any substantive sense you’d need it to be a single fixed prompt which elicits accurate predictions of every token. The question is whether there is one fixed prompt, not whether there is any prompt which would’ve correctly predicted the next token for your specific article example. The latter is almost certainly true (demonstrating a weakness in your counterexample), and the former almost certainly false.
I said that if there is any full context window of 2048 tokens of enwik8 for which it fails to predict the next token, that means it fails at the task of “output the whole string of enwik8 in one go”.
It was not a very interesting statement on what a language model cannot do—all it means is “GPT has not literally memorized enwik8 to the point where it can regurgitate the whole thing perfectly”.
It’s just that you would need a different adversarial prompt to program for the next token after that, and then the next, and so on, in what is presumably some sort of sliding window over all of Wikipedia
I’m not entirely sure I understand what you’re saying here.
Let’s say we have the text “In 2004, the 34-year-old Agassi won the [[Cincinnati Masters]] to bring his career total to 59 top-level singles titles”, which is a sequence of tokens that occurs within enwik8.
Would the requirement of “a different adversarial prompt to program for the next token after that” be fulfilled by prompting with “InInIn[...]InInIn”, then “ 2004 2004 2004[...] 2004 2004 2004”, then “,,,[...],,,”, and so on? If so, I do concur that that is a thing you can do. There (infamously) exist tokens which GPT-3 can’t repeat back, but I am pretty sure none of them occur within enwik8.
When you say
while to be interesting in any substantive sense you’d need it to be a single fixed prompt which elicits accurate predictions of every token
I am interpreting that as a claim that
There exists some single template T (of length k tokens), such that if you take the (n-(2048-k))th through (n-1)th tokens of enwik8, populate T with those tokens, and then feed the resulting string into GPT-3, the predicted next token will be token n of enwik8.
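Or, as a sketch in code (purely illustrative of how I’m reading the claim; model_next_token stands in for an actual GPT-3 call and is not a real API):

    def template_predicts_enwik8(T, enwik8_tokens, model_next_token):
        # T is a length-2048 list in which each slot is either a fixed token id
        # or None, meaning "fill this slot from the preceding enwik8 tokens".
        k = sum(tok is not None for tok in T)   # number of fixed template tokens
        window = len(T) - k                     # 2048 - k enwik8 tokens per prompt
        for n in range(window, len(enwik8_tokens)):
            filler = iter(enwik8_tokens[n - window:n])
            prompt = [tok if tok is not None else next(filler) for tok in T]
            if model_next_token(prompt) != enwik8_tokens[n]:
                return False                    # the template fails at position n
        return True                             # the template elicits all of enwik8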
If that’s what you’re saying, I agree that that’s a more interesting question, and probably closer to the intent of OP.
Although even with that question, I am still quite certain that the answer is “no”: even the best predictor tested (text-davinci-003) still takes 0.56 bits per token, and enwik8 is a bit over 29M tokens, so I’d expect it to take somewhere in the ballpark of 16M bits to encode the entirety of enwik8. Meanwhile, there are only 50258^2048 possible template patterns (50257 tokens plus “empty space” for each position in the context), which means you only get about 32k bits of information with which to influence the model to correctly make 16M bits’ worth of decisions.
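A quick sanity check of that arithmetic (same numbers as above):

    import math

    bits_needed    = 29e6 * 0.56              # ~1.62e7: bits to pin down 29M tokens
    bits_in_prompt = 2048 * math.log2(50258)  # ~3.20e4: bits in any single template
    print(f"{bits_needed:.3g} bits needed vs {bits_in_prompt:.3g} bits available")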
Regarding the first one, I am not expecting a single prompt to generate the entirety of enwik8/9. I am more interested in finding a set of prompts, with a lookup table if possible, to replicate the enwik data.
Thanks for the pointer to the Chinchilla post, will look into it.