I don’t think humans would put much probability on such sequences, even conditionally: we’d think that at some point the sequence would stop, because why would there be such gibberish?
I think the intuition behind your remark “why would there be such gibberish?” actually goes most of the way to explaining the repetition trap.
The key thing about pathologically repetitive sequences is that they are . . . pathologically repetitive, i.e. out-of-distribution for natural text.
Once you’re already in one, I don’t think it’s really so obvious that the repetition should eventually stop. Yes, that’s what a human writer would do—but a human writer wouldn’t have produced the conditioning sequence to begin with.
We start out with a prior that puts high weight on “this belongs to some natural genre of text,” and low weight on “this belongs to a weird hyper-repetitive ‘genre’ of text.” But eventually, after enough bad predictions from the former and enough accurate predictions from the latter, we really ought to yield to the evidence and update. Eventually it should become clear that the question “why would there be such gibberish?” has some answer, since we keep observing “such gibberish” and not anything else.
But why does LM sampling enter the trap to begin with? I think there needs to be some “initial misstep,” where a sampled token makes the text just a bit too repetitive. This makes further repetition more likely (because the text is oddly repetitive) and everything else less likely (because the text is odd / OOD), so further repetition occurs, which makes the text more OOD and makes repetition a steadily better bet, and so on.
In other words, repetition is special because it’s a way of going off-distribution where there is, nonetheless, a single “obvious” way to continue the text, and continuing it thus will keep you in the same off-distribution region. Whereas most ways of going off-distribution are just confusing, and don’t have a legible structure the LM would have learned from in-distribution training.
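The feedback loop I'm describing can be made concrete with a toy simulation (purely illustrative, with invented numbers; not a real LM): each step either repeats the last token or emits a fresh one, and every repetition makes the text more OOD, which I model as a boost to the repeat probability on the next step.

```python
import random

def sample_sequence(steps=50, base_repeat_p=0.05, boost=0.15, seed=0):
    """Toy model of the repetition feedback loop. Returns True if the
    run falls into the 'trap' (10 consecutive repeats)."""
    rng = random.Random(seed)
    repeat_p = base_repeat_p
    repeats = 0
    for _ in range(steps):
        if rng.random() < repeat_p:
            repeats += 1
            # Repetition begets repetition: the text is now more OOD,
            # so further repetition becomes a steadily better bet.
            repeat_p = min(1.0, repeat_p + boost)
        else:
            # A fresh token restores the "natural genre" prior.
            repeat_p = base_repeat_p
            repeats = 0
        if repeats >= 10:
            return True
    return False

trap_rate = sum(sample_sequence(seed=s) for s in range(1000)) / 1000
print(f"fraction of runs that fall into the trap: {trap_rate:.2f}")
```

With the boost set to zero, ten consecutive repeats at the base rate are astronomically unlikely; with the boost, a small but steady fraction of runs get trapped, even though the "initial misstep" itself is rare.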
I would expect scale to lower the probability of the “initial mistake,” and thus reduce the fraction of samples that are repetitive (is this borne out in practice?). I don’t expect scale to make LMs stop assigning high likelihood to repetitive continuations of unnaturally repetitive prefixes, since I think that’s a conditionally correct judgment.
For practical purposes, I’ve found my custom sampler to be a pretty effective solution, though sometimes the LM still “wins the fight,” as in this amusing example.
Do you have a source for iGPTs not exhibiting the repetition trap? Not that I don’t believe you, I just would have expected otherwise, so I’m curious.
But why does LM sampling enter the trap to begin with? I think there needs to be some “initial misstep,” where a sampled token makes the text just a bit too repetitive. This makes further repetition more likely (because the text is oddly repetitive) and everything else less likely (because the text is odd / OOD), so further repetition occurs, which makes the text more OOD and makes repetition a steadily better bet, and so on.
I think that’s a possible interpretation. I’m still not sure why it wouldn’t affect all the other possible models, though, and it seems like we should also see the problem get better with model scaling as the ‘misstep’ mis-estimation disappears. If you are sampling token by token, the probabilities from GPT-3 over the 51k BPEs ought to be much better than GPT-2’s (also 51k BPEs, also English text scraped from the Internet), etc.: after all, the next token is the one it has the very highest predictive accuracy on. How accurate does a model have to get on that initial token before the initial misstep stops screwing up not just tree search, but regular sampling too?
I would expect scale to lower the probability of the “initial mistake,” and thus reduce the fraction of samples that are repetitive (is this borne out in practice?).
It doesn’t really seem like it. I think if you have that impression, it’s because we use sampling strategies designed specifically to eliminate it, like top_p. Nucleus sampling definitely does tamp down on it, but I don’t think it’s perfect, and it’s clearly a hack which doesn’t fix the problem with tree search and introduces biases of its own, just like top-k. Regular sampling still seems to go haywire. (I dunno if anyone has checked GPT-3 the way the nucleus sampling paper and others checked GPT-2 and others. Would get expensive.)
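For reference, top_p/nucleus sampling is just tail truncation: keep the smallest set of tokens whose cumulative probability covers top_p, renormalize, and sample within that set. A minimal sketch (the example distribution is invented), which shows why it tamps down on the trap: the rare "misstep" tokens live in the tail that gets cut.

```python
import numpy as np

def nucleus_sample(probs, top_p=0.9, rng=None):
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability reaches top_p, renormalize, and sample."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1   # smallest nucleus covering top_p
    nucleus = order[:cutoff]
    p = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return int(rng.choice(nucleus, p=p))

# e.g. with probs = [0.5, 0.3, 0.15, 0.05] and top_p = 0.9, the tail
# token (p = 0.05) is cut from the nucleus and can never be sampled.
```

The bias it introduces is visible here too: tokens inside the nucleus get their probabilities inflated by the renormalization, and tail tokens get exactly zero, however the true distribution tapers.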
Do you have a source for iGPTs not exhibiting the repetition trap? Not that I don’t believe you, I just would have expected otherwise, so I’m curious.
I’ve seen hundreds of iGPT completions and random examples, and not a single one ever just ‘starts repeating’ ad nauseam; nor has anyone ever pointed such failure cases out. (In contrast, the ‘tessellation’ that naive CLIP maximization causes is extremely obvious, and you can’t sample GPT on naive non-top-p/repetition-penalization settings without running into it well before hundreds/thousands of examples.) Maybe I’m wrong and there is repetition at some level which isn’t obvious to the naked eye, like high-frequency Fourier components (although I’m not sure how that would be possible with the superpixel approach). If someone shows me iGPT (or DALL-E/CogView, or DDIM, or...) samples which are clearly the repetition trap in action, I’ll change my mind, but it’s been years now, so I’m not holding my breath.
If repetitions arise from sampling merely due to high conditional probability given an initial “misstep”, they should be avoidable in an MCTS that sought to maximize the unconditional probability of the output sequence (or rather, conditional upon its input but not upon its own prior output). After entering the “trap” once or a few times, it would simply avoid the unfortunate misstep in subsequent “playouts”. From my understanding, that is.
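The arithmetic behind this idea can be sketched with two toy branches (invented numbers, not a real MCTS): the trap branch opens with a low-probability misstep token and then repeats with near-certainty, while the normal branch opens with a likely token followed by ordinarily predictable text. Scored token by token from inside the trap, repetition looks unbeatable; scored as whole sequences, the misstep drags the trap branch below the normal one, so a sequence-level search would prune it at the root.

```python
import math

# Per-token probabilities along each continuation (illustrative numbers).
trap_branch   = [0.01] + [0.99] * 9   # misstep, then near-certain repeats
normal_branch = [0.60] + [0.70] * 9   # likely opener, ordinary continuations

def total_logprob(per_token_probs):
    """Score the whole continuation, not each token given the previous ones."""
    return sum(math.log(p) for p in per_token_probs)

# Once entered, the trap offers 0.99 per step, but the 0.01 misstep
# makes the whole sequence less probable than the normal branch:
print(total_logprob(trap_branch))    # about -4.70
print(total_logprob(normal_branch))  # about -3.72
```

Note the outcome hinges on just how improbable the misstep is relative to how confidently the model repeats afterward; with a likelier misstep or longer, more confident repetition, the trap branch can win on total probability too, which may bear on whether tree search actually escapes it in practice.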