But why does LM sampling enter the trap to begin with? I think there needs to be some “initial misstep,” where a sampled token makes the text just a bit too repetitive. This makes further repetition more likely (because the text is oddly repetitive) and everything else less likely (because the text is odd / OOD), so further repetition occurs, which makes the text more OOD and makes repetition a steadily better bet, and so on.
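That feedback loop can be illustrated with a toy simulation (every constant here is invented for illustration; this is not a model of any real LM, just the "one repeat makes the next repeat more likely" dynamic):

```python
import random

def simulate(steps=200, alpha=0.15, p0=0.02, seed=0):
    """Toy model of the self-reinforcing repetition loop: each 'repeat'
    token makes the next repeat more likely (the context is now more
    repetitive / more OOD), while a 'novel' token decays the repeat
    probability back toward baseline. One early misstep at probability
    p0 can snowball toward pure repetition. alpha is a made-up feedback
    strength; all numbers are placeholders, not fitted to anything."""
    rng = random.Random(seed)
    p_repeat, history = p0, []
    for _ in range(steps):
        repeated = rng.random() < p_repeat
        history.append(repeated)
        if repeated:
            # repetition reinforces itself
            p_repeat = min(1.0, p_repeat + alpha)
        else:
            # novelty slowly pulls the chain back toward baseline
            p_repeat = max(p0, p_repeat - alpha / 2)
    return history
```

Under these assumptions, most runs stay novel, but the runs that do stumble into a few early repeats tend to lock into repeating for long stretches, which matches the "absorbing trap" picture.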
I think that’s a possible interpretation. I’m still not sure why it wouldn’t affect all the other possible models, though, and it seems like we should also see the problem get better with model scaling as the ‘misstep’ mis-estimation disappears. If you are sampling token by token, the probabilities from GPT-3 over the 51k BPEs ought to be much better than GPT-2’s (also 51k BPEs, also English text scraped from the Internet), etc.: after all, the next token is the one it has the very highest predictive accuracy on. How accurate does a model have to get on that initial token before the initial misstep stops screwing up not just tree search, but regular sampling too?
I would expect scale to lower the probability of the “initial mistake,” and thus reduce the fraction of samples that are repetitive (is this borne out in practice?).
It doesn’t really seem like it. I think if you have the impression that it does, it is because we use sampling strategies designed specifically to eliminate it, like top-p. Nucleus sampling definitely does tamp down on it, but I don’t think it’s perfect, and it’s clearly a hack which doesn’t fix the problem with tree search and introduces biases of its own, just like top-k. Regular sampling still seems to go haywire. (I dunno if anyone has checked GPT-3 the way the nucleus-sampling paper and others checked GPT-2 and other models. It would get expensive.)
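For concreteness, here is a minimal sketch of what top-p (nucleus) sampling does to a single next-token distribution; the probability vector and threshold are placeholders:

```python
import numpy as np

def nucleus_sample(probs, p=0.95, rng=None):
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, and sample from that
    set. Truncating the low-probability tail is exactly the 'hack' at
    issue: it suppresses degenerate continuations but also biases the
    sampled distribution away from the model's actual one."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]        # tokens by descending probability
    cdf = np.cumsum(probs[order])
    cutoff = np.searchsorted(cdf, p) + 1   # smallest prefix with mass >= p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return rng.choice(kept, p=kept_probs)
```

With p=1.0 this reduces to regular sampling; as p shrinks, more of the tail (where the repetition-reinforcing drift lives, but also legitimate rare tokens) gets zeroed out.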
Do you have a source for iGPTs not exhibiting the repetition trap? Not that I don’t believe you, I just would have expected otherwise, so I’m curious.
I’ve seen hundreds of iGPT completions and random samples, and not a single one ever just ‘starts repeating’ ad nauseam; nor has anyone ever pointed out such failure cases. (In contrast, the ‘tessellation’ that naive CLIP maximization causes is extremely obvious, and you can’t sample GPT with naive settings (no top-p, no repetition penalty) without running into it well before hundreds or thousands of samples.) Maybe I’m wrong and there is repetition at some level which isn’t obvious to the naked eye, like high-frequency Fourier components (although I’m not sure how that would be possible with the superpixel approach), and if someone shows me iGPT (or DALL-E/CogView, or DDIM, or...) samples which are clearly the repetition trap in action, I’ll change my mind. But it’s been years now, so I’m not holding my breath.
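One crude way to go beyond eyeballing would be an n-gram duplication score over the sampled token sequences, which applies to image-GPT token streams just as easily as to text (the function name and the choice of n=4 are my own, purely illustrative):

```python
def repetition_score(tokens, n=4):
    """Fraction of n-grams in a sample that exactly duplicate an earlier
    n-gram. A score near 1.0 means the sample has fallen into verbatim
    looping; a sample with no repeated n-grams scores 0.0. This only
    catches exact repetition, not subtler periodic structure (e.g. the
    high-frequency Fourier components mentioned above)."""
    if len(tokens) < n:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    seen, dupes = set(), 0
    for g in grams:
        if g in seen:
            dupes += 1
        seen.add(g)
    return dupes / len(grams)
```

Running something like this over large batches of iGPT vs. GPT samples would make the "text models loop, image models don't" claim quantitative rather than anecdotal.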