Oh whoops, my bad, yeah, I didn’t double-check that and added a few OOM to #tokens. (And in retrospect I should have expected training cost to scale as ~params^2 for this kind of model.)
As to the 2nd question: at ~10^12 FLOPs/token, which becomes ~10^15 for a batch of 1000 independent tokens, it already takes a number of GPUs or a small cluster to run in real time, so even a 10^3-rollout MCTS is a significant ask, and for what I assume isn’t worth the cost (it may only buy a sentence or two of lookahead, and I recall they already experimented with some simple non-MCTS forms of rollouts, although that may have been the cortex variant).
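To make that arithmetic concrete, here’s a rough Python back-of-the-envelope; the 1e12 FLOPs/token figure is the one above, while the per-GPU throughput and generation rate are just assumed for illustration:

```python
# Rough back-of-the-envelope for the cost estimate above. The ~1e12 FLOPs/token
# figure is from the comment; the per-GPU throughput and target generation rate
# below are assumptions picked purely for illustration.

flops_per_token = 1e12              # forward pass of a GPT-3-scale model, per token
rollout_tokens  = 1_000             # ~10^3 rollout evaluations per generated token
flops_per_step  = flops_per_token * rollout_tokens   # ~1e15 FLOPs per output token

gpu_flops_per_sec     = 100e12      # assumed sustained throughput per GPU (~100 TFLOP/s)
target_tokens_per_sec = 10          # assumed "real-time-ish" generation rate

gpus_needed = flops_per_step * target_tokens_per_sec / gpu_flops_per_sec
print(f"~{gpus_needed:.0f} GPUs to sustain {target_tokens_per_sec} tokens/sec")
# -> on the order of 100 GPUs, i.e. a small cluster, for a 10^3-wide search
```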
Naively I’d expect that GPT-3 spending 1000x more compute per token on rollouts or MCTS would produce text that is noticeably more coherent, correct, etc. At least with some fine-tuning, I’d expect that. If it doesn’t help much, or at all, that would be good to know!
So to be clear, I do think some form of planning is an obvious route forward for NLMs like GPT-3, and that MCTS is the current default choice, but the benefit mostly comes from full training of the policy with MCTS. And 1000x GPT-3’s training cost seems a tad steep; I’m pretty confident that better algorithms are available for much less than billions of dollars.
Wanna venture any predictions? As far as I know there hasn’t been any serious attempt to train a large language model with MCTS or planning; are you predicting there will be one in the next three years, and that it will be exciting / set SOTAs?
I would argue that we have seen a lot of dramatic performance gains with large models using planning. We just call the random-shooting method of planning ‘sampling’ or ‘best-of sampling’. When you roll out 20 or 100 answers to a math or translation problem with GPT-3 etc. and you score them by likelihood to rank them, or feed them back into the model for a classification to accept/reject, this is not MCTS, but is this not planning?
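For concreteness, a minimal sketch of that best-of-n / random-shooting procedure; the `sample` and `logprob` callables are hypothetical stand-ins, not any particular API:

```python
from typing import Callable, List, Tuple

# Random-shooting "planner": sample k completions, then rank them by the model's
# own likelihood. `sample` and `logprob` are hypothetical stand-ins for whatever
# LM interface is available; nothing here is specific to GPT-3 or any library.

def best_of_n(prompt: str,
              sample: Callable[[str], str],          # prompt -> one sampled completion
              logprob: Callable[[str, str], float],  # (prompt, completion) -> log p(completion | prompt)
              k: int = 100) -> str:
    candidates: List[str] = [sample(prompt) for _ in range(k)]
    scored: List[Tuple[float, str]] = [(logprob(prompt, c), c) for c in candidates]
    # Keep the highest-likelihood completion; a variant feeds candidates back into
    # the model for an accept/reject classification instead of ranking by likelihood.
    return max(scored, key=lambda t: t[0])[1]
```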
Good point, I do recall that happening and I do recall there were gainz… If even that simple method produces dramatic performance improvements, wanna venture any predictions about whether more advanced methods will be tried in the next three years and whether they’ll lead to further dramatic improvements?
I’m less sure about that. There’s the strange and troubling repetition trap, which still afflicts the largest models as far as we know. And as you know from the MCTS & DRL literature, planning has rapidly diminishing returns compared to using a better base model even in problems with extremely clear objectives. The current random-shoot/criticize/self-distill approach may get you most of the gains already.
Thanks. Why is the repetition trap strange and troubling? If I saw a word or phrase repeated thrice, and were asked to predict the next word, I’d just keep repeating that word or phrase. Heck, even if it were only repeated twice, I’d probably seriously consider it. Who’s to say these LMs aren’t behaving optimally?
It’s strange and troubling for several reasons. It means that any tree search using likelihood ultimately gets sucked into the attractor of a repetition trap, which is pragmatically bad: there are no complete/consistent search procedures, and you can’t just plan deeper to get monotonically more quality, no matter how much compute you’re willing to spend. It raises a lot of questions about what the predictions mean or do (if you can’t score a sequence or tree by likelihood, no matter how high-quality your starting point, what does the likelihood even mean?!). It means even the simplest sampling procedure may send the model haywire if it ever chances to repeat a word. It doesn’t seem to be improving with scaling. And it doesn’t afflict non-likelihood, non-autoregressive*, non-text models (eg if you want to argue ‘oh sure, it’s totally plausible to just predict sequences consisting of the same token, like “aaaaaa” repeated eternally, that’s fine’, why don’t the, say, diffusion text models or image-GPT models also suffer from it?). I don’t think humans would put much probability on such sequences, even conditionally: we’d think that at some point the sequence would stop, because why would there be such gibberish?
* I’m not even sure of this. I’ve noticed that I’ve never seen anyone talk about the repetition trap in bidirectional text models’ generation of text, but does that mean they are immune for some reason where unidirectional models fall prey to it, or just that no one uses them for freeform text generation enough to bother noting it?
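To put some toy numbers on the tree-search point (all invented, not from any actual model), this is roughly why the repetitive branch eventually beats every ordinary branch under a pure likelihood score:

```python
import math

# Toy numbers (all invented) for why likelihood-guided tree search gets pulled
# into the repetition attractor: once repetition starts, each further repeat
# costs almost nothing in log-probability, while ordinary text keeps paying a
# real per-token cost.

p_ordinary = 0.15   # assumed typical per-token probability of natural text
p_misstep  = 0.001  # assumed probability of the first over-repetitive token
p_repeat   = 0.98   # assumed probability of repeating once the loop has started

def logprob_ordinary(n_tokens: int) -> float:
    return n_tokens * math.log(p_ordinary)

def logprob_repetitive(n_tokens: int) -> float:
    return math.log(p_misstep) + (n_tokens - 1) * math.log(p_repeat)

for n in (2, 5, 10, 20):
    print(n, round(logprob_ordinary(n), 1), round(logprob_repetitive(n), 1))
# The repetitive branch starts out worse (the misstep itself is unlikely), but its
# score decays so slowly that it overtakes every ordinary branch within a few
# tokens and dominates from then on, no matter how deep you search.
```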
I don’t think humans would put much probability on such sequences, even conditionally: we’d think that at some point the sequence would stop, because why would there be such gibberish?
I think the intuition behind your remark “why would there be such gibberish?” actually goes most of the way to explaining the repetition trap.
The key thing about pathologically repetitive sequences is that they are . . . pathologically repetitive, i.e. out-of-distribution for natural text.
Once you’re already in one, I don’t think it’s really so obvious that the repetition should eventually stop. Yes, that’s what a human writer would do—but a human writer wouldn’t have produced the conditioning sequence to begin with.
We start out with a prior that puts high weight on “this belongs to some natural genre of text,” and low weight on “this belongs to a weird hyper-repetitive ‘genre’ of text.” But eventually, after enough bad predictions from the former and enough accurate predictions from the latter, we really ought to yield to the evidence and update. Eventually it should become clear that the question “why would there be such gibberish?” has some answer, since we keep observing “such gibberish” and not anything else.
But why does LM sampling enter the trap to begin with? I think there needs to be some “initial misstep,” where a sampled token makes the text just a bit too repetitive. This makes further repetition more likely (because the text is oddly repetitive) and everything else less likely (because the text is odd / OOD), so further repetition occurs, which makes the text more OOD and makes repetition a steadily better bet, and so on.
In other words, repetition is special because it’s a way of going off-distribution where there is, nonetheless, a single “obvious” way to continue the text, and continuing it thus will keep you in the same off-distribution region. Whereas most ways of going off-distribution are just confusing, and don’t have a legible structure the LM would have learned from in-distribution training.
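As a toy illustration of that feedback loop (invented numbers, not a real LM), a sketch along these lines shows the lock-in:

```python
import random

# A toy version of the feedback loop described above: the more consecutive
# repeats already in the context, the more probability the (toy) model puts on
# repeating once more. The numbers are invented purely for illustration; this is
# not a real language model.

def p_repeat(run_length: int) -> float:
    # Assumed shape: a small baseline chance of a repetitive "misstep", growing
    # quickly as the context becomes more repetitive.
    return min(0.1 + 0.45 * run_length, 0.99)

def final_run_length(steps: int = 50, seed: int = 0) -> int:
    rng = random.Random(seed)
    run = 0                      # consecutive repeats so far
    for _ in range(steps):
        if rng.random() < p_repeat(run):
            run += 1             # repeated again: context is now even more repetitive
        else:
            run = 0              # broke out: back to "normal" text
    return run

stuck = sum(final_run_length(seed=s) >= 5 for s in range(1000))
print(f"{stuck / 10:.0f}% of simulated samples end inside a long repetition run")
# With these made-up numbers, most samples end up locked into repetition.
```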
I would expect scale to lower the probability of the “initial mistake,” and thus reduce the fraction of samples that are repetitive (is this borne out in practice?). I don’t expect scale to make LMs stop assigning high likelihood to repetitive continuations of unnaturally repetitive prefixes, since I think that’s a conditionally correct judgment.
For practical purposes, I’ve found my custom sampler to be a pretty effective solution, though sometimes the LM still “wins the fight,” as in this amusing example.
Do you have a source for iGPTs not exhibiting the repetition trap? Not that I don’t believe you, I just would have expected otherwise, so I’m curious.
But why does LM sampling enter the trap to begin with? I think there needs to be some “initial misstep,” where a sampled token makes the text just a bit too repetitive. This makes further repetition more likely (because the text is oddly repetitive) and everything else less likely (because the text is odd / OOD), so further repetition occurs, which makes the text more OOD and makes repetition a steadily better bet, and so on.
I think that’s a possible interpretation. I’m still not sure why it wouldn’t affect all the other possible models, though, and it seems like we should also see the problem get better with model scaling as the ‘misstep’ estimation error disappears. If you are sampling token by token, the probabilities from GPT-3 over the 51k BPEs ought to be much better than GPT-2’s (also 51k BPEs, also English text scraped from the Internet), etc: after all, that is the token it has the very highest predictive accuracy on. How accurate does a model have to get on the initial token before the initial misstep stops screwing up not just tree search, but regular sampling too?
I would expect scale to lower the probability of the “initial mistake,” and thus reduce the fraction of samples that are repetitive (is this borne out in practice?).
It doesn’t really seem like it. I think if you have the impression that it is, that’s because we use sampling strategies designed specifically to eliminate it, like top_p. Nucleus sampling definitely does tamp it down, but I don’t think it’s perfect, and it’s clearly a hack which doesn’t fix the problem with tree search and introduces biases of its own, just like top-k. Regular sampling still seems to go haywire. (I dunno if anyone has checked GPT-3 the way the nucleus-sampling paper and others checked GPT-2 and other models. That would get expensive.)
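For reference, a minimal sketch of what nucleus/top-p sampling actually does, on made-up toy probabilities rather than real model outputs:

```python
import numpy as np

# Minimal sketch of nucleus (top-p) sampling, the hack discussed above: keep only
# the smallest set of tokens whose probabilities sum to at least p, renormalize,
# and sample from that truncated distribution. Toy probabilities, not a real LM.

def nucleus_sample(probs: np.ndarray, p: float = 0.95, rng=None) -> int:
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]              # token ids sorted by descending probability
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1    # smallest prefix with cumulative mass >= p
    keep = order[:cutoff]
    truncated = probs[keep] / probs[keep].sum()  # renormalize over the "nucleus"
    return int(rng.choice(keep, p=truncated))

# Toy distribution where one token (say, the repeat) has grabbed most of the mass:
probs = np.array([0.80, 0.10, 0.05, 0.03, 0.02])
print(nucleus_sample(probs, p=0.85))  # only tokens 0 and 1 survive the truncation here
```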
Do you have a source for iGPTs not exhibiting the repetition trap? Not that I don’t believe you, I just would have expected otherwise, so I’m curious.
I’ve seen hundreds of iGPT completions and random examples, and not a single one ever just ‘starts repeating’ ad nauseam; nor has anyone ever pointed such failure cases out. (In contrast, the ‘tessellation’ that naive CLIP maximization causes is extremely obvious, and you can’t sample GPT on naive non-top-p/repetition-penalization settings without running into it well before hundreds/thousands of examples.) Maybe I’m wrong and there is repetition at some level which isn’t obvious to the naked eye, like high-frequency Fourier components (although I’m not sure how that would be possible with the superpixel approach). If someone shows me iGPT (or DALL-E/CogView, or DDIM, or...) samples which are clearly the repetition trap in action, I’ll change my mind, but it’s been years now, so I’m not holding my breath.
If repetitions arise from sampling merely due to high conditional probability given an initial “misstep”, they should be avoidable in an MCTS that sought to maximize unconditional probability of the output sequence (or rather conditional upon its input but not upon its own prior output). After entering the “trap” once or a few times, it would simply avoid the unfortunate misstep in subsequent “playouts”. From my understanding, that is.
Thanks! Very interesting point about the lack of this in image-GPT etc. I have no comment there, not understanding them on a technical level.
I totally think that humans would put lots of (conditional) probability on such sequences. It’s true that we’d predict the sequence would stop eventually, but that’s not relevant; what’s relevant is: You see it repeating for N times so far. What’s the probability that it goes on for at least N+1 times total? That probability goes up and up with N, not down and down, even though you are supremely confident that N is finite and even though your (unconditional) credence in the sequence as a whole goes down and down with N.
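A toy two-source mixture makes this concrete (numbers invented purely for illustration):

```python
# A toy two-source mixture (all numbers invented) making the point concrete:
# even a tiny prior on a degenerate "stuck" source is enough to push the
# predictive probability of "one more repeat" toward 1 as N grows.

p_repeat_normal = 0.01    # assumed chance that normal text repeats the phrase again
p_repeat_stuck  = 0.999   # assumed chance that the degenerate source repeats again
prior_stuck     = 1e-6    # assumed prior weight on the degenerate source

def p_one_more_repeat(n: int) -> float:
    # Posterior-predictive P(repeat N+1 times | observed N repeats) under the mixture.
    like_normal = (1 - prior_stuck) * p_repeat_normal ** n
    like_stuck  = prior_stuck * p_repeat_stuck ** n
    post_stuck  = like_stuck / (like_normal + like_stuck)
    return (1 - post_stuck) * p_repeat_normal + post_stuck * p_repeat_stuck

for n in (0, 1, 2, 3, 5, 10):
    print(n, round(p_one_more_repeat(n), 3))
# The conditional probability of continuing climbs toward ~1 with N, even though
# the unconditional credence in the whole repetitive sequence keeps falling.
```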
I think you may be correct that even humans would increase the probability of a repetition continuing with N, up to a point. The difference could be that humans are using a much larger compressed historical context, so when reading something like Moby Dick, the prior for any serious repetition is absurdly low, and it never comes up.
Also, humans read fundamentally differently, through vision: even when the retina is focusing on just a word or two at a time, you are also getting some bits of signal about the surrounding future text, and big repetitions would be fairly obvious.