Adding filler tokens seems like it should always be neutral or harmful to a model’s performance: a fixed prefix designed to be meaningless across all tasks cannot provide any information to locate the task (so no meta-learning), and cannot store any information about the in-progress task (so no amortized computation combining results from multiple forward passes).
I thought the idea was that in a single forward pass, the model has more tokens to think in. That is, the task description on its own is, say, 100 tokens long; with the filler tokens, it’s now, say, 200 tokens long. In principle, precisely because the filler tokens are useless/unnecessary, the model can put task-relevant computation into the residual stream at those positions.
Yes, I probably should’ve mentioned that in an optimized causal Transformer, you might expect this kind of behavior: the tokens are in fact useless for the reasons outlined, but they are undoing your optimization of skipping the parts of the Transformer which correspond to the empty context-window inputs. That is, the tokens do nothing, but you are literally not running the same model: you are running a subset of the ‘full’ model, and then accidentally more of it as you add tokens. That is where the extra computation comes from; it wasn’t really ‘extra’ to begin with. If you were running the causal Transformer in the naive way, equivalent to always running with a full context window padded out, with no packing or shortcut tricks, then there would be no benefit to padding, because you would always be running the ‘full’ model regardless of padded token count.
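To make that distinction concrete, here’s a crude per-forward-pass cost model (the function, dimensions, and constants are illustrative assumptions, not any real serving stack’s numbers):

```python
def toy_forward_cost(n_tokens: int, d_model: int = 4096, n_layers: int = 32) -> int:
    """Crude per-layer cost: attention ~ n^2 * d, feed-forward ~ n * d^2."""
    attention = n_tokens ** 2 * d_model
    feed_forward = 8 * n_tokens * d_model ** 2
    return n_layers * (attention + feed_forward)

# "Optimized" serving runs only the positions actually present, so filler
# tokens genuinely add forward-pass compute (and residual-stream slots):
print(toy_forward_cost(100))   # bare 100-token task description
print(toy_forward_cost(200))   # task + 100 filler tokens: roughly 2x the work

# "Naive" serving pads every prompt out to the full context window, so the
# same 'full' model runs either way and filler tokens change nothing:
print(toy_forward_cost(2048))  # identical cost with or without filler
```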
However, I expect that if it were this obvious, someone would have already demonstrated it (surely filling out the context window with padding would be the very first thing anyone would try? or just looking at the activation patterns to see whether the better-performing results are causally influenced by the null parts of the model’s forward pass; a sketch of such a check is at the end of this comment), and this also can’t really explain the OP’s observation of why the apparent benefit appears on GSM8K in the MoE model (GPT-4) but in none of the others (GPT-3s, Claude). It should appear everywhere that one has variable context windows and should apply to pretty much all models or none, if that’s the explanation. I would also expect the benefits to show up more broadly across benchmarks, rather than affect a very few dramatically.
So, since it doesn’t explain any of the observations (while my two suggestions at least could), it seems unlikely.
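As for the activation-pattern check mentioned above, one concrete version is a causal ablation: cut off whatever the filler positions compute and see whether the answer distribution changes. A minimal sketch, assuming a TransformerLens-style hooked model (the prompt, the filler span, and the choice of gpt2-small are placeholders for illustration, not the actual experiment):

```python
import torch
from transformer_lens import HookedTransformer  # assumes TransformerLens is installed

model = HookedTransformer.from_pretrained("gpt2-small")

# Illustrative prompt: a short question preceded/followed by a block of filler.
filler = " ." * 50
prompt = "Question: What is 17 * 23?" + filler + " Answer:"
tokens = model.to_tokens(prompt)

# Hypothetical span covering the filler positions; a real experiment would
# compute this from the tokenizer output rather than hard-coding it.
filler_positions = list(range(10, 60))

def zero_filler_values(value, hook):
    # Zero the value vectors at the filler positions so that later tokens
    # cannot read anything computed there.
    value[:, filler_positions, :, :] = 0.0
    return value

logits_normal = model(tokens)
logits_ablated = model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{layer}.attn.hook_v", zero_filler_values)
               for layer in range(model.cfg.n_layers)],
)

# If the filler positions are doing useful work, ablating them should shift
# the answer distribution; if they are inert, the two should match closely.
print(torch.max(torch.abs(logits_normal[0, -1] - logits_ablated[0, -1])))
```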
It should appear everywhere that one has variable context windows and should apply to pretty much all models or none, if that’s the explanation. I would also expect the benefits to show up more broadly across benchmarks, rather than affect a very few dramatically.
It seems plausible to me that you’d see some sort of reasonably sharp improvement curve with scale, like CoT. So I think it’s conceivable that GPT-4 is just big enough to learn, during pretraining, an algorithm for benefiting a bit from filler tokens, while the other models aren’t.
Filler tokens are probably an insanely inefficient way of getting more compute out of the LLM, and algorithmic improvements should dominate. But there’s no reason to expect strong efficiency here right now.
You can always ‘save the appearances’ by invoking an emergence, yes. I don’t like it, though, because Claude isn’t that far from GPT-4 in capabilities, so that hypothesis is threading a narrow window indeed. A more natural split would have been GPT-3 vs Claude+GPT-4. (We know lots of stuff happens in between GPT-3 and Claude+GPT-4 levels, so speculations about something happening in there, or a floor effect, are much less fine-tuned.) I’m also a bit suspicious of the emergence idea here in the first place: I mean, Transformers are usually trained with full context windows, simply filling them up and separating documents with EOT, aren’t they? So why would GPT-4 learn anything special about simple generic sorts of computation inside its forward pass that GPT-3s or Claudes couldn’t?
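For reference, the packing setup I mean looks roughly like this (a sketch; the EOT id and context length are GPT-2/3-style values used purely for illustration):

```python
from typing import Iterable, List

EOT_ID = 50256   # GPT-2/3-style <|endoftext|> token id (illustrative)
CTX_LEN = 2048   # context window length (illustrative)

def pack_documents(tokenized_docs: Iterable[List[int]]) -> List[List[int]]:
    """Concatenate documents, separated by EOT, into one token stream, then
    chop the stream into full-length training sequences, so every training
    example fills the context window."""
    stream: List[int] = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(EOT_ID)
    return [stream[i:i + CTX_LEN]
            for i in range(0, len(stream) - CTX_LEN + 1, CTX_LEN)]
```

Under a setup like this, every model in the comparison was trained on full windows, so ‘more positions than the prompt needs’ is not something only GPT-4 would have seen.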
Yep, agreed. But it’s worth noting that other hypotheses for why this happens still have to explain why no other model shows this behavior (including GPT-3.5). So I’m not sure the capability having a sudden onset adds that much additional surprise.
To elaborate, my other two hypotheses are less fine-tuned and can explain why GPT-3.5 doesn’t show it, because it’s denser or smaller than GPT-4. In the second hypothesis, the GPT-3/GPT-3.5/Claude dense models vs the GPT-4 MoE is a fundamental architecture difference, so that alone is eyebrow-raising: the only MoE out of 4 or 5 models is also the one acting weirdest with padding tokens? (And I would note that GPT-4 acts weird in other ways: aside from the temperature nondeterminism, I’ve noted that GPT-4 has bizarre blind spots where it is unable to do simple tasks that it ought to be able to do. So we don’t need to additionally hypothesize that the internals of GPT-4 are doing something weird; we already know for a fact that something is weird in GPT-4.) The difference between GPT-3/GPT-3.5 and GPT-4 is larger than the difference between Claude and GPT-4, so if you can believe a major GSM8K difference emerging between Claude and GPT-4, then you must find it far easier to believe the first hypothesis: a difference due to the scale-up from GPT-3/GPT-3.5 to GPT-4.
(Aside: Why do you think GPT-3.5-turbo (the most recent release) isn’t MoE? I’d guess that if GPT-4 is MoE, GPT-3.5 is too.)
Because GPT-3.5 is a fine-tuned version of GPT-3, which is known to be a vanilla dense Transformer.
GPT-4 is probably, in a very funny turn of events, a few dozen fine-tuned GPT-3.5 clones glued together (as a MoE).
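For concreteness, the dense-vs-MoE difference being leaned on in this thread looks like this at the layer level: a learned router sends each token through a few expert feed-forward blocks instead of the single shared one a dense Transformer uses. This is a generic sketch of the technique with made-up sizes; nothing here reflects GPT-4’s actual, unpublished architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Generic mixture-of-experts feed-forward block with top-k routing."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, d_model]
        gate_logits = self.router(x)                           # [batch, seq, n_experts]
        weights, chosen = gate_logits.topk(self.top_k, dim=-1) # route each token to top_k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[..., slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage sketch: each token is processed by 2 of the 16 experts.
moe = MoEFeedForward(d_model=64, d_ff=256)
y = moe(torch.randn(2, 10, 64))
```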