If that's the explanation, it should appear everywhere one has variable context windows, and it should apply to pretty much all models or to none. I would also expect the benefits to show up more broadly across benchmarks, rather than affecting only a very few dramatically.
It seems plausible to me that you'd see some sort of reasonably sharp improvement curve with scale, like with CoT. So I think it's conceivable that GPT-4 is just big enough to learn during pretraining an algorithm for benefiting a bit from filler, while other models aren't.
Filler tokens are probably an insanely inefficient way of giving the LLM more compute, and algorithmic improvements should dominate. But there's no reason to expect strong efficiency here right now.
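For concreteness, a filler-token comparison of the sort being discussed looks roughly like the sketch below; the model name, prompt wording, and padding scheme are my own illustrative assumptions, not the original experimental setup:

```python
# Rough sketch of a filler-token vs. baseline comparison (illustrative only).
from openai import OpenAI

client = OpenAI()

QUESTION = ("A farmer plants 3 fields, each with 17 rows of 12 plants. "
            "How many plants are there in total?")

def ask(prompt: str, model: str = "gpt-4") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Baseline: answer immediately, no intermediate tokens.
baseline = ask(QUESTION + "\nAnswer with only the final number.")

# Filler condition: extra forward passes are spent on meaningless '.' tokens
# rather than on an actual chain of thought.
filler = ask(QUESTION + "\nFirst output a line of 30 '.' characters, "
             "then the final number on the next line.")

print("baseline:", baseline)
print("filler:  ", filler)
```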
You can always ‘save the appearances’ by invoking an emergence, yes. I don’t like it, though, because Claude isn’t that far from GPT-4 in capabilities, so that hypothesis is threading a narrow window indeed. A more natural split would have been GPT-3 vs Claude+GPT-4. (We know lots of stuff happens in between GPT-3 and Claude+GPT-4 levels, so speculations about something happening in there, or a floor effect, are much less finetuned.) I’m also a bit suspicious of the emergence idea here at all: I mean, Transformers are usually trained with full context windows, simply filling them up and separating documents with EOT, aren’t they? So why would GPT-4 learn anything special about simple generic sorts of computation inside its forward pass that GPT-3s or Claudes couldn’t?
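(For reference, the packing regime I'm describing looks roughly like the sketch below; the EOT id and window length are placeholders, not any particular model's values.)

```python
# Minimal sketch of standard pretraining packing: tokenized documents are
# concatenated, separated only by an EOT token, and cut into full windows.
from typing import Iterable, List

EOT_ID = 50256      # e.g. GPT-2's <|endoftext|> id; placeholder here
CONTEXT_LEN = 2048  # placeholder window length

def pack(documents: Iterable[List[int]]) -> List[List[int]]:
    stream: List[int] = []
    for doc in documents:
        stream.extend(doc)
        stream.append(EOT_ID)  # documents separated by EOT, nothing else
    # Chop the flat stream into completely full context windows: in this
    # regime the model never sees padding, only real tokens and EOT.
    return [stream[i:i + CONTEXT_LEN]
            for i in range(0, len(stream) - CONTEXT_LEN + 1, CONTEXT_LEN)]
```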
Yep, agreed. But it's worth noting that the other hypotheses for why this happens still have to explain why no other model shows this behavior (including GPT-3.5). So I'm not sure the capability having a sudden onset adds that much additional surprise.
To elaborate: my other two hypotheses are less fine-tuned and can explain why GPT-3.5 doesn't show the behavior, because GPT-3.5 is either dense (rather than a MoE) or smaller than GPT-4. In the second hypothesis, GPT-3/GPT-3.5/Claude being dense models vs GPT-4 being a MoE is a fundamental architecture difference, so that alone is eyebrow-raising: the only MoE out of 4 or 5 models is also the one acting weirdest with padding tokens? (And I would note that GPT-4 acts weird in other ways: aside from the temperature nondeterminism, I've noted that GPT-4 has bizarre blind spots where it is unable to do simple tasks that it ought to be able to do. So we don't need to additionally hypothesize that the internals of GPT-4 are doing something weird; we already know for a fact that something is weird in GPT-4.) And the difference between GPT-3/GPT-3.5 and GPT-4 is larger than between Claude and GPT-4, so if you can believe a major GSM8K difference emerging between Claude and GPT-4, you should find it far easier to believe the first hypothesis: a difference due to the scale jump from GPT-3/GPT-3.5 to GPT-4.
(Aside: why do you think GPT-3.5-turbo (the most recent release) isn't a MoE? I'd guess that if GPT-4 is a MoE, GPT-3.5 is also.)
Because GPT-3.5 is a fine-tuned version of GPT-3, which is known to be a vanilla dense transformer.
GPT-4 is probably, in a very funny turn of events, a few dozen fine-tuned GPT-3.5 clones glued together (as a MoE).
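(A toy sketch of that "clones glued together" picture, i.e. a token-level top-k mixture-of-experts layer; the sizes, k, and routing scheme below are generic MoE assumptions, not a claim about GPT-4's actual design.)

```python
# Toy top-k mixture-of-experts layer: each token is routed to k of n expert FFNs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int = 512, n_experts: int = 16, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList(                 # the "clones": identical FFN blocks
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)           # (n_tokens, n_experts)
        weights, idx = gate.topk(self.k, dim=-1)            # pick k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                     # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 10 tokens through a 16-expert layer, 2 experts per token.
if __name__ == "__main__":
    layer = TopKMoE()
    print(layer(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```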