You can always ‘save the appearances’ by invoking an emergence, yes. I don’t like it, though, because Claude isn’t that far from GPT-4 in capabilities, so that hypothesis is threading a narrow window indeed. A more natural split would have been GPT-3 vs Claude+GPT-4. (We know lots of stuff happens in between GPT-3 and Claude+GPT-4 levels, so speculations about something happening in there, or a floor effect, are much less finetuned.) I’m also a bit suspicious of the emergence idea here at all: I mean, Transformers are usually trained with full context windows, simply filling them up and separating documents with EOT, aren’t they? So why would GPT-4 learn anything special about simple generic sorts of computation inside its forward pass that GPT-3s or Claudes couldn’t?
Yep, agreed. But it’s worth noting that other hypotheses for why this happens still have to explain why no other model has this behavior (including GPT3.5). So I’m not sure we take that much additional surprise from the capability having sudden onset.
To elaborate, my other two hypotheses are less finetuned and can handle explaining why GPT-3.5 doesn’t, because it’s denser or smaller than GPT-4: in the second hypothesis, GPT-3/GPT-3.5/Claude dense models vs GPT-4 MoE is a fundamental architecture difference, so that alone is eyebrow-raising—the only MoE out of 4 or 5 models is also the one acting weirdest using padding tokens? (And I would note that GPT-4 acts weird in other ways: aside from the temperature nondeterminism, I’ve noted that GPT-4 has bizarre blind spots where it is unable to do simple tasks that it ought to be able to do. So we don’t need to additionally hypothesize that the internals of GPT-4 are doing something weird—we already know for a fact that something is weird in GPT-4.) The difference between GPT-3/GPT-3.5 and GPT-4 is larger than between Claude and GPT-4, so if you can believe a major GSM8K difference emerging between Claude/GPT-4, then you must find it far easier to believe the first hypothesis a difference due to the scale from GPT-3/GPT-3.5 to GPT-4.
You can always ‘save the appearances’ by invoking an emergence, yes. I don’t like it, though, because Claude isn’t that far from GPT-4 in capabilities, so that hypothesis is threading a narrow window indeed. A more natural split would have been GPT-3 vs Claude+GPT-4. (We know lots of stuff happens in between GPT-3 and Claude+GPT-4 levels, so speculations about something happening in there, or a floor effect, are much less finetuned.) I’m also a bit suspicious of the emergence idea here at all: I mean, Transformers are usually trained with full context windows, simply filling them up and separating documents with EOT, aren’t they? So why would GPT-4 learn anything special about simple generic sorts of computation inside its forward pass that GPT-3s or Claudes couldn’t?
Yep, agreed. But it’s worth noting that other hypotheses for why this happens still have to explain why no other model has this behavior (including GPT3.5). So I’m not sure we take that much additional surprise from the capability having sudden onset.
To elaborate, my other two hypotheses are less finetuned and can handle explaining why GPT-3.5 doesn’t, because it’s denser or smaller than GPT-4: in the second hypothesis, GPT-3/GPT-3.5/Claude dense models vs GPT-4 MoE is a fundamental architecture difference, so that alone is eyebrow-raising—the only MoE out of 4 or 5 models is also the one acting weirdest using padding tokens? (And I would note that GPT-4 acts weird in other ways: aside from the temperature nondeterminism, I’ve noted that GPT-4 has bizarre blind spots where it is unable to do simple tasks that it ought to be able to do. So we don’t need to additionally hypothesize that the internals of GPT-4 are doing something weird—we already know for a fact that something is weird in GPT-4.) The difference between GPT-3/GPT-3.5 and GPT-4 is larger than between Claude and GPT-4, so if you can believe a major GSM8K difference emerging between Claude/GPT-4, then you must find it far easier to believe the first hypothesis a difference due to the scale from GPT-3/GPT-3.5 to GPT-4.
(Aside: Why do you think GPT3.5-turbo (most recent release) isn’t MOE? I’d guess that if GPT4 is MOE, GPT3.5 is also.)
Because GPT-3.5 is a fine-tuned version of GPT-3, which is known to be a vanilla dense transformer.
GPT-4 is probably, in a very funny turn of events, a few dozen fine-tuned GPT-3.5 clones glued together (as a MoE).