Yep, agreed. But it’s worth noting that other hypotheses for why this happens still have to explain why no other model has this behavior (including GPT3.5). So I’m not sure we take that much additional surprise from the capability having sudden onset.
To elaborate, my other two hypotheses are less finetuned and can handle explaining why GPT-3.5 doesn’t, because it’s denser or smaller than GPT-4: in the second hypothesis, GPT-3/GPT-3.5/Claude dense models vs GPT-4 MoE is a fundamental architecture difference, so that alone is eyebrow-raising—the only MoE out of 4 or 5 models is also the one acting weirdest using padding tokens? (And I would note that GPT-4 acts weird in other ways: aside from the temperature nondeterminism, I’ve noted that GPT-4 has bizarre blind spots where it is unable to do simple tasks that it ought to be able to do. So we don’t need to additionally hypothesize that the internals of GPT-4 are doing something weird—we already know for a fact that something is weird in GPT-4.) The difference between GPT-3/GPT-3.5 and GPT-4 is larger than between Claude and GPT-4, so if you can believe a major GSM8K difference emerging between Claude/GPT-4, then you must find it far easier to believe the first hypothesis a difference due to the scale from GPT-3/GPT-3.5 to GPT-4.
Yep, agreed. But it’s worth noting that other hypotheses for why this happens still have to explain why no other model has this behavior (including GPT3.5). So I’m not sure we take that much additional surprise from the capability having sudden onset.
To elaborate, my other two hypotheses are less finetuned and can handle explaining why GPT-3.5 doesn’t, because it’s denser or smaller than GPT-4: in the second hypothesis, GPT-3/GPT-3.5/Claude dense models vs GPT-4 MoE is a fundamental architecture difference, so that alone is eyebrow-raising—the only MoE out of 4 or 5 models is also the one acting weirdest using padding tokens? (And I would note that GPT-4 acts weird in other ways: aside from the temperature nondeterminism, I’ve noted that GPT-4 has bizarre blind spots where it is unable to do simple tasks that it ought to be able to do. So we don’t need to additionally hypothesize that the internals of GPT-4 are doing something weird—we already know for a fact that something is weird in GPT-4.) The difference between GPT-3/GPT-3.5 and GPT-4 is larger than between Claude and GPT-4, so if you can believe a major GSM8K difference emerging between Claude/GPT-4, then you must find it far easier to believe the first hypothesis a difference due to the scale from GPT-3/GPT-3.5 to GPT-4.
(Aside: Why do you think GPT3.5-turbo (most recent release) isn’t MOE? I’d guess that if GPT4 is MOE, GPT3.5 is also.)
Because GPT-3.5 is a fine-tuned version of GPT-3, which is known to be a vanilla dense transformer.
GPT-4 is probably, in a very funny turn of events, a few dozen fine-tuned GPT-3.5 clones glued together (as a MoE).