(Aside: Why do you think GPT3.5-turbo (most recent release) isn’t MOE? I’d guess that if GPT4 is MOE, GPT3.5 is also.)
Because GPT-3.5 is a fine-tuned version of GPT-3, which is known to be a vanilla dense transformer.
GPT-4 is probably, in a very funny turn of events, a few dozen fine-tuned GPT-3.5 clones glued together (as a MoE).