From what I remember, the training-compute optimal number of experts was something like 64.
I think it only gets better with more experts if you keep the number of active parameters unchanged. Is there some setting where it gets worse after a while? There certainly are engineering difficulties and diminishing returns.
Also, the number of activated experts can vary (there are 8 activated routed experts in DeepSeek-V3 out of 256 in total), so “number of experts” doesn’t really capture the ratio of total to activated experts and probably isn’t a good anchor by itself.
Given newer implementations, and aiming for inference-compute optimality rather than training-compute optimality, it seems plausible that considerably more than 64 experts could be a good choice.
This still doesn’t help with the question of why 37B active parameters is sufficient. Even with 100500 experts you can’t expect 1B active parameters to be sufficient to maintain GPT-4 quality. The rumor for original GPT-4 is that it has 2 activated experts out of 16 in total, so the ratio is 1:8, while for DeepSeek-V3 it’s 1:32.
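Just to spell out the arithmetic on those ratios, here’s a quick sketch using only the expert counts quoted above (routed experts only, so this is not the same thing as the total-to-active parameter ratio, and the GPT-4 numbers are rumored rather than confirmed):

```python
# Total-to-activated ratio for routed experts, using the figures above.
# (GPT-4 numbers are rumored, not confirmed.)
gpt4_total, gpt4_active = 16, 2
dsv3_total, dsv3_active = 256, 8

print(f"GPT-4 (rumored): 1:{gpt4_total // gpt4_active}")   # 1:8
print(f"DeepSeek-V3:     1:{dsv3_total // dsv3_active}")   # 1:32
```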
That’s why I wrote: “possibly 4x fewer training steps for the same number of tokens if predicting tokens only once” (assuming 4 tokens are predicted at a time), but that’s neither demonstrated nor published as far as I can tell (given my limited knowledge of this).
Not sure how to parse this. My point is that the number of training steps remains the same and training efficiency doesn’t significantly increase; there’s even a slight overhead from adding the predict-the-token-after-next blocks of parameters. This is described in Section 2.2 of the paper. You get better speculative decoding at inference time (and also better quality of output), but training takes the same number of steps, not 2x or 4x fewer.
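A minimal sketch of that point, with made-up batch and sequence numbers: for a fixed token budget the step count is identical with or without MTP; MTP only adds extra loss terms (and a bit of compute) inside each step.

```python
# Made-up numbers, purely to illustrate the step-count argument.
tokens_total = 1e13      # hypothetical training-token budget
batch_size   = 2048      # sequences per optimizer step (made up)
seq_len      = 4096      # tokens per sequence (made up)

steps = tokens_total / (batch_size * seq_len)
print(f"optimizer steps, with or without MTP: {steps:.2e}")  # identical either way

# Schematically, MTP only changes the per-step objective to something like
#   loss = next_token_loss + lambda_mtp * extra_prediction_losses
# i.e. a bit more work per step, not fewer steps for the same tokens.
```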
32B active parameters instead of likely ~220-280B for GPT-4 ⇒ 6.8-8.7x lower training cost per token.
It’s 37B active parameters, not 32B.
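For what it’s worth, here is the division both ways, taking the quoted ~220-280B GPT-4 estimate at face value (that figure is itself only a rumor):

```python
# Ratio of the rumored GPT-4 active-parameter estimate to the DeepSeek-V3 figure.
for gpt4_active_B in (220, 280):
    print(f"{gpt4_active_B}B / 32B = {gpt4_active_B / 32:.2f}x")  # 6.88x / 8.75x (the quoted 6.8-8.7x)
    print(f"{gpt4_active_B}B / 37B = {gpt4_active_B / 37:.2f}x")  # 5.95x / 7.57x with 37B
```

So with 37B the range is closer to ~6-7.6x than 6.8-8.7x, though given how rough the GPT-4 estimate is, this doesn’t change the overall picture much.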