> 32B active parameters instead of likely ~220B for GPT4 ⇒ 6.8x lower training … cost
Doesn’t follow: training cost also scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than the original GPT-4.
Each of the points above is a relative comparison with more or less everything else kept constant. In this bullet point, by “training cost”, I mostly had in mind “training cost per token”:
32B active parameters instead of likely ~~220B~~ ~280B for GPT4 ⇒ ~~6.8x~~ 8.7x lower training cost per token.
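To make the arithmetic explicit, here is a back-of-envelope sketch using the common approximation of ~6 FLOPs per active parameter per training token; GPT-4's parameter count is a rumor, and the token ratios just reuse the 1.5x-2x estimate above:

```python
# Back-of-envelope only: rule of thumb is training compute ≈ 6 FLOPs
# per active parameter per training token. GPT-4's counts are rumors.

def flops_per_token(active_params):
    return 6 * active_params

B = 1e9  # billion

# Cost per token relative to DeepSeek-V3's active parameters:
print(flops_per_token(220 * B) / flops_per_token(32 * B))  # 6.875 -> the ~6.8x
print(flops_per_token(280 * B) / flops_per_token(32 * B))  # 8.75  -> the ~8.7x
print(flops_per_token(280 * B) / flops_per_token(37 * B))  # ~7.6 with 37B active

# Total cost also scales with token count; with 1.5x-2x more training tokens
# the gap shrinks but does not disappear:
for token_ratio in (1.5, 2.0):
    print(280 / (37 * token_ratio))  # ~5.0 and ~3.8
```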
> If this wasn’t an issue, why not 8B active parameters, or 1M active parameters?
From what I remember, the training-compute optimal number of experts was like 64, given implementations from a few years ago (I don’t remember how many were activated at a time in that old paper). Given newer implementations and aiming for inference-compute optimality, it seems logical that more than 64 experts could be great.
> You still train on every token.
Right, that’s why I wrote: “possibly 4x fewer training steps for the same number of tokens if predicting tokens only once” (assuming predicting 4 tokens at a time), but that’s not demonstrated nor published (given my limited knowledge on this).
> From what I remember, the training-compute optimal number of experts was like 64
I think it only gets better with more experts if you keep the number of active parameters unchanged. Is there some setting where it gets worse after a while? There certainly are engineering difficulties and diminishing returns.
Also, the number of activated experts can vary (there are 8 activated routed experts in DeepSeek-V3 out of the total of 256), so “number of experts” doesn’t really capture the ratio of total to activated, probably not a good anchor by itself.
> Given newer implementations and aiming for inference-compute optimality, it seems logical that more than 64 experts could be great.
This still doesn’t help with the question of why 37B active parameters is sufficient. Even with 100500 experts you can’t expect 1B active parameters to be sufficient to maintain GPT-4 quality. The rumor for original GPT-4 is that it has 2 activated experts out of 16 in total, so the ratio is 1:8, while for DeepSeek-V3 it’s 1:32.
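To make these ratios concrete, a small sketch; GPT-4's expert counts are rumored, while DeepSeek-V3's routed-expert and parameter counts are from its technical report:

```python
# The ratios quoted above; GPT-4's expert counts are rumored.
models = {
    "GPT-4 (rumored)": (2, 16),   # (activated routed experts, total routed experts)
    "DeepSeek-V3": (8, 256),
}

for name, (active, total) in models.items():
    print(f"{name}: {active}/{total} routed experts active per token -> 1:{total // active}")

# The parameter-level ratio is smaller than the expert-level one, because
# attention, the shared expert and embeddings are always active:
print(f"DeepSeek-V3 parameters: 37B active of 671B total -> 1:{671 / 37:.1f}")
```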
> that’s why I wrote: “possibly 4x fewer training steps for the same number of tokens if predicting tokens only once” (assuming predicting 4 tokens at a time), but that’s not demonstrated nor published (given my limited knowledge on this)
Not sure how to parse this; my point is that the number of training steps remains the same, training efficiency doesn’t significantly increase, and there’s even a slight overhead from adding the predict-the-token-after-next blocks of parameters. This is described in Section 2.2 of the paper. You get better speculative decoding at inference time (and also better quality of output), but training time is the same, not 2x or 4x fewer steps.
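To spell out the step-count arithmetic, a tiny sketch under assumed (not the paper's) batch and sequence-length values:

```python
# Illustrative figures only (the batch size below is hypothetical, not the
# paper's schedule). The number of optimizer steps is set by the token budget;
# the MTP module adds an extra loss for the token after next at each position,
# it does not let you skip positions, so the step count is unchanged.

def optimizer_steps(total_tokens, batch_size, seq_len):
    return total_tokens // (batch_size * seq_len)

total_tokens = 14.8e12            # DeepSeek-V3's reported training tokens
batch_size, seq_len = 4608, 4096  # hypothetical

steps_plain = optimizer_steps(total_tokens, batch_size, seq_len)
steps_with_mtp = optimizer_steps(total_tokens, batch_size, seq_len)  # same budget
print(steps_plain == steps_with_mtp)  # True
# What changes is the cost per step (one extra, relatively small MTP block),
# not the number of steps, so there is no 2x or 4x reduction in training steps.
```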
> 32B active parameters instead of likely ~~220B~~ ~280B for GPT4 ⇒ ~~6.8x~~ 8.7x lower training cost per token.
Thanks for your corrections, that’s welcome.
It’s 37B active parameters, not 32B.