Chinchilla’s optimal ratio of 20 tokens/param (at 6e23 FLOPs) changes significantly with different datasets, architectures, or amounts of compute. For Llama-3-405B, it’s 37 tokens/param at 4e25 FLOPs, increasing about 1.5x for every 1000x of compute (Figure 3). When training on data repeated 60 times, optimal tokens/param increases about 2.5x (Figure 3).
For MoE models with 87% (1:8) sparsity, optimal tokens/param increases about 3x, and at 97% (1:32) sparsity about 6x (Figure 12, left). This suggests that if Llama-3-405B were instead a MoE model with 97% sparsity, its optimum would be around 220 tokens/param rather than 37.
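A rough back-of-envelope for how these multipliers combine. The 37 tokens/param at 4e25 FLOPs anchor, the 1.5x-per-1000x compute trend, and the sparsity/repetition multipliers are the figures quoted above; the power-law interpolation between them is my own simplifying assumption, not a fitted scaling law:

```python
import math

def optimal_tokens_per_param(compute_flops, sparsity_mult=1.0, repetition_mult=1.0,
                             anchor_ratio=37.0, anchor_flops=4e25):
    """Sketch: scale the Llama-3 anchor (~37 tokens/param at 4e25 FLOPs)
    by ~1.5x per 1000x of compute, then apply the MoE-sparsity and
    data-repetition multipliers quoted above. Illustrative only."""
    compute_exponent = math.log(1.5) / math.log(1000)   # ~0.059
    compute_mult = (compute_flops / anchor_flops) ** compute_exponent
    return anchor_ratio * compute_mult * sparsity_mult * repetition_mult

# Dense Llama-3-405B at 4e25 FLOPs -> ~37 tokens/param
print(optimal_tokens_per_param(4e25))
# Same budget as a hypothetical 1:32-sparse MoE -> ~6x, i.e. ~220 tokens/param
print(optimal_tokens_per_param(4e25, sparsity_mult=6.0))
```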
Overtraining or undertraining is the use of a suboptimal tokens/param ratio. The effect is not that large: the rule of thumb is that the compute multiplier penalty is the degree of overtraining raised to the power 1/3. So 30x overtraining (using 600 tokens/param instead of 20 tokens/param) results in the same penalty as training a compute optimal model with 3x less compute, and 10x overtraining (or undertraining) corresponds to using 2x less compute (which can be compensated by using 2x more compute instead, in order to maintain the same performance).
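As a sanity check on that cube-root rule of thumb (my own restatement of it, treating over- and undertraining symmetrically):

```python
def overtraining_penalty(actual_ratio, optimal_ratio):
    """Compute-multiplier penalty ~ (degree of over/undertraining)^(1/3),
    per the rule of thumb above."""
    degree = max(actual_ratio / optimal_ratio, optimal_ratio / actual_ratio)
    return degree ** (1 / 3)

print(overtraining_penalty(600, 20))  # 30x overtrained -> ~3.1x penalty
print(overtraining_penalty(200, 20))  # 10x overtrained -> ~2.2x penalty
```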
This curiously suggests that the original GPT-4 was also undertrained, similarly to GPT-3. The rumored compute is 2e25 FLOPs, and the rumored architecture is a 1.8T total parameter MoE with 2:16 sparsity, so 220B params in active experts plus, say, another 40B of non-expert params, for a total of 260B active. This gives 13T tokens, or 50 tokens/param. If the dataset has Llama-3's 37 tokens/param optimum for a dense model at 2e25 FLOPs, then with 1:8 sparsity the optimal ratio would be about 110 tokens/param, so at 50 tokens/param it's undertrained about 2x. The effect of this is losing 1.3x in effective compute, not a whole lot but more than nothing.
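Spelling out the same arithmetic (all GPT-4 figures here are the rumored numbers from the comment, not confirmed):

```python
# Rumored GPT-4 figures (unconfirmed), as used in the estimate above.
compute = 2e25                  # FLOPs
active_params = 220e9 + 40e9    # 2-of-16 experts + non-expert params ~= 260B

tokens = compute / (6 * active_params)      # C ~= 6*N*D  ->  ~13T tokens
ratio = tokens / active_params              # ~50 tokens/param
optimal_ratio = 37 * 3                      # dense optimum x ~3 for 1:8 sparsity
undertraining = optimal_ratio / ratio       # ~2.2x undertrained
penalty = undertraining ** (1 / 3)          # ~1.3x effective compute lost
print(f"{tokens:.2e} tokens, {ratio:.0f} tokens/param, "
      f"undertrained {undertraining:.1f}x, penalty {penalty:.2f}x")
```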
Wonderful to get more numbers on this!
These examples seem to contradict note 2 where D/N falls for larger C. Now I’m not sure what the trend should be.
It feels like you could derive a rule of thumb based on the loss and the entropy of the dataset, e.g. “If my model starts at a loss of 4 bits/token and the asymptote is 2 bits/token, I need X tokens of data to fully specify a model with Y bits stored in the parameters.”
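A hypothetical back-of-envelope in the spirit of that rule of thumb, under the naive assumption that each trained-on token can contribute at most the initial-minus-asymptote loss gap in bits toward the Y bits stored in the parameters (that assumption is mine, not something derived here):

```python
# Hypothetical sketch: tokens needed ~ (bits stored in params) / (usable bits per token).
def tokens_to_specify(param_bits, initial_loss_bits, asymptote_bits):
    usable_bits_per_token = initial_loss_bits - asymptote_bits
    return param_bits / usable_bits_per_token

# e.g. a model storing ~2 bits/param across 1e9 params,
# starting at 4 bits/token with a 2 bits/token asymptote:
print(f"{tokens_to_specify(2 * 1e9, 4, 2):.2e} tokens")
```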
For scaling to larger training systems, the trend is probably increasing, since larger datasets have lower quality, and soon repetition in training will become necessary, lowering quality per trained-on token. Also, MoE is a large compute multiplier (3x-6x, Figure 11 in the above MoE scaling paper), so it's not going to be ignored if at all possible. There are other studies that show a decreasing trend, but this probably won't hold up in practice as we get to 250T and then 750T tokens within a few years, even for a dense model.
For a 1:32 MoE at 5e28 FLOPs (the 5 GW, $150bn training systems of 2028), we get maybe 700 tokens/param optimal (counting the effects of sparsity, repetition, and more compute), so that's 3.5T active and 110T total params trained on 2.5e15 tokens (maybe 80T unique tokens repeated 30 times). Not sure if this kind of total param count can be made to work.
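The arithmetic behind those numbers, under the assumptions stated in the comment (C ~= 6*N_active*D, ~700 tokens/param optimal, 1:32 sparsity):

```python
import math

compute = 5e28           # FLOPs, hypothetical 5 GW training system of ~2028
tokens_per_param = 700   # assumed optimal D/N_active after sparsity/repetition/compute effects
sparsity = 32            # 1:32 MoE, total params ~= 32x active

# C ~= 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * ratio))
active = math.sqrt(compute / (6 * tokens_per_param))   # ~3.5e12 active params
tokens = tokens_per_param * active                     # ~2.4e15 tokens
total = sparsity * active                              # ~1.1e14 total params
print(f"active {active:.2e}, total {total:.2e}, tokens {tokens:.2e}, "
      f"unique tokens if repeated 30x: {tokens / 30:.2e}")
```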