For scaling to larger training systems, the trend in optimal tokens per parameter is probably increasing, since larger datasets have lower quality, and repetition in training will soon become necessary, lowering quality per trained-on token. Also, MoE is a large compute multiplier (3x-6x, Figure 11 in the above MoE scaling paper), so it's not going to be ignored if at all possible. There are other studies that show a decreasing trend, but that probably won't hold up in practice as we get to 250T and then 750T tokens within a few years, even for a dense model.
For 1:32 MoE at 5e28 FLOPs (the 5 GW, $150bn training systems of 2028), we get maybe 700 tokens/param optimal (counting the effects of sparsity, repetition, and more compute), so that's 3.5T active and 110T total params trained on 2.5e15 tokens (maybe 80T unique tokens repeated 30 times). Not sure whether that many total params can be made to work.
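As a sanity check on these numbers, here's a minimal back-of-the-envelope sketch in Python, assuming the standard C ≈ 6·N_active·D approximation for training FLOPs and taking the 700 tokens/param ratio and 1:32 sparsity as given (the 80T unique-token figure is likewise just the assumption from above):

```python
# Back-of-the-envelope check, assuming C ~ 6 * N_active * D for training FLOPs.

C = 5e28                 # training compute, FLOPs
tokens_per_param = 700   # assumed compute-optimal tokens per active param at this scale
sparsity = 32            # 1:32 MoE, so total params = 32 * active params
unique_tokens = 80e12    # assumed unique-token budget (80T)

# From C = 6 * N_active * D and D = tokens_per_param * N_active:
#   N_active = sqrt(C / (6 * tokens_per_param))
n_active = (C / (6 * tokens_per_param)) ** 0.5
n_total = sparsity * n_active
tokens = tokens_per_param * n_active

print(f"active params: {n_active:.2e}")                    # ~3.5e12 (3.5T)
print(f"total params:  {n_total:.2e}")                     # ~1.1e14 (110T)
print(f"tokens:        {tokens:.2e}")                      # ~2.4e15
print(f"repetitions of unique data: {tokens / unique_tokens:.0f}")  # ~30
```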