DeepSeek-V3 might be the only example (and it’s from the future, released after I asked the question). I’m not sure it generalizes into expecting more FP8 training: it’s a MoE model with 257 experts that uses relatively small 7K×2K matrices in its experts, while GPT-3-175B, tested in FP8 in the Sep 2022 paper, has much larger matrices, and that result wasn’t sufficient to drive widespread adoption (at least where adoption is possible to observe).
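For a sense of scale, here’s a rough comparison of the individual matrix shapes; the 7168×2048 expert shape is the 7K×2K figure above, while the GPT-3-175B FFN shape assumes its published d_model of 12288 with the standard 4x feedforward width (an assumption on my part, not something stated in this thread):

```python
# Rough per-matrix size comparison behind the "much larger matrices" point.
# DeepSeek-V3 expert matrices: 7168 x 2048 (the 7Kx2K figure above).
# GPT-3-175B FFN matrices: 12288 x 49152, assuming d_model = 12288 with the
# standard 4x feedforward width (an assumption, not stated in this thread).
dsv3_expert_params = 7168 * 2048    # ~1.5e7 parameters per expert matrix
gpt3_ffn_params = 12288 * 49152     # ~6.0e8 parameters per FFN matrix
print(f"GPT-3 FFN matrix is ~{gpt3_ffn_params / dsv3_expert_params:.0f}x larger")  # ~41x
```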
On the other hand, if DeepSeek-V3 really is as good for its compute (4e24-6e24 FLOPs) as the benchmarks indicate, it might motivate more training with a huge number of smaller experts (it activates 8 experts per token, so the number of experts is even higher than one would expect from its ratio of total to active parameters). A Feb 2024 paper claims 20x or higher compute multipliers for MoE models compared to dense (Figure 1b), but only when many experts are activated per token: it predicts 64 to be optimal at 1e24-1e25 FLOPs (the usual practice is to activate 2). So DeepSeek-V3 weakly supports this surprising claim, though actual experimental results at more compute than that paper’s 3e19-4e20 FLOPs per datapoint would be better.

The paper also predicts a reduction in tokens per parameter with more compute (Table 2), reaching 8 tokens per active parameter at 5e25 FLOPs (in a MoE model with 4096 experts, 64 of which are activated per token). If this too is somehow correct, natural text data could be sufficient for about 10 times more compute than with dense models.
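To unpack that Table 2 prediction, here’s a minimal back-of-the-envelope sketch, assuming the standard C ≈ 6·N·D approximation for training FLOPs (N active parameters, D training tokens); the approximation and the implied model/data sizes are my assumptions, not figures from the paper:

```python
import math

# Back-of-the-envelope sketch of the Table 2 prediction, assuming the
# standard C ~= 6 * N * D estimate of training FLOPs (N = active parameters,
# D = training tokens). The 5e25 FLOPs and 8 tokens/active-parameter figures
# are from the comment above; the 6*N*D approximation is an assumption.
C = 5e25               # training compute, FLOPs
tokens_per_param = 8   # predicted compute-optimal ratio at this compute

# With D = tokens_per_param * N and C = 6 * N * D:
#   C = 6 * tokens_per_param * N**2  =>  N = sqrt(C / (6 * tokens_per_param))
N = math.sqrt(C / (6 * tokens_per_param))
D = tokens_per_param * N
print(f"implied active parameters: {N:.2e}")  # ~1.0e12
print(f"implied training tokens:   {D:.2e}")  # ~8.2e12
```

Under that approximation the prediction amounts to roughly a trillion active parameters trained on about 8T tokens, i.e. noticeably less data per unit of compute than a compute-optimal dense run would use, which is the mechanism behind the data-sufficiency point.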
This makes sense; I think you could be right. Llama 4 should give us more evidence on numerical precision and the scaling of experts.