Chatbot Arena results for DeepSeek-V3 are in. It placed 7th in Overall w/ Style Control, tied with Claude-3.5.Oct-Sonnet, and 3rd in Hard Prompts w/ Style Control, tied with Gemini-2.0-Flash and behind only Claude-3.5.Oct-Sonnet, mysterious Gemini-Exp-1206, o1, and Gemini-2.0-Flash-Thinking.
It’s a MoE model with 37B active parameters trained for about 5e24 FLOPs, 10x less compute than Llama-3-405B and 20x less than what could plausibly be extracted from 30K H100s in BF16. The pretraining data is about 15T tokens, so at roughly 400 tokens per active parameter it’s very overtrained, far past the compute optimal ratio.
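A rough back-of-the-envelope version of that arithmetic, using the standard C ≈ 6·N·D approximation with N as active parameters; the 6ND estimate is crude for a MoE model, the Llama-3 token count is the publicly reported figure, and the ~20 tokens per parameter "compute optimal" ratio is just the usual Chinchilla-style rule of thumb, so treat the output as order-of-magnitude only:

```python
# Back-of-the-envelope arithmetic for the compute and overtraining claims.
# C ~= 6 * N * D with N = active parameters, D = training tokens; this is
# crude for a MoE model, so order-of-magnitude only.

def train_flops(active_params: float, tokens: float) -> float:
    """Approximate pretraining compute as 6 * N * D."""
    return 6 * active_params * tokens

deepseek_v3 = train_flops(37e9, 15e12)     # ~3e24 FLOPs, same ballpark as ~5e24
llama3_405b = train_flops(405e9, 15.6e12)  # ~4e25 FLOPs (reported ~15.6T tokens)

tokens_per_active_param = 15e12 / 37e9     # ~405 tokens per active parameter
overtraining = tokens_per_active_param / 20  # vs ~20 tokens/param rule of thumb

print(f"Llama-3-405B / DeepSeek-V3 compute: ~{llama3_405b / deepseek_v3:.0f}x")
print(f"Tokens per active parameter: ~{tokens_per_active_param:.0f} "
      f"(~{overtraining:.0f}x the ~20 tokens/param rule of thumb)")
```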
It has 256 routed experts per layer, 8 of which get activated per token. These results give some weight to the Feb 2024 paper that predicts that using more granular experts and activating many of them per token can give shocking compute multipliers[1], up to 20x-30x, much more than for MoE transformers that only activate 1-2 routed experts per token (Figure 1b). The paper itself only runs experiments at up to about 5e19 FLOPs; in particular, it directly demonstrates a compute multiplier of 2x from using 8 experts per token instead of 2, with the numbers of total and active parameters kept the same (Figure 5b). The rest is extrapolation from fitted scaling laws.
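A minimal sketch of the routing pattern being discussed, many small routed experts with a large top-k, written as a toy PyTorch layer. This is an illustration under simplifying assumptions (a plain top-k softmax router, no load balancing), not DeepSeek-V3's actual implementation, which also uses shared experts, sigmoid gating, and auxiliary-loss-free load balancing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFineGrainedMoE(nn.Module):
    """Toy top-k MoE layer: many narrow routed experts, several active per token.
    Illustrative only; real implementations add shared experts, load balancing,
    and expert-parallel dispatch."""

    def __init__(self, d_model=512, n_experts=256, top_k=8, d_expert=128):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # "Fine-grained" = many experts, each a small FFN.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: [tokens, d_model]
        scores = self.router(x)                 # [tokens, n_experts]
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)        # normalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]              # chosen expert per token, this slot
            w = top_w[:, slot].unsqueeze(-1)    # gating weight per token
            for e in idx.unique():
                mask = idx == e
                out[mask] += w[mask] * self.experts[int(e)](x[mask])
        return out

# Each token is processed by 8 of the 256 experts, keeping active params small.
layer = ToyFineGrainedMoE()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```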
A new architecture has a compute multiplier M (at a given level of compute) if it would take M times more compute to train a compute optimal model with a reference architecture (in this case, a dense transformer) to match the perplexity it achieves when trained on data sampled from the same dataset.
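Written out, with $C_{\text{arch}}(L)$ denoting the compute needed to train a compute optimal model of a given architecture to loss (perplexity) $L$ on the same data distribution (my notation, not the paper's):

$$M = \frac{C_{\text{dense}}(L)}{C_{\text{new}}(L)}$$

So a 20x multiplier would mean the dense reference architecture needs 20x the compute to match the perplexity the new architecture reaches.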