I’m not sure how to square those results with the Chinchilla paper though
Apples and oranges. The Chinchilla paper simply optimizes the final trained model’s loss given a fixed compute budget. It doesn’t say anything about any downstream uses, just as it doesn’t tell you (directly) how to allocate your compute if you have X GPUs and want to serve a model to your users for Y requests: there you face a tradeoff between spending more compute at training time to over-train a smaller model, which then needs fewer GPUs to serve those Y requests, and training the compute-optimal (larger) model, which reaches a given loss more cheaply but costs more to run. Likewise, you’ve probably seen some “overtraining” analyses which argue that you should overtrain a Chinchilla-optimal model by some large factor Z to get the model which best balances training cost against serving cost; but those also answer a different question, because they assume you will deploy that model without any sparsification or reduced precision, which is hardly what anyone actually does.
(While no one has done a Li et al-style analysis for MoEs that I know of, I would expect the results to be fairly similar, just shifted up or down, because you can often think of a MoE as a bunch of smaller dense models.)
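To make that train-vs-serve tradeoff concrete, here is a minimal toy sketch (my own illustration, not anything from the Chinchilla paper or the comment above): it grid-searches model size under a joint FLOP budget, using the published Chinchilla parametric loss fit (constants roughly E≈1.69, A≈406.4, B≈410.7, α≈0.34, β≈0.28 from Hoffmann et al. 2022), the standard ~6·N·D approximation for training FLOPs, and ~2·N FLOPs per inference token. The budget and token counts are made-up numbers purely for illustration.

```python
import numpy as np

# Chinchilla parametric loss L(N, D) = E + A/N^alpha + B/D^beta
# (constants roughly as published in Hoffmann et al. 2022, Approach 3).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def best_model(total_flops, inference_tokens=0.0):
    """Grid-search the parameter count N that minimizes loss when the budget
    must cover training (~6*N*D FLOPs) plus serving `inference_tokens` tokens
    (~2*N FLOPs each). inference_tokens=0 recovers the plain Chinchilla setup."""
    Ns = np.logspace(8, 12, 2000)                 # 100M .. 1T parameters
    flops_left = total_flops - 2 * Ns * inference_tokens
    Ds = flops_left / (6 * Ns)                    # training tokens we can still afford
    feasible = Ds > 0
    losses = np.where(feasible, loss(Ns, np.maximum(Ds, 1.0)), np.inf)
    i = int(np.argmin(losses))
    return Ns[i], Ds[i], losses[i]

C = 1e24  # total FLOP budget (training + serving); arbitrary example value

for T in [0.0, 1e12, 1e13]:  # no serving vs. serving 1T vs. 10T tokens
    N, D, L = best_model(C, inference_tokens=T)
    print(f"serve {T:.0e} tokens: N={N:.2e} params, D={D:.2e} tokens, loss={L:.3f}")
```

Once the budget has to pay for serving tokens, each parameter costs you both at training time and on every request, so the optimum slides toward a smaller model trained on more tokens per parameter (i.e., overtrained relative to the Chinchilla ratio); that is the question the overtraining analyses try to answer, and it is distinct from what the Chinchilla fit itself optimizes.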