No, they don’t. The primary justification for introducing them in the first place was to make a cheaper forward pass (=inference).
The motivation to make inference cheaper doesn’t seem to be mentioned in either the Switch Transformer paper or the original Shazeer paper. They do mention improving training cost, training time (from being much easier to parallelize), and peak accuracy. Whatever the true motivation may be, it doesn’t seem that MoEs change the ratio of training cost to inference cost, except insofar as they’re currently finicky to train.
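To make the ratio point concrete, here is a rough back-of-the-envelope sketch (my own illustration: the ~6N and ~2N FLOPs-per-token approximations and all the numbers are assumptions, not figures from either paper). Both training and inference cost scale with the parameters that are active per token, so sparsity cuts both proportionally and leaves the ratio alone:

```python
# Rough FLOP approximations: ~6*N FLOPs per training token and ~2*N FLOPs per
# inference token, where N is the number of parameters *active* for that token.
# (Illustrative assumption, not a figure from the Switch Transformer or Shazeer papers.)

def train_flops(active_params, train_tokens):
    return 6 * active_params * train_tokens

def inference_flops(active_params, inference_tokens):
    return 2 * active_params * inference_tokens

# A sparse MoE lowers active_params relative to total params, but it does so in
# *both* formulas, so the training:inference ratio depends only on the token counts:
#   ratio = (6 * N * T_train) / (2 * N * T_infer) = 3 * T_train / T_infer
ratio = train_flops(1e9, 3e11) / inference_flops(1e9, 1e12)
print(ratio)  # 0.9 -- the same whether the 1e9 active params sit in a dense model or an MoE
```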
But the glass is half-full: they also report that you can throw away 99% of the model, and still get a third of the boost over the baseline small model.
Only if you switch to a dense model, which again doesn’t save you that much inference compute.
But as you said, they should instead distill into an MoE with smaller experts. It’s still unclear to me how much inference cost this could save, and at what loss of accuracy.
Either way, distilling would make it harder to further improve the model, so you lose one of the key benefits of silicon-based intelligence (the high serial speed which lets your model do a lot of ‘thinking’ in a short wallclock time).
Paul’s estimate of TFLOPS cost vs. API billing suggests that compute is not a major priority for them cost-wise.
Fair, that seems like the most plausible explanation.
The motivation to make inference cheaper doesn’t seem to be mentioned in either the Switch Transformer paper or the original Shazeer paper. They do mention improving training cost, training time (from being much easier to parallelize), and peak accuracy.
I’m not sure what you mean. They refer all over the place to greater computational efficiency and to the benefit of compute cost staying constant even as one scales up the number of experts. And this was front and center in the original MoE paper, which emphasized the cheapness of the forward pass and positioned it as an improvement on the GNMT RNN-based NMT system Google Translate had rolled out roughly a year earlier (it even benchmarks on the actual internal Google Translate datasets), a system which was probably a major TPUv1 user (judging from the share of RNN workloads reported in the TPU paper). Training costs are important, of course, but a user like Google Translate, the customer of the MoE work, cares more about deployment costs, because it wants to serve literally billions of users while training doesn’t happen all that often.
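To spell out the constant-compute point, here is a minimal top-k routing sketch (my own illustration: the sizes, gating details, and names are assumptions, and only each expert’s first linear layer is shown). Each token is processed by a fixed top-k subset of experts, so per-token FLOPs depend on k and the expert size, while total parameters keep growing with the expert count:

```python
import numpy as np

d_model, d_ff, n_experts, top_k = 512, 2048, 64, 2   # illustrative sizes, not from the papers

# One weight matrix per expert: total parameters grow linearly with n_experts.
experts = [np.random.randn(d_model, d_ff) * 0.02 for _ in range(n_experts)]
router = np.random.randn(d_model, n_experts) * 0.02

def moe_forward(x):                        # x: (d_model,) single token
    logits = x @ router                    # routing cost O(d_model * n_experts), negligible
    chosen = np.argsort(logits)[-top_k:]   # indices of the top-k experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                   # softmax over the chosen experts only
    # Only top_k expert matmuls execute: per-token FLOPs ~ 2 * top_k * d_model * d_ff,
    # independent of n_experts (only each expert's first linear layer is shown here).
    return sum(g * np.maximum(x @ experts[i], 0) for g, i in zip(gates, chosen))

y = moe_forward(np.random.randn(d_model))
print(y.shape)   # (2048,)
```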