I broadly agree with your first point, that inference can be made more efficient, though we may have different views on how much.
Of course, both inference and training become more efficient, and I’m not sure whether the ratio between them is changing over time.
As I mentioned, there are also reasons why inference could become more expensive than the numbers I gave suggest. Given this uncertainty, my median guess is that the cost of inference will continue to exceed the cost of training (averaged across the whole economy).
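For concreteness, here is a back-of-the-envelope sketch of that ratio; the FLOPs-per-token rules of thumb and the model/data sizes below are illustrative assumptions, not numbers from this exchange:

```python
# Back-of-the-envelope: when does cumulative inference compute overtake training compute?
# Rule-of-thumb costs (assumptions): a forward pass is roughly 2*N FLOPs per token,
# a training step roughly 6*N FLOPs per token.
N = 175e9        # parameters (GPT-3-scale, purely illustrative)
D_train = 300e9  # training tokens (purely illustrative)

train_flops = 6 * N * D_train        # total training compute, ~3e23 FLOPs
infer_flops_per_token = 2 * N        # ~3.5e11 FLOPs per generated token

breakeven_tokens = train_flops / infer_flops_per_token
print(f"inference overtakes training after ~{breakeven_tokens:.1e} served tokens "
      f"(= {breakeven_tokens / D_train:.0f}x the training set)")
```

On these assumptions inference only overtakes training once roughly 3x the training corpus has been served, so the economy-wide answer hinges on how many tokens deployed models actually generate.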
I don’t think sparse (mixture-of-experts) models are an example of lowering inference cost. They mostly help with training. In fact, they need so many more parameters that it’s often worth distilling them into a dense model after training. The benefit of the sparse MoE architecture seems to be faster, parallelizable training, not lower inference cost (same link).
Distillation seems to be the main source of cheaper inference then. How much does it help? I’m not sure in general but e.g. in the Switch Transformer paper (same link again), distilling into a 5x smaller model means losing most of the performance gained by using the larger model. Perhaps that’s why, as of May 2021, the OpenAI API does not seem to have a model that is nearly as good as the large GPT-3 but cheaper. (Unless the large GPT-3 is no longer available and has been replaced with something cheaper but equally good.)
(An additional source of cheaper inference, by the way, is low-precision hardware (https://dl.acm.org/doi/pdf/10.1145/3079856.3080246).)
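As a toy illustration of the low-precision point, a minimal sketch of symmetric int8 weight quantization in numpy (nothing here is specific to the TPU paper; it just shows the memory factor and the kind of error introduced):

```python
import numpy as np

# Toy symmetric int8 quantization of a weight matrix: 4x less memory than fp32,
# and on hardware with int8 units the matmul itself is correspondingly cheaper.
rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024)).astype(np.float32)

scale = np.abs(W).max() / 127.0
W_int8 = np.round(W / scale).astype(np.int8)      # stored/shipped weights
W_dequant = W_int8.astype(np.float32) * scale     # what inference effectively uses

x = rng.normal(size=(1024,)).astype(np.float32)
err = np.abs(W @ x - W_dequant @ x).mean() / np.abs(W @ x).mean()
print(f"memory: {W.nbytes / 1e6:.1f} MB -> {W_int8.nbytes / 1e6:.1f} MB, "
      f"relative activation error ~{err:.3%}")
```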
I don’t think sparse (mixture-of-experts) models are an example of lowering inference cost. They mostly help with training.
No, they don’t. The primary justification for introducing them in the first place was to make a cheaper forward pass (=inference). They’re generally more challenging to train because of the discrete gating, imbalanced experts, and sheer size—the Switch paper discusses the problems, and even the original Shazeer MoE emphasizes all of the challenges in training a MoE compared to a small dense model. Now, if you solve those problems (as Switch does), then yes, the cheaper inference would also make cheaper training (as long as you don’t have to do too much more training to compensate for the remaining problems), and that is an additional justification for Switch. But the primary motivation for researching MoE NMT etc has always been that it’d be a lot more economical to deploy at scale after training.
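For intuition about why the forward pass is cheap, a minimal top-1 routing sketch in numpy; this is a toy, not the Shazeer/Switch implementation (a real MoE layer also scales the expert output by the gate probability and adds load-balancing losses):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, n_tokens = 64, 256, 8, 16

W_gate = rng.normal(size=(d_model, n_experts))
# One (W_in, W_out) feed-forward pair per expert; total params grow with n_experts.
experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
           for _ in range(n_experts)]

def moe_forward(x):
    """x: (n_tokens, d_model). Each token is routed to a single expert (top-1)."""
    expert_idx = (x @ W_gate).argmax(axis=-1)
    y = np.zeros_like(x)
    for i, tok in enumerate(x):
        W_in, W_out = experts[expert_idx[i]]
        y[i] = np.maximum(tok @ W_in, 0) @ W_out   # only ONE expert's FLOPs per token
    return y

out = moe_forward(rng.normal(size=(n_tokens, d_model)))
print(out.shape)
# Per-token compute is one expert's FFN regardless of n_experts, while the
# parameter count scales with n_experts -- hence the cheap forward pass.
```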
I’m not sure in general but e.g. in the Switch Transformer paper (same link again), distilling into a 5x smaller model means losing most of the performance gained by using the larger model.
Those results are sparse->dense, so they are not necessarily relevant (I would be thinking more of applying distillation to the original MoE and distilling each expert; the MoE is what you want for deployment at scale anyway, that’s the point!). But the glass is half-full: they also report that you can throw away 99% of the model, and still get a third of the boost over the baseline small model. Like I said, the most reliable way to a small powerful model is through a big slow model.
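For reference, a minimal sketch of the usual Hinton-style soft-target distillation loss (the Switch paper’s exact recipe may differ; the temperature and mixing weight here are arbitrary illustrative values):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target distillation: KL to the teacher's temperature-softened
    distribution, plus ordinary cross-entropy on the hard labels."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T))
    kl = (p_teacher * (np.log(p_teacher) - log_p_student_T)).sum(axis=-1).mean()
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

rng = np.random.default_rng(0)
print(distill_loss(rng.normal(size=(4, 10)), rng.normal(size=(4, 10)),
                   labels=np.array([1, 3, 5, 7])))
```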
Unless the large GPT-3 is no longer available and has been replaced with something cheaper but equally good.
Yeah, we don’t know what’s going on there. They’ve mentioned further finetuning of the models, but no details. They decline to specify even what the parameter counts are, hence EAI needing to reverse-engineer guesses from their benchmarks. (Perhaps the small models are now distilled models? At least early on, people were quite contemptuous of the small models, but these days people find they can be quite handy. Did we just underrate them initially, or did they actually get better?) They have an ‘instruction’ series they’ve never explained (probably something like T0/FLAN?). Paul’s estimate of TFLOPS cost vs API billing suggests that compute is not a major priority for them cost-wise, and I can say that whenever I hear OAers talk about bottlenecks, they’re usually complaining about a lack of people, which dabbling in distillation/sparsification wouldn’t help much with. Plus, of course, OA’s public research output seems to be low since the API launched, which makes you wonder what they all spend their time doing. The API hasn’t changed all that much that I’ve noticed, and after this much time you’d think the sysadmin/SRE stuff would be fairly routine and handling itself. So… yeah, I dunno what’s going on behind the API, and wouldn’t treat it as evidence either way.
No, they don’t. The primary justification for introducing them in the first place was to make a cheaper forward pass (=inference)
The motivation to make inference cheaper doesn’t seem to be mentioned in either the Switch Transformer paper or the original Shazeer paper. They do mention improving training cost, training time (from being much easier to parallelize), and peak accuracy. Whatever the true motivation may be, it doesn’t seem that MoEs change the ratio of training to inference cost, except insofar as they’re currently finicky to train.
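To spell out why: to first order, per-token training cost and per-token inference cost both scale with the parameters that are active for that token, so the architecture cancels out of the ratio. A crude sketch (the multipliers and sizes are rule-of-thumb assumptions, not numbers from either paper):

```python
# Per-token cost is set by the parameters actually used on that token ("active"
# params), for dense and sparse models alike. Rule-of-thumb assumptions: forward
# ~2 FLOPs per active weight, backward ~2x the forward pass.
def per_token_flops(active_params):
    forward = 2 * active_params
    train_step = forward + 2 * forward   # forward + backward
    return forward, train_step

for name, active in [("dense, 1.3B params", 1.3e9),
                     ("MoE, 64 experts, ~1.3B params active per token", 1.3e9)]:
    fwd, train = per_token_flops(active)
    print(f"{name}: {fwd:.1e} FLOPs/token inference, "
          f"{train:.1e} FLOPs/token training, ratio {train / fwd:.1f}")
# Both ratios come out at 3.0: sparsity moves inference and training cost together,
# so the training:inference ratio is unchanged; only the training finickiness differs.
```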
But the glass is half-full: they also report that you can throw away 99% of the model, and still get a third of the boost over the baseline small model.
Only if you switch to a dense model, which again doesn’t save you that much inference compute.
But as you said, they should instead distill into an MoE with smaller experts. It’s still unclear to me how much inference cost this could save, and at what loss of accuracy.
Either way, distilling would make it harder to further improve the model, so you lose one of the key benefits of silicon-based intelligence (the high serial speed which lets your model do a lot of ‘thinking’ in a short wallclock time).
Paul’s estimate of TFLOPS cost vs API billing suggests that compute is not a major priority for them cost-wise
Fair, that seems like the most plausible explanation.
The motivation to make inference cheaper doesn’t seem to be mentioned in either the Switch Transformer paper or the original Shazeer paper. They do mention improving training cost, training time (from being much easier to parallelize), and peak accuracy.
I’m not sure what you mean. They refer all over the place to greater computational efficiency and the benefits of constant compute cost even as one scales up experts. And this was front and center in the original MoE paper, which emphasized the cheapness of the forward pass and positioned it as an improvement on the GNMT RNN that Google Translate had rolled out a year or so before (including benchmarking on the actual internal Google Translate datasets), and which was probably a major TPUv1 user (judging from the % of RNN workload reported in the TPU paper). Training costs are important, of course, but a user like Google Translate, the customer of the MoE work, cares more about the deployment costs because they want to serve literally billions of users, while the training doesn’t happen so often.
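The ‘constant compute cost as one scales up experts’ point, in arithmetic form (all sizes are made-up illustrative values, not figures from the MoE or Switch papers):

```python
# How total parameters vs per-token FLOPs behave as experts are added to one
# MoE layer with top-1 routing (illustrative sizes only).
d_model, d_ff = 4096, 16384
expert_params = 2 * d_model * d_ff          # W_in + W_out of one expert FFN

for n_experts in (1, 8, 64, 512):
    total_params = n_experts * expert_params
    active_params = 1 * expert_params       # top-1 routing: one expert per token
    print(f"{n_experts:4d} experts: {total_params / 1e9:6.2f}B layer params, "
          f"{2 * active_params / 1e6:.0f}M forward FLOPs/token for this layer")
# Parameters (capacity) grow ~linearly with the number of experts; the forward-pass
# cost per token stays flat -- which is why deployment at Google Translate scale cared.
```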