You’re missing a lot of the hardware overhang arguments—for example, that DL models can be distilled, sparsified, and compressed to a tremendous degree. The most reliable way to a cheap fast small model is through an expensive slow big model.
Even in the OA API, people make heavy use of the smallest models like Ada, which is <1b parameters (estimated by EAI). The general strategy is to play around with Davinci (175b) until you get a feel for working with GPT-3, refine a prompt on it, and then once you’ve established a working prototype prompt, bring it down to Ada/Babbage/Curie, going as low as possible.
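To spell that workflow out, here is a rough sketch against the `openai` Python client's Completion endpoint (the prompt, test sentence, and "good enough" judgment are all placeholders; you refine the prompt on davinci first, then walk down the tiers):

```python
import openai

openai.api_key = "sk-..."  # your key

# Hypothetical prompt, refined on davinci first.
PROMPT = "Translate English to French.\n\nEnglish: {text}\nFrench:"

def complete(engine, text):
    """Run the same prototype prompt against a given engine tier."""
    resp = openai.Completion.create(
        engine=engine,
        prompt=PROMPT.format(text=text),
        max_tokens=64,
        temperature=0.0,
        stop=["\n"],
    )
    return resp.choices[0].text.strip()

# Walk down the tiers and keep the cheapest engine that still looks good enough.
for engine in ["davinci", "curie", "babbage", "ada"]:
    print(engine, "->", complete(engine, "The weather is nice today."))
```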
You can also do things like use the largest model to generate examples to finetune much smaller models on: “Unsupervised Neural Machine Translation with Generative Language Models Only”, Han et al 2021, is a very striking recent paper I’ve linked before about self-distillation, but in this case I would emphasize their findings about using the largest GPT-3 to teach the smaller GPT-3s much better translation skills. Or, MoEs implicitly save a ton of compute by taking shortcuts through cheap sub-models, which is why you see a lot of them these days.
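A toy sketch of that teacher-writes-the-training-set trick (hypothetical prompt and corpus; assumes the same `openai` Completion endpoint, with the resulting JSONL then used to finetune whichever smaller model you like):

```python
import json
import openai

openai.api_key = "sk-..."

def teacher_translate(sentence):
    """Ask the largest model (the 'teacher') for a translation we treat as a label."""
    resp = openai.Completion.create(
        engine="davinci",
        prompt=f"Translate English to German.\n\nEnglish: {sentence}\nGerman:",
        max_tokens=64,
        temperature=0.0,
        stop=["\n"],
    )
    return resp.choices[0].text.strip()

monolingual_corpus = ["The cat sat on the mat.", "It will rain tomorrow."]  # placeholder data

# Write the teacher's outputs as prompt/completion pairs; a much smaller student model
# is then finetuned on this file instead of on human-labelled parallel data.
with open("distilled_translations.jsonl", "w") as f:
    for sentence in monolingual_corpus:
        pair = {"prompt": f"English: {sentence}\nGerman:",
                "completion": " " + teacher_translate(sentence)}
        f.write(json.dumps(pair) + "\n")
```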
Of course, the future will bring further efficiency improvements. Indeed, the experience curves for AI are quite steep: https://openai.com/blog/ai-and-efficiency/ Once you can do something at all… (There was an era when AI Go masters cost more to run than human Go masters. It was a few months in mid-2016.)
More broadly, you’re missing all the possibilities of a ‘merely human-level’ AI. It can be parallelized, scaled up and down (both in instances and parameters), ultra-reliable, immortal, consistently improved by new training datasets, low-latency, ultimately amortizes to zero capital investment, and enables things which are simply impossible for humans: there is no human equivalent of ‘generating embeddings’ which can be plugged directly into other models and algorithms. Kaj Sotala’s old paper https://philpapers.org/archive/SOTAOA covers some of this, but it could stand updating with a DL-centric view of all the ways in which a model which achieves human-level performance on some task is far more desirable than an actual human, in much the same way that a car rate-limited to go only as fast as a horse is still more useful and valuable than a horse.
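(To make the embeddings point concrete, a toy sketch: `embed()` below is a hypothetical stand-in for whatever encoder model you have, and its vectors get fed straight into an off-the-shelf nearest-neighbor index, something with no human equivalent; with a real encoder the query would retrieve the password-reset document.)

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def embed(texts):
    """Hypothetical stand-in for a real encoder model; returns one vector per text."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 768))

docs = ["refund policy", "shipping times", "password reset"]
index = NearestNeighbors(n_neighbors=1).fit(embed(docs))

# The model's internal representation is reused directly by a completely different algorithm.
_, idx = index.kneighbors(embed(["how do I change my password?"]))
print(docs[int(idx[0][0])])
```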
I broadly agree with your first point, that inference can be made more efficient, though we may have different views on how much.
Of course, both inference and training become more efficient over time, and I’m not sure whether the ratio between them is changing.
As I mentioned, there are also reasons why inference could become more expensive than the numbers I gave suggest. Given this uncertainty, my median guess is that the cost of inference will continue to exceed the cost of training (averaged across the whole economy).
I don’t think sparse (mixture of expert) models are an example of lowering inference cost. They mostly help with training. In fact they need so many more parameters that it’s often worth distilling them into a dense model after training. The benefit of the sparse MoE architecture seems to be faster, parallelizable training, not lower inference cost (same link).
Distillation seems to be the main source of cheaper inference, then. How much does it help? I’m not sure in general but e.g. in the Switch Transformer paper (same link again), distilling into a 5x smaller model means losing most of the performance gained by using the larger model. Perhaps that’s why, as of May 2021, the OpenAI API does not seem to have a model that is nearly as good as the large GPT-3 but cheaper. (Unless the large GPT-3 is no longer available and has been replaced with something cheaper but equally good.)
(An additional source of cheaper inference is, by the way, low-precision hardware (https://dl.acm.org/doi/pdf/10.1145/3079856.3080246).)
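(As a rough illustration of the low-precision point in software rather than hardware, here is a minimal post-training dynamic-quantization sketch with PyTorch on a toy model; not a tuned setup, just the idea that weights can be stored and multiplied in int8 at inference time:)

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained dense model.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).eval()

# Post-training dynamic quantization: Linear weights are stored and multiplied in int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1024)
with torch.no_grad():
    print(model(x).shape, quantized(x).shape)  # same interface, roughly 4x smaller weights
```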
I don’t think sparse (mixture of expert) models are an example of lowering inference cost. They mostly help with training.
No, they don’t. The primary justification for introducing them in the first place was to make a cheaper forward pass (=inference). They’re generally more challenging to train because of the discrete gating, imbalanced experts, and sheer size—the Switch paper discusses the problems, and even the original Shazeer MoE emphasizes all of the challenges in training a MoE compared to a small dense model. Now, if you solve those problems (as Switch does), then yes, the cheaper inference would also make cheaper training (as long as you don’t have to do too much more training to compensate for the remaining problems), and that is an additional justification for Switch. But the primary motivation for researching MoE NMT etc has always been that it’d be a lot more economical to deploy at scale after training.
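To illustrate why the forward pass is cheap, here is a minimal Switch-style top-1 routing layer in PyTorch (toy dimensions, no load-balancing loss): each token runs through exactly one expert, so per-token FLOPs stay roughly constant no matter how many experts you add.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Toy Switch-style layer: each token is routed to exactly one expert FFN."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: [n_tokens, d_model]
        gate = F.softmax(self.router(x), dim=-1)   # routing probabilities
        weight, expert_idx = gate.max(dim=-1)      # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):  # only tokens routed to expert i pay for it
            mask = expert_idx == i
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(Top1MoE()(tokens).shape)  # adding experts adds parameters, not per-token FLOPs
```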
I’m not sure in general but e.g. in the Switch Transformer paper (same link again), distilling into a 5x smaller model means losing most of the performance gained by using the larger model.
Those results are sparse->dense, so they are not necessarily relevant (I would be thinking more of applying distillation to the original MoE and distilling each expert; the MoE is what you want for deployment at scale anyway, that’s the point!). But the glass is half-full: they also report that you can throw away 99% of the model, and still get a third of the boost over the baseline small model. Like I said, the most reliable way to a small powerful model is through a big slow model.
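For reference, the distillation step itself is just the usual Hinton-style soft-target objective; a minimal sketch of the loss you would minimize when shrinking the teacher, whether sparse-to-dense or expert-by-expert (temperature and mixing weight are arbitrary here):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix a soft-target KL term (match the teacher) with the ordinary hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: 4 examples, 10-way classification.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```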
Unless the large GPT-3 is no longer available and has been replaced with something cheaper but equally good.
Yeah, we don’t know what’s going on there. They’ve mentioned further finetuning of the models, but no details. They decline to specify even what the parameter counts are, hence EAI needing to reverse-engineer guesses from their benchmarks. (Perhaps the small models are now distilled models? At least early on, people were quite contemptuous of the small models, but these days people find they can be quite handy. Did we just underrate them initially, or did they actually get better?) They have an ‘instruction’ series which they’ve never explained (probably something like T0/FLAN?). Paul’s estimate of TFLOPS cost vs API billing suggests that compute is not a major priority for them cost-wise, and I can say that whenever I hear OAers talk about bottlenecks, they’re usually complaining about a lack of people, which dabbling in distillation/sparsification wouldn’t help much with. Plus, of course, OA’s public research output seems to have been low since the API launched, which makes you wonder what they all spend their time doing. The API hasn’t changed all that much that I’ve noticed, and after this much time you’d think the sysadmin/SRE stuff would be fairly routine and handle itself. So… yeah, I dunno what’s going on behind the API, and wouldn’t treat it as evidence either way.
No, they don’t. The primary justification for introducing them in the first place was to make a cheaper forward pass (=inference)
The motivation to make inference cheaper doesn’t seem to be mentioned in either the Switch Transformer paper or the original Shazeer paper. They do mention improving training cost, training time (from being much easier to parallelize), and peak accuracy. Whatever the true motivation may be, it doesn’t seem that MoEs change the ratio of training cost to inference cost, except insofar as they’re currently finicky to train.
But the glass is half-full: they also report that you can throw away 99% of the model, and still get a third of the boost over the baseline small model.
Only if you switch to a dense model, which again doesn’t save you that much inference compute.
But as you said, they should instead distill into an MoE with smaller experts. It’s still unclear to me how much inference cost this could save, and at what loss of accuracy.
Either way, distilling would make it harder to further improve the model, so you lose one of the key benefits of silicon-based intelligence (the high serial speed which lets your model do a lot of ‘thinking’ in a short wallclock time).
Paul’s estimate of TFLOPS cost vs API billing suggests that compute is not a major priority for them cost-wise
Fair, that seems like the most plausible explanation.
The motivation to make inference cheaper doesn’t seem to be mentioned in either the Switch Transformer paper or the original Shazeer paper. They do mention improving training cost, training time (from being much easier to parallelize), and peak accuracy.
I’m not sure what you mean. They refer all over the place to greater computational efficiency and the benefits of constant compute cost even as one scales up experts. And this was front and center in the original MoE paper, which emphasized the cheapness of the forward pass and positioned it as an improvement on the GNMT RNN that Google Translate had rolled out a year or so earlier (including benchmarks on the actual internal Google Translate datasets), and which was probably a major TPUv1 user (judging from the share of RNN workload reported in the TPU paper). Training costs are important, of course, but a user like Google Translate, the customer of the MoE work, cares more about deployment costs, because they want to serve literally billions of users while training doesn’t happen so often.
you’re missing all the possibilities of a ‘merely human-level’ AI. It can be parallelized, scaled up and down (both in instances and parameters), ultra-reliable, immortal, consistently improved by new training datasets, low-latency, ultimately amortizes to zero capital investment
I agree this post could benefit from discussing the advantages of silicon-based intelligence, thanks for bringing them up. I’d add that (scaled-up versions of current) ML systems have disadvantages compared to humans, such as lacking actuators and being cumbersome to fine-tune. Not to speak of the switching cost of moving from an economy based on humans to one based on ML systems. I’m not disputing that a human-level model could be transformative in years or decades; I just argue that it may not be in the short term.
MoEs?
Mixture of Experts, pretty sure.