Inference cost limits the impact of ever larger models
I sometimes notice that people in my community (myself included) assume that the first “generally human-level” model will lead to a transformative takeoff scenario almost immediately. The assumption seems to be that training is expensive but inference is cheap, so once you’re done training you can deploy an essentially unlimited number of cheap copies of the model. I think this is far from obvious.
[edit: This post should be read as “inference costs may turn out to be a bottleneck. Don’t forget about them. But we don’t know how inference costs will develop in the future. Additionally, it may take a while before we can run lots of copies of an extremely large model because we’d need to build new computers first.”]
Inference refers to the deployment of a trained model on a new input. According to OpenAI’s report from 2018, most compute used for deep learning is spent not on training but on inference. It is true that one inference step is much cheaper than a training run consisting of many training steps. But many inference steps together can make up the bulk of compute.
To gain some intuition, consider that writing 750 words with GPT-3 costs 6 cents. If we made a model with 1000x more parameters, similar to the difference between GPT-1 and GPT-3, the 750 words would cost $60, comparable to the cost of a good human writer. But to start an immediate economic transformation, I expect we need something significantly cheaper (or smarter) than humans.
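As a rough sketch of the arithmetic in the previous paragraph: the only assumption is that inference cost scales linearly with parameter count, which is what the 6 cents to $60 step implies. The function name is just illustrative.

```python
# Price quoted in the post: 6 cents for 750 words with GPT-3.
GPT3_CENTS_PER_750_WORDS = 6


def scaled_cost_cents(base_cents, param_multiplier):
    """Cost of the same 750 words, assuming cost grows linearly with parameters."""
    return base_cents * param_multiplier


cents = scaled_cost_cents(GPT3_CENTS_PER_750_WORDS, 1000)
print(f"1000x larger model: ${cents / 100:.2f} per 750 words")
```

If efficiency improvements or longer context windows change the cost per parameter, the multiplier would need adjusting accordingly.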
Of course, the future will bring efficiency improvements. But also increases in cost. For example, future models may look at a context window longer than 2048 tokens, and I’ve assumed greedy sampling here which is cheap but suboptimal (it’s like typing without getting to revise). I’m unsure how these factors balance out.
To have a transformative impact, as a heuristic, the number of copies of our human-level model should probably exceed the human population (~8 billion). But to run billions of copies, we’d need to dramatically increase the world’s number of supercomputers. You can’t just repurpose all consumer GPUs for inferencing, let alone run GPT-3 on your smartphone. GPT-3 needs hundreds of GPUs just to fit the model into GPU memory.[1] These GPUs must then be linked through a web of fast interconnects professionally fitted in a data center. And if we’re talking about a 1000x larger model, today’s supercomputers may not be ready to store even a single copy of it.[2]
This is not to say that a generally human-level model wouldn’t have some drastic impacts, or be closely followed by generally super-human models; it just makes me pause before assuming that the first human-level model is the end of the world as we know it. In order to run enough copies of the model, depending on its exact size, we’d first need to make it more efficient and build many, many new supercomputers.
[1] You can theoretically run a model on fewer GPUs by putting just the first layer into GPU memory, forward passing on it, then deleting it and loading the second layer from RAM, and so forth (see ZeRO-Infinity). But this comes with high latency which rules out many applications.
[2] I’m told that the largest clusters these days have tens of thousands of GPUs.
You’re missing a lot of the hardware overhang arguments—for example, that DL models can be distilled, sparsified, and compressed to a tremendous degree. The most reliable way to a cheap fast small model is through an expensive slow big model.
Even in the OA API, people make heavy use of the smallest models like Ada, which is <1b parameters (estimated by EAI). The general strategy is to play around with Davinci (175b) until you get a feel for working with GPT-3, refine a prompt on it, and then once you’ve established a working prototype prompt, bring it down to Ada/Babbage/Curie, going as low as possible.
You can also do things like use the largest model to generate examples to finetune much smaller models on: “Unsupervised Neural Machine Translation with Generative Language Models Only”, Han et al 2021 is a very striking recent paper I’ve linked before about self-distillation, but in this case I would emphasize their findings about using the largest GPT-3 to teach the smaller GPT-3s much better translation skills. Or, MoEs implicitly save a ton of compute by shortcutting using cheap sub-models, and that’s why you see a lot of them these days.
Indeed, the experience curves for AI are quite steep: https://openai.com/blog/ai-and-efficiency/ Once you can do something at all… (There was an era where AI Go masters cost more to run than human Go masters. It was a few months in mid-2016.)
More broadly, you’re missing all the possibilities of a ‘merely human-level’ AI. It can be parallelized, scaled up and down (both in instances and parameters), ultra-reliable, immortal, consistently improved by new training datasets, low-latency, ultimately amortizes to zero capital investment, and enables things which are simply impossible for humans—there is no equivalent of ‘generating embeddings’ which can be plugged directly into other models and algorithms. Kaj Sotala’s old paper https://philpapers.org/archive/SOTAOA covers some of this but could stand updating with a DL centric view about all the ways in which a model which achieves human-level performance on some task is far more desirable than an actual human, in much the same way that a car rate-limited to go only as fast as a horse is still more useful and valuable than a horse.
I broadly agree with your first point, that inference can be made more efficient. Though we may have different views on how much?
Of course, both inference and training become more efficient and I’m not sure if the ratio between them is changing over time.
As I mentioned, there are also reasons why inference could become more expensive than in the numbers I gave. Given this uncertainty, my median guess is that the cost of inference will continue to exceed the cost of training (averaged across the whole economy).
I don’t think sparse (mixture of expert) models are an example of lowering inference cost. They mostly help with training. In fact they need so many more parameters that it’s often worth distilling them into a dense model after training. The benefit of the sparse MoE architecture seems to be about faster, parallelizable training, not lower inference cost (same link).
Distillation seems to be the main source of cheaper inference then. How much does it help? I’m not sure in general but e.g. in the Switch Transformer paper (same link again), distilling into a 5x smaller model means losing most of the performance gained by using the larger model. Perhaps that’s why as of May 2021, the OpenAI API does not seem to have a model that is nearly as good as the large GPT-3 but cheaper. (Unless the large GPT-3 is no longer available and has been replaced with something cheaper but equally good.)
(An additional source of cheaper inference, by the way, is low-precision hardware (https://dl.acm.org/doi/pdf/10.1145/3079856.3080246).)
No, they don’t. The primary justification for introducing them in the first place was to make a cheaper forward pass (=inference). They’re generally more challenging to train because of the discrete gating, imbalanced experts, and sheer size—the Switch paper discusses the problems, and even the original Shazeer MoE emphasizes all of the challenges in training a MoE compared to a small dense model. Now, if you solve those problems (as Switch does), then yes, the cheaper inference would also make cheaper training (as long as you don’t have to do too much more training to compensate for the remaining problems), and that is an additional justification for Switch. But the primary motivation for researching MoE NMT etc has always been that it’d be a lot more economical to deploy at scale after training.
Those results are sparse->dense, so they are not necessarily relevant (I would be thinking more applying distillation to the original MoE and distill each expert—the MoE is what you want for deployment at scale anyway, that’s the point!). But the glass is half-full: they also report that you can throw away 99% of the model, and still get a third of the boost over the baseline small model. Like I said, the most reliable way to a small powerful model is through a big slow model.
Yeah, we don’t know what’s going on there. They’ve mentioned further finetuning of the models, but no details. They decline to specify even what the parameter counts are, hence EAI needing to reverse-engineer guesses from their benchmarks. (Perhaps the small models are now distilled models? At least early on, people were quite contemptuous of the small models, but these days people find they can be quite handy. Did we just underrate them initially, or did they actually get better?) They have an ‘instruction’ series they’ve never explained what it is (probably something like T0/FLAN?). Paul’s estimate of TFLOPS cost vs API billing suggests that compute is not a major priority for them cost-wise, and I can say that whenever I hear OAers talk about bottlenecks, they’re usually complaining about lack of people, which dabbling in distillation/sparsification wouldn’t help much with. Plus, of course, OA’s public output of research seems to be low since the API launched, which makes you wonder what they all spend their time doing. The API hasn’t changed all that much that I’ve noticed, and after this much time you’d think the sysadmin/SRE stuff would be fairly routine and handling itself. So… yeah, I dunno what’s going on behind the API, and wouldn’t treat it as evidence either way.
The motivation to make inference cheaper doesn’t seem to be mentioned in the Switch Transformer paper nor in the original Shazeer paper. They do mention improving training cost, training time (from being much easier to parallelize), and peak accuracy. Whatever the true motivation may be, it doesn’t seem that MoEs change the ratio of training to inference cost, except insofar as they’re currently finicky to train.
Only if you switch to a dense model, which again doesn’t save you that much inference compute. But as you said, they should instead distill into an MoE with smaller experts. It’s still unclear to me how much inference cost this could save, and at what loss of accuracy.
Either way, distilling would make it harder to further improve the model, so you lose one of the key benefits of silicon-based intelligence (the high serial speed which lets your model do a lot of ‘thinking’ in a short wallclock time).
Fair, that seems like the most plausible explanation.
I’m not sure what you mean. They refer all over the place to greater computational efficiency and the benefits of constant compute cost even as one scales up experts. And this was front and center in the original MoE paper emphasizing the cheapness of the forward pass and positioning it as an improvement on the GNMT NMT RNN Google Translate had just rolled out the year before or so (including benchmarking the actual internal Google Translate datasets), and which was probably a major TPUv1 user (judging from the % of RNN workload reported in the TPU paper). Training costs are important, of course, but a user like Google Translate, the customer of the MoE work, cares more about the deployment costs because they want to serve literally billions of users, while the training doesn’t happen so often.
MoEs?
Mixture of Experts, pretty sure.
I agree this post could benefit from discussing the advantages of silicon-based intelligence, thanks for bringing them up. I’d add that (scaled up versions of current) ML systems have disadvantages compared to humans, such as lacking actuators and being cumbersome to fine-tune. Not to speak of the switching cost of moving from an economy based on humans to one based on ML systems. I’m not disputing that a human-level model could be transformative in years or decades; I just argue that it may not be in the short term.
It costs well under $1/hour to rent hardware that performs 100 trillion operations per second. If a model using that much compute (something like 3 orders of magnitude more than gpt-3) were competitive with trained humans, it seems like it would be transformative. Even if you needed 3 more orders of magnitude to be human-level at typical tasks, it still looks like it would be transformative in a short period of time owing to its other advantages (quickly reaching and then surpassing the top end of the human range, and running at much larger serial speed—more likely you’d be paying 1000x as much to run your model 1000x faster than a human). If this were literally dropped in our laps right now it would fortunately be slowed down for a while because there just isn’t enough hardware, but that probably won’t be the case for long.
I’m trying to reconcile:
vs
That’s easy to reconcile! OpenAI is selling access to GPT-3 wayyyy above its own marginal hardware rental cost. Right? That would hardly be surprising; usually pricing decisions involve other things besides marginal costs, like price elasticity of demand, and capacity to scale up, and so on. (And/or OpenAI’s marginal costs include things that are not hardware rental, e.g. human monitoring and approval processes.) But as soon as there’s some competition (especially competition from open-source projects) I expect price to rapidly approach the hardware rental cost (including electricity).
Someone can correct me if I’m misunderstanding.
That estimate puts GPT-3 at about 500 billion floating point operations per word, 200x less than 100 trillion. If you think a human reads at 250 words per minute, then 6 cents for 750 words is $1.20/hour. So the two estimates differ by about 250x.
As a citation for the hardware cost:
P4d instances on EC2 currently cost $11.57/h if reserved for 3 years. They contain 8 A100s.
An A100 does about 624 trillion half-precision ops/second.
So that’s 430 trillion (operations per second) per ($/hour).
You shouldn’t expect to be able to get full utilization out of that for a variety of reasons, but in the very long run you should be getting reasonably close, certainly more than 100 trillion operations per second.
(ETA: But note that a service like the OpenAI API using EC2 would need to use on demand prices which are about 10x higher per flop if you want reasonable availability.)
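The numbers in this comment can be checked with a short script. This is only a sketch that plugs in the figures quoted above (reserved P4d price, A100 peak half-precision throughput, 500 billion operations per word, 250 words per minute); it assumes full utilization, which the comment itself says you should not expect.

```python
# Figures quoted in the comments above; nothing here is measured.
P4D_USD_PER_HOUR = 11.57       # 3-year reserved price quoted above
A100S_PER_P4D = 8
A100_OPS_PER_S = 624e12        # half-precision peak quoted above
GPT3_OPS_PER_WORD = 500e9      # estimate quoted above
WORDS_PER_MINUTE = 250         # human reading speed assumed in the comment
API_USD_PER_750_WORDS = 0.06


def ops_per_s_per_usd_hour():
    """Sustained ops/second bought by one dollar per hour, at full utilization."""
    return A100S_PER_P4D * A100_OPS_PER_S / P4D_USD_PER_HOUR


def hardware_usd_per_hour():
    """Rental cost of running GPT-3 at human reading speed."""
    ops_per_s_needed = GPT3_OPS_PER_WORD * WORDS_PER_MINUTE / 60
    return ops_per_s_needed / ops_per_s_per_usd_hour()


def api_usd_per_hour():
    """API price for the same number of words per hour."""
    words_per_hour = WORDS_PER_MINUTE * 60
    return API_USD_PER_750_WORDS * words_per_hour / 750


ratio = api_usd_per_hour() / hardware_usd_per_hour()
print(f"{ops_per_s_per_usd_hour():.3g} ops/s per $/hour")
print(f"API ${api_usd_per_hour():.2f}/h vs hardware ${hardware_usd_per_hour():.4f}/h")
print(f"ratio: about {ratio:.0f}x")
```

With on-demand pricing (about 10x higher per flop, per the ETA above) the gap would shrink by roughly that factor.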
Limitation:
Cost of compute + addition to pricing for:
a) Profit
b) Recouping the costs of training or acquiring the model
An additional feature such as human monitoring/approval does push the price higher. (In principle, it could also increase quality.)
You may have better info, but I’m not sure I expect 1000x better serial speed than humans (at least not with innovations in the next decade). Latency is already a bottleneck in practice, despite efforts to reduce it. Width-wise parallelism has its limits and depth- or data-wise parallelism doesn’t improve latency. For example, GPT-3 already has high latency compared to smaller models and it won’t help if you make it 10^3x or 10^6x bigger.
I’m trying to figure out a principled way to calculate/estimate how long it would take to cross the human range in a situation like this. How do you think about it? Taking the history of Go as a precedent, it would seem that we’d get AGI capable of competing with the average human first, and then several years (decades?) later we’d get an AGI architecture+project that blows through the entire human range in a few months. That feels like it can’t be right.
Depends on what you mean by “human range.” Go was decades only if you talk about crossing the range between people who don’t play Go at all to those who play as a hobby to those who have trained very extensively. If you restrict to the range of “how good would this human be if they trained extensively at Go?” then I’d guess the range is much smaller—I’d guess that the median person could reach a few amateur dan with practice, so maybe you are looking at like 10 stones of range between “unusually bad human” and “best human.”
My rough guess when I looked into it before was that doubling model size is worth about 1 stone around AlphaZero’s size/strength, so that’s about a factor of 1000 in model size.
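Spelled out, the "factor of 1000" is just ten doublings. A tiny sketch, assuming the rough guess of one doubling of model size per stone holds across the whole 10-stone range (it was only estimated around AlphaZero's strength):

```python
STONES_OF_RANGE = 10       # "unusually bad human" to "best human", per the comment
SIZE_FACTOR_PER_STONE = 2  # assumed: one doubling of model size per stone


def model_size_factor(stones, factor_per_stone=SIZE_FACTOR_PER_STONE):
    """Multiplicative increase in model size needed to gain `stones` of strength."""
    return factor_per_stone ** stones


print(model_size_factor(STONES_OF_RANGE))
```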
I think this is mostly an artifact of scaling up R&D effort really quickly. If you have a 50th percentile human and then radically scale up R&D, it wouldn’t be that surprising if you got to “best human” within a year. The reason it would seem surprising to me for AGI is that investment will already be high enough that it won’t be possible to scale up R&D that much / that fast as you approach the average human.
As Steven noted, your $1/hour number is cheaper than my numbers and probably more realistic. That makes a significant difference.
I agree that transformative impact is possible once we’ve built enough GPUs and connected them up into many, many new supercomputers bigger than the ones we have today. In a <=10 year timeline scenario, this seems like a bottleneck. But maybe not with longer timelines.
My general prior on inference cost is that it is the same order of magnitude as training cost, and thus neither dominates the other in general, due to tradeoffs.
I don’t remember where I got that idea from, though.
Latency shouldn’t be a problem, as you can pipeline. At least as long as you don’t run into Little’s Law problems.
(Depending on the structure of the connection matrix, you may be able to even pipeline at a sub-layer granularity.)
GPU bus bandwidth is likely more of a problem. PCIe gen3x16 is “only” ~16GB/s.
That would be interesting if true. I thought that pipelining doesn’t help with latency. Can you expand?
Generically, pipelining increases throughput without lowering latency. Say you want to compute f(x) where f is a NN. Every stage of your pipeline processes e.g. one of the NN layers. Then stage N has to wait for the earlier stages to be completed before it can compute the output of layer N. That’s why the latency to compute f(x) is high.
NB, GPT-3 used pipelining for training (in combination with model- and data parallelism) and still the large GPT-3 has higher latency than the small ones in the OA API.
To give a concrete example:
Say each layer takes 10ms to process. The NN has 100 layers. It takes 40ms to round-trip weight data from the host (say it’s on spinning rust or something). You can fit 5 layers worth of weights on a gpu, in addition to activation data / etc.
On a GPU with a “sufficiently large” amount of memory, such that you can fit everything on-GPU, this will have 1.04s latency overall. 40ms to grab all of the weights into the GPU, then 1s to process.
On a GPU, with no pipelining, loading five layers at a time then processing them, this will take 1.8 seconds latency overall. 40ms to load from disk, then 50 ms to process, for each group of 5 layers.
On a GPU, with pipelining, this will take… 1.04s overall latency. t=0ms, start loading layer 1 weights. t=10ms, start loading layer 2 weights. … t=40ms, start loading layer 5 weights & compute layer 1, t=50ms, start loading layer 6 weights & compute layer 2, etc. (Note that this has a max of 5 ‘active’ sets of weights at once, like in the no-pipelining case.)
(A better example would split this into request latency and bandwidth.)
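The three cases can be reproduced with a small timing model. This is a sketch under simplifying assumptions: a weight load costs only a fixed round trip (no bandwidth limit, so loads in flight overlap freely), and in the pipelined case a layer's weights are requested as soon as one of the 5 slots is free, which is a slightly different issue schedule from the one above but respects the same limit. Function names are made up for illustration.

```python
def latency_all_resident_ms(n_layers, compute_ms, load_ms):
    """Everything fits on the GPU: one load round trip, then compute every layer."""
    return load_ms + n_layers * compute_ms


def latency_no_pipelining_ms(n_layers, compute_ms, load_ms, max_resident):
    """Load a group of layers, wait for it to arrive, compute it, then repeat."""
    total = 0
    remaining = n_layers
    while remaining > 0:
        group = min(max_resident, remaining)
        total += load_ms + group * compute_ms
        remaining -= group
    return total


def latency_pipelined_ms(n_layers, compute_ms, load_ms, max_resident):
    """Request each layer's weights as soon as a slot is free, overlapping compute."""
    compute_done = []  # finish time of each layer's computation
    for i in range(n_layers):
        # A slot frees up when the layer `max_resident` positions earlier finishes.
        issued = 0 if i < max_resident else compute_done[i - max_resident]
        arrived = issued + load_ms
        prev_done = compute_done[i - 1] if i > 0 else 0
        compute_done.append(max(arrived, prev_done) + compute_ms)
    return compute_done[-1]


args = dict(n_layers=100, compute_ms=10, load_ms=40)
print("all resident :", latency_all_resident_ms(**args), "ms")
print("no pipelining:", latency_no_pipelining_ms(**args, max_resident=5), "ms")
print("pipelined    :", latency_pipelined_ms(**args, max_resident=5), "ms")
```

In this model the pipelined case matches the all-resident case because the 40ms round trip fits inside the compute time of the other resident layers; with fewer slots or slower loads the GPU would stall.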
> Every stage of your pipeline processes e.g. one of the NN layers. Then stage N has to wait for the earlier stages to be completed before it can compute the output of layer N. That’s why the latency to compute f(x) is high.
To be clear: I am talking about pipelining loading the NN weights into the GPU. Which is not dependent on the result of the previous layer’s computation.
I can be loading the NN weights for layer N+1 while I’m working on layer N. There’s no dependency on the activations of the previous layer.
> pipelining doesn’t help with latency
Let me give an example (incorrect) exchange that hopefully illustrates the issue.
“You can never stream video from a remote server, because your server roundtrip is 100ms and you only have 20ms per frame.”
“You can pipeline requests.”
“...but I thought pipelining doesn’t help with latency?”
(This example is oversimplified. Video streaming is not done on a per-frame basis, for one.)
The key is: pipelining doesn’t help with latency of individual requests. But that’s not what we care about here. What we care about is the latency from starting request 1 to finishing request N—which pipelining absolutely does help with. (Assuming that you don’t have pipeline hazards at least—which we don’t.)
*****
All of the above being said, this only helps with the “my weights don’t fit in my GPU’s RAM” portion of things (which is what my original comment was responding to). If running an inference takes a billion floating-point ops and your GPU runs at a gigaflop, you’re never going to be able to run it in under a second on a single GPU. (Ditto, if your weights are 16GB and your GPU interface is 16GB/s, you’re never going to be able to run it in under a second on a single GPU… assuming you’re not doing something fancy like decompressing on-GPU at least.)
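Those two floors can be written down directly. A minimal sketch, assuming a single GPU, that all weights have to cross the interface once per inference, and that compute and transfer overlap perfectly (so the slower of the two sets the floor); the function and parameter names are illustrative only.

```python
def min_latency_s(ops_per_inference, gpu_ops_per_s, weight_bytes, link_bytes_per_s):
    """Lower bound on single-GPU inference latency, ignoring every other overhead."""
    compute_floor = ops_per_inference / gpu_ops_per_s
    transfer_floor = weight_bytes / link_bytes_per_s
    # With perfect overlap, the slower of the two is the best case.
    return max(compute_floor, transfer_floor)


# A billion ops on a gigaflop GPU; weights small enough not to matter.
print(min_latency_s(1e9, 1e9, 1e6, 16e9))
# 16GB of weights over a 16GB/s interface; compute fast enough not to matter.
print(min_latency_s(1e9, 1e12, 16e9, 16e9))
```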
Thanks for the examples. Your point seems to be about throughput, not latency (which to my knowledge is defined on a per-request basis). The latency per request may not matter for training but it does matter for inference if you want your model to be fast enough to interact with the world in real time or faster.
Hm. Could you please reread my post? You’re repeatedly stating assertions that I explicitly state and show are not the case.
> Your point seems to be about throughput, not latency
I gave an explicit example where a single inference is lower latency with pipelining here versus without.
Hm. I think I understand where you seem to be misunderstanding. Let me try to explain a little more.
> latency (which to my knowledge is defined on a per-request basis)
The key here is that one “request” is composed of multiple requests.
From the end user point of view, a single “request” means “a single full end-to-end inference”. And the latency they care about is issuing the input data to getting the inference result out.
But from the internal point of view, that single full end-to-end inference has multiple requests (essentially, “load weights for layer 0; run calculation on inputs and layer 0 weights to get layer 1 input; load weights for layer 1; run calculation on layer 0 output and layer 1 weights to get layer 2 input; etc, etc”).
And you can reduce the latency of that one external request (the inference) by pipelining multiple internal subrequests. You are absolutely correct in that the latency of each of the subrequests is not reduced, but the latency of the external request absolutely is reduced compared to if you didn’t pipeline! (At least assuming the internal subrequests can be pipelined, which they can be in this case as I’ve repeatedly noted.)
Thanks for elaborating, I think I know what you mean now. I missed this:
My original claim was that Zero-infinity has higher latency compared to pipelining across many layers of GPUs so that you don’t have to repeatedly load weights from RAM. But as you pointed out, Zero-infinity may avoid the additional latency by loading the next layer’s weights from RAM at the same time as computing the previous layer’s output. This helps IF loading the weights is at least as fast as computing the outputs. If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.
My original claim was therefore misconceived. I’ll revise it to a different claim: bigger neural nets ought to have higher inference latency in general, regardless of whether we use Zero-infinity or not. As I think we both agree, pipelining, in the sense of using different GPUs to compute different layers, doesn’t reduce latency. However, adding more layers increases latency, and it’s hard to compensate with other forms of parallelism. (Width-wise parallelism could help but its communication cost scales unfavorably. It grows as we grow the NN’s width, and then again when we try to reduce latency by reducing the number of neurons per GPU [edit: it’s not quadratic, I was thinking of the parameter count].) Does that seem right to you?
The consequence then would be that inference latency (if not inference cost) becomes a constraint as we grow NNs, at least for applications where latency matters.
Incidentally, the latency cost of width vs depth is something I’ve thought might explain why the brain/body allometric scaling laws are so unfavorable and what all that expensive brain matter does given that our tiny puny little ANNs seem capable of so much: everything with a meaningful biological brain, from ants to elephants, suffers from hard (fatal) latency requirements. You are simply not allowed by Nature or Darwin to take 5 seconds to compute how to move your legs.* (Why was Gato 1 so small and so unimpressive in many ways? Well, they kept it small because they wanted it to run in realtime for a real robot. A much wider Transformer could’ve still met the deadline… but cost a lot more parameters and training than usual by going off the optimal scaling curves.) It does not matter how many watts or neurons you save by using a deep skinny network, if after 10 layers have fired with another 100 to go to compute the next action to take, you’ve been eaten by a stupider but faster-thinking predator.
So a biological brain might be forced to be deep into an unfavorable point on width vs depth—which might be extremely expensive—in order to meet its subset of robotics-related deadlines, as it were.
* With a striking counterexample, in both tininess of brain and largeness of latency, being Portia. What is particularly striking to me is not that it is so intelligent while being so tiny, but that this seems to be directly due to its particular ecological niche: there are very few creatures out there that need extremely flexible intelligent behavior but are also allowed minutes or hours to plan many of their actions… but Portia is one of them, as it is a stealthy predator attacking static prey. So Portia spiders are allowed to do things like spend hours circumnavigating a web to strike their prey spider from the right direction, or gradually test out mimicry until they find the right cue to trick their prey spider. So it’s fascinating to see that in this highly unusual niche, it is possible to have a tiny biological brain execute extremely slow but intelligent strategies, and it suggests that if latency were not a problem, biological brains could be far more intelligent and we would not need to see such architecturally-huge biological brains to reach human-level performance, and then we would no longer have any paradox of why highly-optimized human brains seem to need so many parameters to do the same thing as tiny ANNs.
I am glad we were able to work out the matter!
> If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.
Beware bandwidth bottlenecks, as I mentioned in my original post. If you have a 1TB model, you need to have it somewhere with >=1TB/s effective bandwidth between storage and the compute endpoint to achieve 1 second of latency when doing an inference. And storage capacity (not to mention model size) keeps rising faster than bandwidth does...
(There are tricks here to an extent—such as compressing the model and decompressing it on-target—but they seldom save much. (And if they do, that just means your model is inefficient...))
According to a random guy on the internet, GPT-3 is ~300GB compressed. PCIe gen4x16 is ~31.5GB/s. If you have 1s of latency, that means that you can only stream in ~31.5GB per card. (In addition to what’s already stored in RAM.)
That being said, as far as I can tell it is—in theory—possible to run a GPT-3 inference on a single Threadripper Pro platform (or something else with 128 lanes of gen4 pcie), with 8x 6GB graphics cards in 1 second, if you have 300GB of DRAM lying around. (Or 4x 12GB graphics cards in 2 seconds, with the other half of the pcie lanes filled with gen4 SSDs.)
(In practice I strongly suspect you’ll hit some unknown limit in the PCIe root complex or thereabouts. This is shuffling something silly like 250GB/s of data through that one poor root complex.)
(It’s a pity that there’s no good way to ask a GPU to pull data directly from an SSD. ICMB could help, but it requires GPU-side software support. Most of this data stream could go directly from SSD to PCIe switch to graphics card without having to be bounced through the root port...)
(Yes, 8x gpu->gpu communications will hurt overall latency… but not by all that much I don’t think. 1 second is an eternity.)
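The 8-card and 4-card configurations come out of a simple budget: what is already in each card's memory plus what its link can stream within the latency target. A rough sketch, assuming each card gets a full gen4 x16 link at ~31.5GB/s, the weights are split evenly across cards, and compute time and GPU-to-GPU traffic are ignored; the function names are illustrative.

```python
PCIE_GEN4_X16_GB_PER_S = 31.5  # per-card link bandwidth quoted above


def max_model_gb(n_cards, vram_gb_per_card, budget_s, link_gb_per_s=PCIE_GEN4_X16_GB_PER_S):
    """Largest model whose weights can all reach a GPU within the latency budget."""
    per_card_gb = vram_gb_per_card + link_gb_per_s * budget_s
    return n_cards * per_card_gb


def min_stream_time_s(model_gb, total_bandwidth_gb_per_s):
    """Time to stream the whole model once at the given aggregate bandwidth."""
    return model_gb / total_bandwidth_gb_per_s


print(max_model_gb(n_cards=8, vram_gb_per_card=6, budget_s=1))   # 8x 6GB cards, 1 second
print(max_model_gb(n_cards=4, vram_gb_per_card=12, budget_s=2))  # 4x 12GB cards, 2 seconds
print(min_stream_time_s(1000, 1000))                             # 1TB model at 1TB/s
```

Both configurations land on the ~300GB quoted for GPT-3 with nothing to spare, so any shortfall in real PCIe throughput would push the latency past the target.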
> As I think we both agree, pipelining, in the sense of using different GPUs to compute different layers, doesn’t reduce latency.
Indeed. And indeed, increases it, as you’re adding GPU-->GPU trips to the critical path.
Presumably bandwidth requirements can be reduced a lot through width-wise parallelism. Each GPU only has to load one slice of the model then. Of course you’ll need more GPUs then but still not a crazy number as long as you use something like ZeRO-infinity.
Width-wise communication, if you mean that, can be quite a latency bottleneck for training. And it gets worse when you make the model wider or the batch bigger, which of course people are constantly doing. But for inference I guess you can reduce the latency if you’re willing to use a small batch size.
Total PCIe bandwidth for even a Threadripper Pro platform (128 lanes of gen4 pcie) is ~250GB/s. Most other platforms have less (especially Intel, which likes to market-segment by restricting the number of pcie lanes).
Gen5 and gen6 PCIe in theory will double this and double this again—but on a multiyear cadence at best.
Meanwhile GPT-3 is ~300GB compressed, and model size seems to keep increasing.
Hence: beware bandwidth bottlenecks.
My point is that, while PCIe bandwidths aren’t increasing very quickly, it’s easy to increase the number of machines you use. So you can distribute each NN layer (width-wise) across many machines, each of which adds to the total bandwidth you have.
(As noted in the previous comment, you can do this with <<300GB of total GPU memory for GPT-3 with something like ZeRO-infinity)
Perhaps what you meant is that latency will be high but this isn’t a problem as long as you have high throughput. That’s basically true for training. But this post is about inference, where latency matters a lot more.
(It depends on the application of course, but the ZeRO Infinity approach can make your model so slow that you don’t want to interact with it in real time, even at GPT-3 scale)