That would be interesting if true. I thought that pipelining doesn’t help with latency. Can you expand?
Generically, pipelining increases throughput without lowering latency. Say you want to compute f(x) where f is a NN. Every stage of your pipeline processes e.g. one of the NN layers. Then stage N has to wait for the earlier stages to be completed before it can compute the output of layer N. That’s why the latency to compute f(x) is high.
NB, GPT-3 used pipelining for training (in combination with model and data parallelism), and still the large GPT-3 models have higher latency than the small ones in the OpenAI API.
To give a concrete example:
Say each layer takes 10ms to process. The NN has 100 layers. It takes 40ms to round-trip weight data from the host (say it’s on spinning rust or something). You can fit 5 layers’ worth of weights on a GPU, in addition to activation data etc.
On a GPU with a “sufficiently large” amount of memory, such that you can fit everything on-GPU, this will have 1.04s latency overall: 40ms to grab all of the weights into the GPU, then 1s to process.
On a GPU with no pipelining, loading five layers at a time and then processing them, this will take 1.8s latency overall: 40ms to load from disk, then 50ms to process, for each of the 20 groups of 5 layers.
On a GPU, with pipelining, this will take… 1.04s overall latency. t=0ms, start loading layer 1 weights. t=10ms, start loading layer 2 weights. … t=40ms, start loading layer 5 weights & compute layer 1, t=50ms, start loading layer 6 weights & compute layer 2, etc. (Note that this has a max of 5 ‘active’ sets of weights at once, like in the no-pipelining case.)
(A better example would split this into request latency and bandwidth.)
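The three cases above can be checked with a small simulation. This is a minimal sketch, not a model of any real runtime: the function names are made up for illustration, and it assumes that weight loads can overlap each other, that a new load is issued every `issue_ms` (10ms in the example), and that a weight slot is freed as soon as its layer finishes computing.

```python
def latency_all_resident(n_layers, compute_ms, load_ms):
    """Everything fits on the GPU: one load, then compute every layer."""
    return load_ms + n_layers * compute_ms


def latency_no_pipelining(n_layers, compute_ms, load_ms, group):
    """Load a group of layers, compute them, and only then load the next group."""
    n_groups = -(-n_layers // group)  # ceiling division
    return n_groups * load_ms + n_layers * compute_ms


def latency_pipelined(n_layers, compute_ms, load_ms, max_resident, issue_ms):
    """Prefetch weights for later layers while earlier layers compute.

    Assumption: at most `max_resident` sets of weights are loading or
    loaded at any time, and loads are issued no faster than every `issue_ms`.
    """
    compute_done = []  # compute_done[k] = time at which layer k finishes
    for k in range(n_layers):
        load_start = k * issue_ms
        if k >= max_resident:
            # Wait for the slot used by layer k - max_resident to be freed.
            load_start = max(load_start, compute_done[k - max_resident])
        weights_ready = load_start + load_ms
        prev_done = compute_done[-1] if compute_done else 0
        # A layer needs both its weights and the previous layer's output.
        compute_start = max(weights_ready, prev_done)
        compute_done.append(compute_start + compute_ms)
    return compute_done[-1]


if __name__ == "__main__":
    print(latency_all_resident(100, 10, 40))        # everything on-GPU
    print(latency_no_pipelining(100, 10, 40, 5))    # load 5, compute 5, repeat
    print(latency_pipelined(100, 10, 40, 5, 10))    # prefetch during compute
```

With the numbers from the example, the pipelined case matches the everything-on-GPU case because each load hides behind the computation of the preceding layers.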
> Every stage of your pipeline processes e.g. one of the NN layers. Then stage N has to wait for the earlier stages to be completed before it can compute the output of layer N. That’s why the latency to compute f(x) is high.
To be clear: I am talking about pipelining loading the NN weights into the GPU. Which is not dependent on the result of the previous layer’s computation.
I can be loading the NN weights for layer N+1 while I’m working on layer N. There’s no dependency on the activations of the previous layer.
> pipelining doesn’t help with latency
Let me give an example (incorrect) exchange that hopefully illustrates the issue.
“You can never stream video from a remote server, because your server round trip is 100ms and you only have 20ms per frame.”
“You can pipeline requests.”
“...but I thought pipelining doesn’t help with latency?”
(This example is oversimplified. Video streaming is not done on a per-frame basis, for one.)
The key is: pipelining doesn’t help with latency of individual requests. But that’s not what we care about here. What we care about is the latency from starting request 1 to finishing request N—which pipelining absolutely does help with. (Assuming that you don’t have pipeline hazards at least—which we don’t.)
*****
All of the above being said, this only helps with the “my weights don’t fit in my GPU’s RAM” portion of things (which is what my original comment was responding to). If running an inference takes a billion floating-point ops and your GPU runs at a gigaflop, you’re never going to be able to run it in under a second on a single GPU. (Ditto, if your weights are 16GB and your GPU interface is 16GB/s, you’re never going to be able to run it in under a second on a single GPU… assuming you’re not doing something fancy like decompressing on-GPU at least.)
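Both limits can be written as a single lower bound. The sketch below is illustrative only: `min_latency_s` is a hypothetical helper, it assumes transfer and compute overlap perfectly (so the bound is the larger of the two), and it counts only weights that actually have to cross the GPU interface.

```python
def min_latency_s(flops, flops_per_s, streamed_bytes, link_bytes_per_s):
    """Lower bound on single-GPU inference latency, in seconds.

    Assumes perfect overlap of weight transfer and compute, so neither
    pipelining nor anything else on one GPU can beat the slower of the two.
    """
    compute_s = flops / flops_per_s
    transfer_s = streamed_bytes / link_bytes_per_s
    return max(compute_s, transfer_s)


if __name__ == "__main__":
    # A billion floating-point ops on a gigaflop GPU, nothing to stream.
    print(min_latency_s(1e9, 1e9, 0, 16e9))
    # 16GB of weights over a 16GB/s interface, negligible compute.
    print(min_latency_s(0, 1e9, 16e9, 16e9))
```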
Thanks for the examples. Your point seems to be about throughput, not latency (which to my knowledge is defined on a per-request basis). The latency per request may not matter for training but it does matter for inference if you want your model to be fast enough to interact with the world in real time or faster.
Hm. Could you please reread my post? You’re repeatedly stating assertions that I explicitly state and show are not the case.
> Your point seems to be about throughput, not latency
I gave an explicit example where a single inference is lower latency with pipelining here versus without.
Hm. I think I see where the misunderstanding is. Let me try to explain a little more.
> latency (which to my knowledge is defined on a per-request basis)
The key here is that one “request” is composed of multiple requests.
From the end user point of view, a single “request” means “a single full end-to-end inference”. And the latency they care about is issuing the input data to getting the inference result out.
But from the internal point of view, that single full end-to-end inference has multiple requests (essentially, “load weights for layer 0; run calculation on inputs and layer 0 weights to get layer 1 input; load weights for layer 1; run calculation on layer 0 output and layer 1 weights to get layer 2 input; etc, etc”).
And you can reduce the latency of that one external request (the inference) by pipelining multiple internal subrequests. You are absolutely correct in that the latency of each of the subrequests is not reduced, but the latency of the external request absolutely is reduced compared to if you didn’t pipeline! (At least assuming the internal subrequests can be pipelined, which they can be in this case, as I’ve repeatedly noted.)
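Thanks for elaborating, I think I know what you mean now. I missed this:
> I am talking about pipelining loading the NN weights into the GPU. Which is not dependent on the result of the previous layer’s computation.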
My original claim was that ZeRO-Infinity has higher latency compared to pipelining across many layers of GPUs so that you don’t have to repeatedly load weights from RAM. But as you pointed out, ZeRO-Infinity may avoid the additional latency by loading the next layer’s weights from RAM at the same time as computing the previous layer’s output. This helps IF loading the weights is at least as fast as computing the outputs. If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.
My original claim was therefore misconceived. I’ll revise it to a different claim: bigger neural nets ought to have higher inference latency in general, regardless of whether we use ZeRO-Infinity or not. As I think we both agree, pipelining, in the sense of using different GPUs to compute different layers, doesn’t reduce latency. However, adding more layers increases latency, and it’s hard to compensate with other forms of parallelism. (Width-wise parallelism could help but its communication cost scales unfavorably. It grows as we grow the NN’s width, and then again when we try to reduce latency by reducing the number of neurons per GPU [edit: it’s not quadratic, I was thinking of the parameter count].) Does that seem right to you?
The consequence then would be that inference latency (if not inference cost) becomes a constraint as we grow NNs, at least for applications where latency matters.
Incidentally, the latency cost of width vs depth is something I’ve thought might explain why the brain/body allometric scaling laws are so unfavorable and what all that expensive brain matter does given that our tiny puny little ANNs seem capable of so much: everything with a meaningful biological brain, from ants to elephants, suffers from hard (fatal) latency requirements. You are simply not allowed by Nature or Darwin to take 5 seconds to compute how to move your legs.* (Why was Gato 1 so small and so unimpressive in many ways? Well, they kept it small because they wanted it to run in realtime for a real robot. A much wider Transformer could’ve still met the deadline… but cost a lot more parameters and training than usual by going off the optimal scaling curves.) It does not matter how many watts or neurons you save by using a deep skinny network, if after 10 layers have fired with another 100 to go to compute the next action to take, you’ve been eaten by a stupider but faster-thinking predator.
So a biological brain might be forced to be deep into an unfavorable point on width vs depth—which might be extremely expensive—in order to meet its subset of robotics-related deadlines, as it were.
* With a striking counterexample, in both tininess of brain and largeness of latency, being Portia. What is particularly striking to me is not that it is so intelligent while being so tiny, but that this seems to be directly due to its particular ecological niche: there are very few creatures out there that need extremely flexible intelligent behavior but are also allowed minutes or hours to plan many of their actions… but Portia is one of them, as it is a stealthy predator attacking static prey. So Portia spiders are allowed to do things like spend hours circumnavigating a web to strike their prey spider from the right direction, or gradually test out mimicry until they find the right cue to trick their prey spider. So it’s fascinating to see that in this highly unusual niche, it is possible to have a tiny biological brain execute extremely slow but intelligent strategies. It suggests that if latency were not a problem, biological brains could be far more intelligent and we would not need to see such architecturally huge biological brains to reach human-level performance, and then we would no longer have any paradox of why highly optimized human brains seem to need so many parameters to do the same thing as tiny ANNs.
I am glad we were able to work out the matter!
> If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.
Beware bandwidth bottlenecks, as I mentioned in my original post. If you have a 1TB model, you need to have it somewhere with >=1TB/s effective bandwidth between storage and the compute endpoint to achieve 1 second of latency when doing an inference. And storage capacity (not to mention model size) keeps rising faster than bandwidth does...
(There are tricks here to an extent—such as compressing the model and decompressing it on-target—but they seldom save much. (And if they do, that just means your model is inefficient...))
According to a random guy on the internet, GPT-3 is ~300GB compressed. PCIe gen4 x16 is ~31.5GB/s. If you have 1s of latency, that means that you can only stream in ~31.5GB per card. (In addition to what’s already stored in on-card RAM.)
That being said, as far as I can tell it is, in theory, possible to run a GPT-3 inference on a single Threadripper Pro platform (or something else with 128 lanes of gen4 PCIe) with 8x 6GB graphics cards in 1 second, if you have 300GB of DRAM lying around. (Or 4x 12GB graphics cards in 2 seconds, with the other half of the PCIe lanes filled with gen4 SSDs.)
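The arithmetic behind those two configurations can be spelled out as follows. This is a back-of-the-envelope sketch using only the figures quoted above (~31.5GB/s per gen4 x16 link, ~300GB model); the helper names are hypothetical, and it assumes every card streams at full link speed for the whole latency budget, which a real root complex may not sustain.

```python
GEN4_X16_GB_S = 31.5  # approximate PCIe gen4 x16 bandwidth quoted above


def platform_bandwidth_gb_s(lanes, per_x16_gb_s=GEN4_X16_GB_S):
    """Aggregate bandwidth if every group of 16 lanes feeds one card."""
    return (lanes // 16) * per_x16_gb_s


def max_model_gb(cards, vram_gb, budget_s, per_x16_gb_s=GEN4_X16_GB_S):
    """Largest model servable within `budget_s` seconds.

    Each card contributes the weights already resident in its memory plus
    whatever it can stream over its own x16 link during the budget.
    """
    return cards * (vram_gb + per_x16_gb_s * budget_s)


if __name__ == "__main__":
    print(platform_bandwidth_gb_s(128))          # 128 lanes of gen4
    print(max_model_gb(cards=8, vram_gb=6, budget_s=1))
    print(max_model_gb(cards=4, vram_gb=12, budget_s=2))
```

Both configurations come out at roughly the quoted GPT-3 size, which is why they sit right at the edge of feasibility.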
(In practice I strongly suspect you’ll hit some unknown limit in the PCIe root complex or thereabouts. This is shuffling something silly like 250GB/s of data through that one poor root complex.)
(It’s a pity that there’s no good way to ask a GPU to pull data directly from an SSD. ICMB could help, but it requires GPU-side software support. Most of this data stream could go directly from SSD to PCIe switch to graphics card without having to be bounced through the root port...)
(Yes, 8x gpu->gpu communications will hurt overall latency… but not by all that much I don’t think. 1 second is an eternity.)
> As I think we both agree, pipelining, in the sense of using different GPUs to compute different layers, doesn’t reduce latency.
Indeed. In fact, it increases latency, as you’re adding GPU-to-GPU trips to the critical path.
> Beware bandwidth bottlenecks, as I mentioned in my original post.
Presumably bandwidth requirements can be reduced a lot through width-wise parallelism. Each GPU only has to load one slice of the model then. Of course you’ll need more GPUs then, but still not a crazy number as long as you use something like ZeRO-Infinity.
> (Yes, 8x gpu->gpu communications will hurt overall latency… but not by all that much I don’t think. 1 second is an eternity.)
Width-wise communication, if you mean that, can be quite a latency bottleneck for training. And it gets worse when you make the model wider or the batch bigger, which of course people are constantly doing. But for inference I guess you can reduce the latency if you’re willing to use a small batch size.
> Presumably bandwidth requirements can be reduced a lot through width-wise parallelism.
Total PCIe bandwidth for even a Threadripper Pro platform (128 lanes of gen4 PCIe) is ~250GB/s. Most other platforms have less (especially Intel, which likes to market-segment by restricting the number of PCIe lanes).
Gen5 and gen6 PCIe in theory will double this and double this again—but on a multiyear cadence at best.
Meanwhile GPT-3 is ~300GB compressed, and model size seems to keep increasing.
Hence: beware bandwidth bottlenecks.
My point is that, while PCIe bandwidths aren’t increasing very quickly, it’s easy to increase the number of machines you use. So you can distribute each NN layer (width-wise) across many machines, each of which adds to the total bandwidth you have.
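As a rough illustration of that scaling argument, the sketch below estimates how many machines are needed to stream a model within a latency budget. It is a simplification under stated assumptions: `machines_needed` is a hypothetical helper, per-machine bandwidth is taken as the ~252GB/s figure from earlier in the thread, weights already resident on the GPUs are ignored, and so is the inter-machine communication cost that width-wise parallelism adds.

```python
import math


def machines_needed(model_gb, per_machine_gb_s, budget_s):
    """Machines required so their combined bandwidth streams the model in time.

    Ignores resident weights and any machine-to-machine communication,
    so this is an optimistic lower bound on the machine count.
    """
    return math.ceil(model_gb / (per_machine_gb_s * budget_s))


if __name__ == "__main__":
    print(machines_needed(300, 252, 1.0))  # ~GPT-3 size, 1 second budget
    print(machines_needed(300, 252, 0.1))  # same model, 100ms budget
```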
(As noted in the previous comment, you can do this with <<300GB of total GPU memory for GPT-3 with something like ZeRO-infinity)
Perhaps what you meant is that latency will be high but this isn’t a problem as long as you have high throughput. That is basically true for training. But this post is about inference, where latency matters a lot more.
(It depends on the application of course, but the ZeRO-Infinity approach can make your model so slow that you don’t want to interact with it in real time, even at GPT-3 scale.)