Memory bandwidth constraints imply economies of scale in AI inference
Contemporary GPUs often have very imbalanced memory vs arithmetic operation capabilities. For instance, an H100 can do around 3e15 8-bit FLOP/s, but the speed at which information can move between the cores and the GPU memory is only 3 TB/s. As 8 bits = 1 byte, there is a mismatch of three orders of magnitude between the arithmetic operation capabilities of the GPU and its memory bandwidth.
This imbalance ends up substantially lowering the utilization rate of ML hardware when batch sizes are small. For instance, suppose we have a model parametrized by 1.6 trillion 8-bit floating point numbers. To just fit the parameters of the model onto the GPUs, we’ll need at least 20 H100s, as each H100 has a VRAM of 80 GB. Suppose we split our model into 20 layers and use 20-way tensor parallelism: this means that we slice the parameters of the model “vertically”, such that the first GPU holds the first 5% of the parameters in every layer, the second GPU holds the second 5%, et cetera.
This sounds good, but now think of what happens when we try to run this model. In this case, roughly speaking, each parameter comes with one addition and one multiplication operation, so we do around 3.2 trillion arithmetic operations in one forward pass. As each H100 does 3e15 8-bit FLOP/s and we have 20 of them running tensor parallel, we can do this in a mere ~ 0.05 milliseconds. However, each parameter also has to be read from GPU memory into the cores, and here our total memory bandwidth is only 60 TB/s, meaning for a model of size 1.6 TB we must spend (1.6 TB)/(60 TB/s) ~= 27 ms just because of the memory bottlenecks! This bottlenecks inference and we end up with an abysmal utilization rate of approximately (0.05 ms)/(27 ms) ~= 0.2%. This becomes even worse when we also take inter-GPU communication costs into account, as the interconnect bandwidth would be only around 1 TB/s if the GPUs are using NVLink.
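The arithmetic above can be reproduced in a few lines. This is only a sketch that plugs in the rounded figures from the text; the constant and variable names are made up for illustration, and it ignores inter-GPU communication and every other source of overhead.

```python
# Back-of-the-envelope estimate for one forward pass at batch size 1.
# All figures are the rounded numbers used in the text, not measurements.

PARAMS = 1.6e12          # model parameters
BYTES_PER_PARAM = 1      # 8-bit parameters
FLOP_PER_PARAM = 2       # one multiplication and one addition per parameter
NUM_GPUS = 20            # 20-way tensor parallelism
FLOPS_PER_GPU = 3e15     # 8-bit FLOP/s per H100
MEM_BW_PER_GPU = 3e12    # bytes/s between GPU memory and cores per H100

# Time needed if arithmetic were the only cost.
compute_s = PARAMS * FLOP_PER_PARAM / (NUM_GPUS * FLOPS_PER_GPU)

# Time needed to read every parameter from GPU memory once.
memory_s = PARAMS * BYTES_PER_PARAM / (NUM_GPUS * MEM_BW_PER_GPU)

# The slower of the two sets the pace; utilization is the share of that
# time during which the arithmetic units are actually busy.
utilization = compute_s / max(compute_s, memory_s)

print(f"compute time: {compute_s * 1e3:.3f} ms")
print(f"memory time:  {memory_s * 1e3:.1f} ms")
print(f"utilization:  {utilization:.1%}")
```

With these inputs the compute time comes out at roughly 0.05 ms, the memory time at roughly 27 ms, and the utilization at about 0.2%, matching the figures above.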
Well, this is not very good. Most of our arithmetic operation capability is being wasted because the ALUs spend most of their time idling and waiting for the parameters to be moved to the GPU cores. Can we somehow improve this?
A crucial observation is that if getting the parameters to the GPU cores is the bottleneck, we want to somehow amortize this over many calls to the model. For instance, imagine we could move a batch of parameters to the cores and use them a thousand times before moving on to the next batch. This would do much to remedy the imbalance between memory read and compute times.
If our model is an LLM, then unfortunately we cannot do this for a single user because text is generated serially: even though each token needs its own LLM call and so the user needs to make many calls to the model to generate text, we can’t parallelize these calls because each future token call needs to know all the past tokens. This inherently serial nature of text generation makes it infeasible to improve the memory read and compute time balance if only a single user is being serviced by the model.
However, things are different if we get to batch requests from multiple users together. For instance, suppose that our model is being asked to generate tokens by thousands of users at any given time. Then, we can parallelize these calls: every time we load some parameters onto the GPU cores, we perform the operations associated with those parameters for all user calls at once. This way, we amortize the reading cost of the parameters over many users, greatly improving our situation. Eventually this hits diminishing returns because we must also read the hidden state of each user’s call from GPU memory, and this cost grows with the number of users. However, the hidden states are usually significantly smaller than the whole model, so parallelization still results in huge gains before we enter this regime.
For instance, if we could batch requests from 100 users together in our above setup, we might be able to achieve a utilization rate of 20% - note that in a realistic setup this would be much lower due to many sources of overhead the simplistic calculation is ignoring, but morally the calculation still gives the right result.
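To see how the utilization rate scales with batch size, here is a minimal sketch of the same simplistic calculation. It assumes the parameters are read once per forward pass regardless of batch size, that compute grows linearly with the number of batched users, and that hidden-state reads and communication costs are zero. The function name and defaults are hypothetical and simply reuse the rounded numbers from the example above.

```python
def utilization(batch_size, params=1.6e12, num_gpus=20,
                flops_per_gpu=3e15, mem_bw_per_gpu=3e12):
    """Idealized utilization rate for one forward pass over a batch of users."""
    # Two operations per parameter per user in the batch.
    compute_s = 2 * params * batch_size / (num_gpus * flops_per_gpu)
    # One byte per parameter, read once and shared by the whole batch.
    memory_s = params / (num_gpus * mem_bw_per_gpu)
    # Whichever is slower determines how long the pass takes.
    return compute_s / max(compute_s, memory_s)


for batch_size in (1, 10, 100, 500, 1000):
    print(f"batch size {batch_size:>4}: {utilization(batch_size):.1%}")
```

In this idealized model utilization grows linearly with batch size, from about 0.2% for a single user to about 20% for 100 users, until compute becomes the bottleneck. Because each parameter byte supports two operations here, that crossover happens at a batch size of around 500 rather than 1000; the real break-even point depends on the overheads this calculation leaves out.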
The result is massive economies of scale not just in training AI models, but also in running them. If an individual user wanted to run a large model at a reasonable speed, they might have to pay a thousand times what they would pay to a centralized API provider which relies on large GPU clusters to batch requests from many different users.
Some simple math on this: if you need 1000 concurrent users for reasonable utilization rates because of the 1000:1 imbalance between ALU ops and memory bandwidth in GPUs, and each user on average spends 10 minutes per day using your service, then each user is active for only 10 of the 1440 minutes in a day, so you need a total user base of at least (1000 users) * (1440 minutes/day)/(10 minutes/day) = 144K users. If you also want the service to be consistent, i.e. low latency and high throughput 24 hours a day, you probably need to exceed this by some substantial margin, perhaps even approach 1M total users. This is of course much smaller than the scale of a search engine such as Google, but still probably outside the realm where individual hobbyists or enthusiasts can hope to compete with the cost-effectiveness of centralized providers.
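The user base estimate can be written as a small helper. This is a sketch under the assumption that usage is spread perfectly evenly across the day, which is exactly why a real service would need a margin on top of it; the function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def min_user_base(concurrent_users, minutes_per_user_per_day):
    """Smallest total user base that keeps `concurrent_users` active at all times,
    assuming each user's usage is spread uniformly over the day."""
    return concurrent_users * MINUTES_PER_DAY / minutes_per_user_per_day


print(f"{min_user_base(1000, 10):,.0f} users")
```

With 1000 concurrent users and 10 minutes of use per user per day this gives 144,000 users. Any peak-hour concentration of usage pushes the required user base higher, since off-peak hours would then fall below the target batch size.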
The contrast with the human brain is instructive. An H100 GPU draws 700 W of power to do 3e15 8-bit FLOP/s, which we think is similar to the computational power of the brain, though with ~ 30x the power draw. However, an H100 GPU has a mere 80 GB of VRAM, compared to the human brain’s storage of the “parameter values” of ~ 100 trillion synapses, which would probably take up ~ 100 TB of memory. On top of this, the human brain can run a (trivially) human equivalent intelligence at reasonable latency and throughput at a batch size of one: no parallelization across brains is needed. This suggests the human brain does not suffer from the same memory bandwidth versus arithmetic operation imbalance problem that modern GPUs have.
Whether or not this imbalance can be cheaply engineered away might determine the extent to which the market for AI deployment (which may or may not become vertically disintegrated from AI R&D and training) is dominated by a small number of actors, and seems like an important question about hardware R&D. I don’t have the expertise to judge to what extent engineering away these memory bottlenecks is feasible and would be interested to hear from people who do have expertise in this domain.