But actually that single GPU is going to train a lot faster than the ten GPUs because the ten GPUs are going to have to spend time communicating with each other. Especially as memory limitations make you resort to tensor or pipeline parallelism instead of data parallelism.
Well that’s not quite right, otherwise everyone would be training on single GPUs using very different techniques, which is not what we observe. Every parallel system has communication, but it doesn’t necessarily ‘spend time’ on that in the blocking sense; communication typically overlaps with computation.
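To illustrate the point, here is a toy Python model (illustrative numbers only, not benchmarks) of why communication that is overlapped with the backward pass doesn’t add to step time until it exceeds compute time:

```python
# Toy model of a data-parallel training step: the gradient all-reduce can be
# overlapped with the backward pass, so step time is bounded by the slower of
# the two rather than their sum. Numbers below are purely illustrative.

def step_time(compute_s, comm_s, overlap=True):
    """Per-step wall time under a simple overlap model."""
    if overlap:
        return max(compute_s, comm_s)   # communication hidden behind compute
    return compute_s + comm_s           # fully blocking communication

# Example: 100 ms of compute per step, 60 ms of gradient communication.
blocking = step_time(0.100, 0.060, overlap=False)   # sum: ~0.16 s
overlapped = step_time(0.100, 0.060, overlap=True)  # max: 0.10 s
print(blocking, overlapped)
```

Under this model, communication only starts to cost wall time once it exceeds the compute time it hides behind.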
SOTA models now do often seem limited by RAM, so model parallelism is increasingly important, as it is RAM-efficient. This is actually why Cerebras’s strategy doesn’t make sense: GPUs are heavily optimized for the sweet spot in terms of RAM capacity/$ and RAM bandwidth. The wafer-scale approach instead tries to use on-chip SRAM to replace off-chip RAM, which is just enormously more expensive, at least an OOM more expensive in practice.
Then the pitch for Andromeda is that, unlike a GPU pod, those 120 PFLOPS are “real” in the sense that training speed increases linearly with the PFLOPS.
This of course is bogus, because with model parallelism you can tune the interconnect requirements based on the model design, and Nvidia has been tuning its interconnect tradeoffs for years in tandem with researchers co-tuning their software/models for Nvidia hardware. So current training setups are not strongly limited by interconnect relative to other factors: some probably are, some underutilize interconnect and are limited by something else, but Nvidia knows all of this, has all that data, and has been optimizing for these use cases, weighted by value, for years now (and is empirically better at this game than anybody else).
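As a concrete example of why data-parallel interconnect requirements are predictable and tunable, here is a back-of-envelope sketch; the 2(N−1)/N factor is the standard per-GPU traffic of a ring all-reduce, and the model size is an illustrative assumption:

```python
# Back-of-envelope: per-GPU traffic for a ring all-reduce of gradients.
# Each GPU sends and receives roughly 2*(N-1)/N times the gradient size,
# nearly independent of GPU count, which is one reason interconnect needs
# can be budgeted up front when designing the model and the parallelism.

def ring_allreduce_bytes(param_count, bytes_per_param=2, n_gpus=8):
    grad_bytes = param_count * bytes_per_param
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes

# A hypothetical 10B-parameter model with fp16 gradients on 8 GPUs:
per_gpu = ring_allreduce_bytes(10e9, bytes_per_param=2, n_gpus=8)
print(per_gpu / 1e9, "GB moved per GPU per step")  # ~35 GB
```

Whether that traffic is a bottleneck then depends on how much of it can be overlapped with the backward pass, which is exactly the tuning knob discussed above.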
Fast off-chip memory combined with high memory bandwidth on the chip itself?
The upside of a wafer-scale chip is fast on-chip transfer; the downside is slower off-chip transfer (as that is limited by the 2D perimeter of the much larger chip). For equal flops and/or $$, the GPU design of breaking up the large tile into alternating logic and RAM subsections has higher total off-chip RAM and off-chip transfer bandwidth.
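The perimeter-vs-area argument can be made concrete with a toy geometry sketch (square dies assumed for simplicity; the specific dimensions are illustrative):

```python
# Toy geometry: compute scales with die area, off-chip IO with die perimeter.
# Splitting one big square tile into k*k smaller dies keeps total area
# (flops) fixed but multiplies total perimeter (off-chip IO) by k.

def total_perimeter(side, k):
    """Total edge length when a side x side tile is split into k*k dies."""
    small_side = side / k
    return 4 * small_side * k * k   # k^2 dies, each with perimeter 4*side/k

wafer = total_perimeter(200.0, 1)    # one large wafer-scale die
gpus = total_perimeter(200.0, 10)    # 100 reticle-sized dies, same total area
print(gpus / wafer)                  # 10x more edge for off-chip RAM links
```

Same silicon area, an order of magnitude more edge to hang external RAM off of; that is the structural reason the GPU layout wins on off-chip bandwidth.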
The more ideal wafer design would be one where you had RAM stacked above in 3D, but Cerebras doesn’t do that, presumably because they need that whole surface for heat transfer. If you look inside the engine block of the CS-2 from their nice virtual tour, you can see that the wafer is sandwiched directly between the massive voltage-regulator array that pumps in power and the cooling system that pumps out heat. There is no off-chip RAM next to that wafer; all off-chip RAM access has to go through the long-range IO modules on the edge of the chip.
So a single CS-2, even though it has the cost and nearly the flops you’d expect of the equivalent GPU die area of 100 individual GPUs, has only 40GB of RAM: half the 80GB of an A100 or H100, less even than the 48GB of an RTX A6000! So it has over 100x less RAM than an equivalent-size (cost, flops, die area) GPU system. Worse yet, it has only a pathetic 150GB/s of IO bandwidth out to any external RAM or SSD, vs the ~3TB/s of RAM bandwidth per H100 GPU, so you can’t supplement with external RAM.
This machine is an autistic savant. It maxes out local on-chip interconnect (which GPUs aren’t strongly constrained by) at the expense of precious RAM. So, as I said, it’s only really good for running small models (those which fit in 40GB) at very high speeds.
I am certainly not an expert, but I am still not sure about your claim that it’s only good for running small models. The main advantage they claim to have is “storing all model weights externally and stream them onto each node in the cluster without suffering the traditional penalty associated with off chip memory. weight streaming enables the training of models two orders of magnitude larger than the current state-of-the-art, with a simple scaling model.” (https://www.cerebras.net/product-cluster/, weight streaming). So they explicitly claim that it should perform well with large models.
Furthermore, in their white paper (https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20Booth%20Docs/CS%20Weight%20Streaming%20White%20Paper%20111521.pdf), they claim that the CS-2 architecture is much better suited for sparse models (e.g. via the Lottery Ticket Hypothesis), and on page 16 they show that a sparse GPT-3 could be trained in 2–5 days.

This would also align with tweets by OpenAI that “Trillion is the new billion,” and rumors about the new GPT-4 being a similarly big jump as GPT-2 → GPT-3 was, with a colossal number of parameters and a sparse paradigm (https://thealgorithmicbridge.substack.com/p/gpt-4-rumors-from-silicon-valley). I could imagine that sparse parameters deliver much stronger results than normal parameters, and this might change the scaling laws a bit.
The main advantage they claim to have is “storing all model weights externally and stream them onto each node in the cluster without suffering the traditional penalty associated with off chip memory. weight streaming enables the training of models two orders of magnitude larger than the current state-of-the-art, with a simple scaling model.”
This is almost a joke, because the equivalent GPU architecture has both greater total IO bandwidth to any external SSD/RAM array, and massive near-die GPU RAM that can function as a cache for any streaming approach. So if streaming works as well as Cerebras claims, GPUs can do that as well or better.
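As a sketch of what GPU-side streaming with a local cache could look like, here is a hypothetical double-buffered tiled matmul in Python/NumPy, with a `fetch` callback standing in for the slow external transfer. The names and structure here are my own illustration, not any vendor’s API:

```python
import threading
import numpy as np

# Hypothetical double-buffered weight streaming: while the current weight
# tile is being multiplied, the next tile is fetched in the background,
# using fast local RAM as the staging cache for the slow external source.

def stream_matmul(x, weight_tiles, fetch):
    """Compute y = x @ W, with W split into column tiles fetched lazily."""
    buf = fetch(weight_tiles[0])            # prefetch the first tile
    outputs = []
    for i in range(len(weight_tiles)):
        if i + 1 < len(weight_tiles):       # start fetching tile i+1 ...
            nxt = [None]
            t = threading.Thread(
                target=lambda j=i: nxt.__setitem__(0, fetch(weight_tiles[j + 1])))
            t.start()
        outputs.append(x @ buf)             # ... while computing on tile i
        if i + 1 < len(weight_tiles):
            t.join()
            buf = nxt[0]
    return np.concatenate(outputs, axis=-1)

# Demo: here "fetch" just hands the tile back; a real system would copy it
# from host RAM or SSD into GPU memory at this point.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
x = rng.standard_normal((4, 64))
tiles = [W[:, j:j + 8] for j in range(0, 32, 8)]
y = stream_matmul(x, tiles, fetch=lambda t: t)
print(np.allclose(y, x @ W))  # True
```

The point of the sketch is only that a big near-die RAM lets the fetch of the next tile hide behind the matmul on the current one; the faster the local cache and the IO path, the larger the models this trick covers.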
I agree sparsity (and probably also streaming) will be increasingly important; I’ve actually developed new techniques for sparse matrix multiplication on GPUs.
So if streaming works as well as Cerebras claims, GPUs can do that as well or better.
Hmm, I’m still not sure I buy this, after spending some more time thinking about it. GPUs can’t stream a matrix multiplication efficiently, as far as I’m aware. My understanding is that they’re not very good at matrix-vector operations compared to matrix-matrix because they rely on blocked matrix multiplies to efficiently use caches and avoid pulling weights from RAM every time.
Cerebras says that the CS-2 is specifically designed for fast matrix-vector operations, and uses dataflow scheduling, so it can stream a matrix multiplication by just performing matrix-vector operations as weights stream in. And the weights are getting streamed from external RAM, rather than requested as needed, so there’s no round-trip latency gunking up the works like a GPU has when it wants data from RAM.
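The matrix-vector vs matrix-matrix point can be made quantitative with a quick arithmetic-intensity sketch (fp16 weights assumed):

```python
# Arithmetic intensity (flops per byte of weights moved) for a matrix-vector
# product vs a blocked matrix-matrix product. A GEMV touches each weight
# exactly once, so it is bandwidth-bound; batching the input to B columns
# reuses each weight B times, which is what keeps GPU tensor cores fed.

def flops_per_weight_byte(batch, bytes_per_weight=2):
    # y = W @ X with W (n x n) and X (n x batch): 2*n*n*batch flops against
    # n*n*bytes_per_weight bytes of weight traffic; n cancels out.
    return 2 * batch / bytes_per_weight

print(flops_per_weight_byte(1))    # GEMV: 1 flop per weight byte
print(flops_per_weight_byte(256))  # blocked GEMM: 256 flops per weight byte
```

This is why a GEMV-shaped workload lives or dies on weight bandwidth, which is the quantity the two architectures trade off so differently.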
I agree sparsity (and probably also streaming) will be increasingly important; I’ve actually developed new techniques for sparse matrix multiplication on GPUs.
Cerebras claims that their hardware support for fast matrix-vector multiplication gives a 10x speed boost to multiplying sparse matrices, which could be helpful.
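For intuition on why hardware sparsity support matters, here is a minimal CSR sparse matrix-vector product in Python; the work scales with the nonzero count rather than the dense size (illustrative, not an optimized kernel):

```python
import numpy as np

# Sparse matrix-vector product over a CSR (compressed sparse row) layout:
# only the stored nonzeros are read and multiplied, so the op count tracks
# nnz instead of n*n. Hardware that keeps this efficient at fine granularity
# is what a claimed sparse speedup would rest on.

def csr_matvec(data, indices, indptr, x):
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]   # nonzeros of this row
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y

# The 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form:
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
indices = np.array([0, 2, 1, 0, 2])
indptr = np.array([0, 2, 3, 5])
print(csr_matvec(data, indices, indptr, np.array([1.0, 1.0, 1.0])))  # [3. 3. 9.]
```

On GPUs the difficulty is doing this without wrecking memory coalescing, which is why unstructured sparsity often fails to deliver its nominal flop savings there.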
And the weights are getting streamed from external RAM … GPUs can’t stream a matrix multiplication efficiently, as far as I’m aware.
Of course GPUs can and do stream a larger matrix multiplication from RAM; the difference is that the GPU design has multiple OOM more bandwidth to the equivalent external RAM (about 3 OOM, to be more specific). Also, the latest Lovelace/Hopper GPUs have more SRAM now, 50MB per chip, so about 10GB of SRAM for a 200-GPU pod similar to the Cerebras wafer.
The CS-2 is only good at matrix-vector operations that fit in its SRAM capacity. As a thought experiment, consider running a brain-like ANN with 10B neurons and 10T sparse weights. Simulating one second of activity requires only on the order of 10T sparse ops, or a couple OOM more dense ops, which is already within current single-GPU capability. The problem is that streaming in the 10TB of weight data would take over a minute on the CS-2′s pathetically slow IO path. Meanwhile, the equivalently priced 200-GPU pod can fit the weights in GPU RAM and has the performance to simulate about a hundred instances of that brain-sized model in real time, so about 10,000x higher performance than the CS-2.
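Checking the thought experiment's arithmetic, using the figures already cited in this discussion:

```python
# Time for one full pass over the weights: streaming 10 TB over the CS-2's
# 150 GB/s external IO path vs reading it from local HBM across a 200-GPU
# pod at ~3 TB/s per GPU. All figures are the ones quoted in the thread.

weights_tb = 10
cs2_io_gb_s = 150           # CS-2 external IO bandwidth
gpu_ram_gb_s = 3000         # per-GPU HBM bandwidth, H100-class
n_gpus = 200

stream_s = weights_tb * 1000 / cs2_io_gb_s            # ~67 s per pass
pod_s = weights_tb * 1000 / (gpu_ram_gb_s * n_gpus)   # ~0.017 s per pass
print(stream_s, pod_s, stream_s / pod_s)              # ratio ~4000x
```

So a single weight pass that the pod does in hundredths of a second takes the CS-2 about a minute, before counting any compute at all.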
Weights outnumber activations by 3 or 4 OOM, so moving weights over long distances (as in the CS-2) is enormously inefficient compared to moving the activations around (as in the GPU design), which uses very little bandwidth. The future is in the opposite direction of the CS-2’s ‘weight streaming’: towards more optimal neuromorphic computing, where the weights stay in place and the activations flow through them.
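A one-line sanity check of that ratio, using a transformer-style layer as the example (the choice of d_model = 4096 is an assumption for illustration):

```python
# Per token, a d_model x d_model weight matrix contributes ~d_model^2 bytes
# of weight traffic if streamed, while the activation vector it consumes is
# only ~d_model bytes. The ratio is therefore ~d_model, i.e. 3-4 OOM for
# large models, which is why shipping activations is so much cheaper.

def traffic_ratio(d_model):
    weight_bytes = d_model * d_model   # one square weight matrix
    act_bytes = d_model                # one activation vector per token
    return weight_bytes / act_bytes

print(traffic_ratio(4096))   # ~4096x, i.e. ~3.6 OOM, matching the claim
```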
My understanding is that they fully separate computation and memory storage. So while traditional architectures need some kind of cache to store a large amount of data for model partitions, from which just a small portion is used for computation at any single time point, the CS-2 only requests what it needs, so the bandwidth doesn’t need to be so big.