So if streaming works as well as Cerebras claims, GPUs can do that as well or better.
Hmm, I’m still not sure I buy this, after spending some more time thinking about it. GPUs can’t stream a matrix multiplication efficiently, as far as I’m aware. My understanding is that they’re not very good at matrix-vector operations compared to matrix-matrix because they rely on blocked matrix multiplies to efficiently use caches and avoid pulling weights from RAM every time.
Cerebras says that the CS-2 is specifically designed for fast matrix-vector operations, and uses dataflow scheduling, so it can stream a matrix multiplication by just performing matrix-vector operations as weights stream in. And the weights are getting streamed from external RAM, rather than requested as needed, so there’s no round-trip latency gunking up the works like a GPU has when it wants data from RAM.
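To make the streaming-matvec idea concrete, here's a toy NumPy sketch (my own illustration, not Cerebras' actual dataflow scheduler): the full weight matrix never has to be resident at once; each block is consumed as it arrives, and only the activations stay put.

```python
import numpy as np

# Toy illustration of weight streaming: compute y = W @ x by consuming W in
# row blocks, doing one small matrix-vector product per block. Only the
# current block of weights (plus x and the partial outputs) must be on-chip.
def stream_matvec(weight_blocks, x):
    """weight_blocks: iterable yielding consecutive row blocks of W."""
    partials = [block @ x for block in weight_blocks]  # one matvec per arriving block
    return np.concatenate(partials)

# Example: a 1024x512 W streamed in 64-row blocks matches the full W @ x.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 512))
x = rng.standard_normal(512)
blocks = (W[i:i + 64] for i in range(0, 1024, 64))
assert np.allclose(stream_matvec(blocks, x), W @ x)
```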
I agree sparsity (and probably streaming too) will be increasingly important; I’ve actually developed new techniques for sparse matrix multiplication on GPUs.
Cerebras claims that their hardware support for fast matrix-vector multiplication gives a 10x speed boost to multiplying sparse matrices, which could be helpful.
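For a rough sense of where a ~10x figure could come from (generic arithmetic, not Cerebras' benchmark): at ~90% unstructured sparsity only about a tenth of the multiply-accumulates remain, provided the hardware can actually skip the zeros at fine granularity.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# At 90% unstructured sparsity a sparse matvec needs ~10x fewer
# multiply-accumulates than the dense one -- *if* zeros can be skipped at
# fine granularity, which is the hardware capability Cerebras claims.
n = 4096
W = sparse_random(n, n, density=0.10, format="csr", random_state=0)  # 10% nonzero
x = np.random.default_rng(0).standard_normal(n)

dense_macs = n * n
sparse_macs = W.nnz
print(f"work ratio: {dense_macs / sparse_macs:.1f}x")  # ~10x

y = W @ x  # scipy only touches the stored nonzeros
```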
“And the weights are getting streamed from external RAM ... GPUs can’t stream a matrix multiplication efficiently, as far as I’m aware.”
Of course GPUs can and do stream a larger matrix multiplication from RAM; the difference is that the GPU design has multiple OOM more bandwidth to the equivalent external RAM (about 3 OOM, to be more specific). Also, the latest Lovelace/Hopper GPUs have more SRAM now, around 50MB per chip, so about 10GB of SRAM for a 200-GPU pod comparable to the Cerebras wafer.
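Back-of-envelope arithmetic for those figures (the per-GPU numbers are rough assumptions for a Hopper/Lovelace-class part, not quoted specs):

```python
# Rough per-GPU figures, assumed for illustration rather than taken from spec sheets.
gpus = 200
sram_per_gpu_bytes = 50e6      # ~50 MB of on-chip SRAM per GPU
hbm_bw_per_gpu = 3e12          # ~3 TB/s of HBM bandwidth per GPU

pod_sram_gb = gpus * sram_per_gpu_bytes / 1e9
pod_ram_bw_tbs = gpus * hbm_bw_per_gpu / 1e12

print(f"pod SRAM: ~{pod_sram_gb:.0f} GB")                # ~10 GB
print(f"pod RAM bandwidth: ~{pod_ram_bw_tbs:.0f} TB/s")  # hundreds of TB/s in aggregate
```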
The CS-2 is only good at matrix-vector operations that fit in its SRAM capacity. As a thought experiment, consider running a brain-like ANN with 10B neurons and 10T sparse weights. Simulating one second of activity requires only on the order of 10T sparse ops (or a couple OOM more dense ops), which is already within the capability of a single current GPU. The problem is that streaming in the 10TB of weight data would take several minutes over the CS-2's pathetically slow IO path. Meanwhile, the equivalently priced 200-GPU pod can fit the weights in GPU RAM and has the performance to simulate about a hundred instances of that brain-sized model in real time, so about 10000x higher performance than the CS-2.
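Rough numbers behind that thought experiment; the per-GPU figures and the CS-2 streaming rate here are assumptions picked for illustration, so treat the final ratio as order-of-magnitude only.

```python
# All figures are assumptions: 1 byte per sparse weight, ~50 GB/s of external
# weight-streaming bandwidth for the CS-2, and 80 GB of HBM at ~3 TB/s per GPU.
weights = 10e12                      # 10T sparse weights
weight_bytes = weights * 1           # ~10 TB at 1 byte/weight

# CS-2: the weights must stream in from external memory for every simulated step.
cs2_stream_bw = 50e9                                     # bytes/s (assumed)
cs2_time = weight_bytes / cs2_stream_bw                  # ~200 s per model-second

# 200-GPU pod: the weights fit in aggregate HBM, so each pass reads local RAM.
gpus, hbm_per_gpu, hbm_bw_per_gpu = 200, 80e9, 3e12
assert gpus * hbm_per_gpu >= weight_bytes                # 16 TB of HBM holds the 10 TB
pod_time = weight_bytes / (gpus * hbm_bw_per_gpu)        # ~17 ms per model-second

print(f"CS-2: {cs2_time:.0f} s per simulated second")
print(f"pod:  {pod_time * 1e3:.0f} ms per simulated second "
      f"(~{1 / pod_time:.0f} real-time instances from bandwidth alone)")
print(f"ratio: ~{cs2_time / pod_time:.0f}x")             # on the order of 10^4
```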
Weights outnumber activations by 3 or 4 OOM, so moving weights over long distances (as in the CS-2) is enormously inefficient compared to moving the activations around (as in the GPU design), which uses very little bandwidth. The future lies in the opposite direction from CS-2-style ‘weight streaming’: toward more optimal neuromorphic computing, where the weights stay in place and the activations flow through them.
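The asymmetry is easy to see with the same (assumed) thought-experiment numbers:

```python
# Per simulated step, a weight-streaming design moves every weight past the
# compute, while a keep-weights-in-place design only moves the activations.
weights = 10e12          # 10T weights
activations = 10e9       # 10B neuron activations
print(f"weight traffic / activation traffic: {weights / activations:.0f}x")  # ~1000x, i.e. 3 OOM
```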
My understanding is that they fully separate computation and memory storage. So while traditional architectures need some kind of cache to hold a large chunk of the model partition, of which only a small portion is used for computation at any single point in time, the CS-2 only requests what it needs, so the bandwidth doesn’t need to be as big.