And the weights are getting streamed from external RAM; GPUs can't stream a matrix multiplication efficiently, as far as I'm aware.
Of course GPUs can and do stream a larger matrix multiplication from RAM. The difference is that the GPU design has multiple OOM more bandwidth to the equivalent external RAM (about 3 OOM, to be more specific). Also, the latest Lovelace/Hopper GPUs now have more SRAM, 50MB per chip, so about 10GB of SRAM for a 200 GPU pod, comparable to the Cerebras wafer.
The CS-2 is only good at matrix-vector operations that fit in its SRAM capacity. As a thought experiment, consider running a brain-like ANN with 10B neurons and 10T sparse weights. Simulating one second of activity requires only on the order of 10T sparse ops, or a couple OOM more dense ops, which is already within current single GPU capability. The problem is that streaming in the 10TB of weight data would take several minutes on the CS-2's pathetically slow IO path. Meanwhile, the equivalently priced 200 GPU pod can fit the weights in GPU RAM and has the performance to simulate about a hundred instances of that brain-sized model in real time, so about 10000x higher performance than the CS-2.
Weights outnumber activations by 3 or 4 OOM, so moving weights over long distances as in the CS-2 is enormously inefficient compared to moving the activations around (as in the GPU design), which uses very little bandwidth. The future is in the opposite direction of that CS-2 'weight streaming': towards more optimal neuromorphic computing, where the weights stay in place and the activations flow through them.
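For anyone who wants to check the arithmetic in the thought experiment, here is a rough back-of-envelope sketch. The model sizes come from the comment above; the bandwidth figures and the one-byte-per-weight encoding are placeholder assumptions picked only to be in a plausible range, not vendor specs, so swap in your own numbers.

```python
import math

def stream_seconds(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move num_bytes once over a link with the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s

# From the thought experiment above.
NUM_WEIGHTS = 10e12        # 10T sparse weights
NUM_NEURONS = 10e9         # 10B neurons
NUM_GPUS = 200

# Assumptions (hypothetical, for illustration only).
BYTES_PER_WEIGHT = 1       # gives the 10TB figure quoted above
EXTERNAL_IO_BW = 100e9     # assumed external IO path: 100 GB/s
GPU_MEM_BW = 2e12          # assumed per-GPU memory bandwidth: 2 TB/s

weight_bytes = NUM_WEIGHTS * BYTES_PER_WEIGHT

# One full pass over the weights through the external IO path.
external_pass_s = stream_seconds(weight_bytes, EXTERNAL_IO_BW)

# One full pass when the weights are sharded across the pod's GPU RAM.
pod_pass_s = stream_seconds(weight_bytes, GPU_MEM_BW * NUM_GPUS)

# Bandwidth gap in orders of magnitude.
bandwidth_gap_oom = math.log10((GPU_MEM_BW * NUM_GPUS) / EXTERNAL_IO_BW)

# Weights per neuron, a rough proxy for the weights-to-activations ratio.
weights_per_neuron = NUM_WEIGHTS / NUM_NEURONS

print(f"external pass: {external_pass_s:.0f} s")
print(f"pod pass: {pod_pass_s * 1e3:.0f} ms")
print(f"bandwidth gap: {bandwidth_gap_oom:.1f} OOM")
print(f"weights per neuron: {weights_per_neuron:.0f}")
```

Under these assumed numbers a single pass over the weights takes on the order of minutes through the external path versus tens of milliseconds inside the pod, and the weights-per-neuron ratio lands at 3 OOM, which is the gap the argument rests on.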
My understanding is that they fully separate computation and memory storage. So while traditional architectures need some kind of cache to store large amounts of data for model partitions, of which just a small portion is used for the computation at any single point in time, the CS-2 only requests what it needs, so the bandwidth doesn't need to be so big.