Importantly, by “faster” here we are talking about latency. The ops/sec possible on a Cerebras-style chip will not be that drastically higher than a similar area of silicon for a GPU; it’s conceivable you could get 100x higher ops/sec due to locality, if I remember my numbers right. But with drastically lower latency, the model only has to decide which part of itself to run at full frequency, and then it would in fact be able to think drastically faster than humans.
Though you can also do a similar trick on GPUs: it wouldn’t be too hard to design an architecture with a small, very-high-frequency portion of the network, and there’s a recently released block design that’s already implemented in a way that could pull it off.
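To make the trick concrete, here is a minimal sketch of the general two-rate idea in plain Python/NumPy. All the names, dimensions, and the update schedule are made up for illustration, and this is not the specific block design mentioned above: a large “slow” module refreshes a context vector occasionally, while a small “fast” recurrent core steps on every tick at low latency.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a big slow module and a small fast core.
D_SLOW, D_FAST, SLOW_PERIOD = 4096, 256, 32

# "Slow" weights: the bulk of the model, too latency-bound to run every tick.
W_slow = rng.standard_normal((D_SLOW, D_FAST)) * 0.01
# "Fast" weights: a small recurrent core that can run at full clock rate.
W_fast = rng.standard_normal((D_FAST, D_FAST)) * 0.01
W_ctx = rng.standard_normal((D_FAST, D_FAST)) * 0.01

def slow_step(x_slow):
    """Expensive pass over the big network; produces a context vector."""
    return np.tanh(W_slow.T @ x_slow)

def fast_step(h, ctx):
    """Cheap recurrent update, small enough to run every tick at low latency."""
    return np.tanh(W_fast @ h + W_ctx @ ctx)

x_slow = rng.standard_normal(D_SLOW)  # stand-in for the slow module's input
h = np.zeros(D_FAST)
ctx = slow_step(x_slow)

for tick in range(200):
    if tick % SLOW_PERIOD == 0:
        ctx = slow_step(x_slow)  # big model refreshes context only occasionally
    h = fast_step(h, ctx)        # small model "thinks" on every tick
```

The point of the sketch is just the split: only the small core sits on the latency-critical path, so its step time sets the effective “thinking rate,” while the big network amortizes its cost across many fast ticks.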