Allow me to speculate wildly.
I don’t actually think this is going to make that big of a difference, at least for current AI research. The main reason is that I think the main hardware bottlenecks to better AI performance are performance/$, performance/W, and memory bandwidth. So far, most large-scale DL algorithms have shown almost embarrassingly parallel scaling, and a good amount of time is wasted just saving and reloading NN activations for the back-prop algorithm.
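To put a rough number on that activation traffic (my own back-of-envelope assumptions, not figures from the article), here’s what a single dense layer has to write out in the forward pass and read back in the backward pass:

```python
# Back-of-envelope sketch; the model shape below is hypothetical,
# chosen only for illustration.
batch = 64          # sequences per step (assumption)
seq_len = 512       # tokens per sequence (assumption)
d_model = 1024      # hidden width (assumption)
n_layers = 48       # depth (assumption)
bytes_per_val = 2   # fp16 activations

# Activations of one dense layer: stored during the forward pass,
# then re-read during the backward pass.
acts_per_layer = batch * seq_len * d_model * bytes_per_val

print(f"per layer:   {acts_per_layer / 1e6:.0f} MB written, then read back")
print(f"whole model: {n_layers * acts_per_layer / 1e9:.1f} GB of activation "
      f"traffic each way, every training step")
```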
This technology probably won’t lead to any major improvements in performance/$ or performance/W; those will have already come from dedicated DL chips such as Google’s TPUs, since this is essentially a really big dedicated DL chip. The major place for improvement is memory bandwidth, which according to the article is an impressive 9PB per second, about 10,000 times what’s on a V100 GPU. But with only 18GB of RAM, that’s going to severely constrain the size of models that can be trained, so I don’t think it will be useful for training better models.
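A quick sanity check on the 18GB point (my own assumptions about a standard mixed-precision Adam setup, nothing from the article): weights plus optimizer state alone eat into that budget fast, before you even count activations.

```python
# Rough fit check under assumed training state: fp16 weights, fp32 master
# weights, and two fp32 Adam moments (16 bytes per parameter total).
GB = 1024**3

def training_footprint_bytes(n_params: int) -> int:
    weights_fp16 = 2 * n_params
    master_fp32 = 4 * n_params
    adam_moments = 2 * 4 * n_params   # first and second moment estimates
    return weights_fp16 + master_fp32 + adam_moments

for n in (100e6, 500e6, 1.5e9):       # hypothetical model sizes
    gb = training_footprint_bytes(int(n)) / GB
    verdict = "fits" if gb <= 18 else "does NOT fit"
    print(f"{n/1e6:>6.0f}M params -> {gb:5.1f} GB of training state "
          f"({verdict} in 18 GB, before activations)")
```

Under those assumptions, anything much past a billion parameters already blows the on-chip budget on optimizer state alone.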
Might be good for inference though.
They also claim increased energy efficiency, since they skip the useless multiplications by zero that are common in matrix multiplication.
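As a toy illustration of why that helps (this is not their actual hardware mechanism, just the arithmetic): with sparse activations, most of the multiplies in a dense matmul have a zero operand and contribute nothing, so a design that detects and skips them saves the corresponding work.

```python
# Count how many multiplies in a dense matmul actually have a non-zero
# operand, under an assumed ~70% activation sparsity (e.g. after ReLU).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
A[rng.random(A.shape) < 0.7] = 0.0      # hypothetical sparsity level
B = rng.standard_normal((256, 256))

total_mults = A.shape[0] * A.shape[1] * B.shape[1]     # dense multiply count
useful_mults = int(np.count_nonzero(A)) * B.shape[1]   # multiplies with a non-zero A operand

print(f"dense multiplies:  {total_mults:,}")
print(f"useful multiplies: {useful_mults:,} "
      f"({100 * useful_mults / total_mults:.0f}% of the dense count)")
```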