Papers like the recent one on eliminating matrix multiplication suggest that there is no need for warehouses full of GPUs to train advanced AI systems.
The paper is about getting rid of multiplication in inference, not in training (specifically, it focuses on attention rather than MLPs). Quantization-aware training creates models with extreme levels of quantization that are not much worse than full-precision models (this currently can't be done post-training, if training itself wasn't built around targeting this outcome). The important recent result is ternary quantization, where MLP weights become {-1, 0, +1}, so that multiplying by such a matrix no longer requires any actual multiplications, only additions and subtractions. So this is relevant for making inference cheaper or for running models locally.
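To make the ternary point concrete, here is a minimal sketch (not from the paper; the function name and shapes are illustrative) of why applying a {-1, 0, +1} weight matrix needs only additions and subtractions:

```python
import numpy as np

def ternary_matvec(W, x):
    """Compute W @ x without scalar multiplications.

    W: (m, n) matrix with entries in {-1, 0, +1}
    x: (n,) activation vector
    """
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:
                out[i] += x[j]   # +1 weight: add the activation
            elif W[i, j] == -1:
                out[i] -= x[j]   # -1 weight: subtract it
            # 0 weight: contributes nothing, skip entirely
    return out

# Sanity check against an ordinary matmul
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8)).astype(np.float32)
x = rng.standard_normal(8).astype(np.float32)
assert np.allclose(ternary_matvec(W, x), W @ x)
```

A real kernel would vectorize this rather than loop, but the point stands: every weight application is an add, a subtract, or a no-op, which is what makes cheap inference hardware plausible.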
Good point.