If I understand correctly (I very well might not), a "one bit LLM" has to be trained as a "one bit LLM" in order to then run inference on it as a "one bit LLM". I.e., this isn't just a new quantization scheme you can apply to an existing model after the fact.
So I think training and inference are tied together here, meaning: if this replicates and works, we will probably see new hardware for both stages.
I don't see them mention anything about training efficiency anywhere, so I don't think it is really legit 1.58-bit training in any meaningful sense.
Training doesn't become more efficient: gradients and activations are still full precision, and I'm guessing a full-precision copy of the weights is maintained during training (in addition to the quantized weights used for forward passes). The advantage is that this method of training produces a quantized model with the same quality as a non-quantized model (unlike post-training quantization, which degrades quality). Additionally, the {-1, 0, 1} quantization means you need much less multiplication circuitry for inference, so the potential of inference chips isn't just lower memory use but also less energy and fewer transistors, which significantly raises the practical ceiling for local (on-device) inference.
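A minimal sketch of what that training setup might look like (my reading of the general quantization-aware-training recipe, not the paper's actual code): keep a full-precision copy of the weights for the optimizer, ternarize them on the fly for the forward pass, and use a straight-through estimator so gradients flow back to the full-precision copy. The per-tensor abs-mean scale and the TernaryLinear name are my own assumptions for illustration.

    # Quantization-aware training sketch with ternary {-1, 0, +1} weights.
    # Assumptions: per-tensor abs-mean scaling, straight-through estimator (STE);
    # illustrative only, not the paper's implementation.
    import torch
    import torch.nn as nn

    def ternarize(w: torch.Tensor) -> torch.Tensor:
        # Scale weights, round to the nearest of {-1, 0, +1}, then rescale.
        scale = w.abs().mean()
        q = torch.clamp(torch.round(w / (scale + 1e-8)), -1, 1)
        return q * scale

    class TernaryLinear(nn.Module):
        def __init__(self, in_features: int, out_features: int):
            super().__init__()
            # Full-precision "shadow" weights: these are what the optimizer updates.
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            w_q = ternarize(self.weight)
            # STE: the forward pass sees the ternary weights, the backward pass
            # treats quantization as the identity, so gradients update self.weight.
            w_ste = self.weight + (w_q - self.weight).detach()
            return x @ w_ste.t()

At inference time only the ternary weights (plus one scale per tensor) need to be stored, and every dot product reduces to additions and subtractions, which is where the hardware savings would come from.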
It's apparently not a novel idea; quantization-aware training was explored before there were transformers:
P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, D. Modha (2016). Deep neural networks are robust to weight binarization and other non-linear distortions.