I don’t think it can be patched for training to make training itself 1.58 bit (95% confident). I think training (not inference) is where most of the money goes to and comes from, so the hardware market will not be affected (90%).
Even in the small inference market, chip companies already have 4-8 bit inference accelerators in the oven (99%); they will not judge the benefits of 1.58 bit to be worth the risk of such specialized hardware, so nobody will build more than 100 1-bit or 1.58-bit inference chips (80%).
Old-fashioned CPUs have at most 32 threads, so they will still be slow as heck at running NNs (90%).
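(For what it’s worth, my understanding of the name: with weights restricted to the three values {-1, 0, 1}, each weight carries log2(3) ≈ 1.58 bits of information, which is where the “1.58 bit” figure comes from.)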
I think your question is quite important.
If I understand correctly (I very well might not), a “one-bit LLM” has to be trained as a “one-bit LLM” in order to then run inference on it as a “one-bit LLM”. I.e., this isn’t just a new quantization scheme you can apply to an already-trained model.
So I think training and inference are tied together here, meaning: if this replicates, works, etc., we will probably have new hardware for both stages.
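A minimal sketch of what “trained as a one-bit LLM” could look like in practice, assuming the usual quantization-aware-training recipe (full-precision master weights, ternary weights in the forward pass, straight-through gradients). Everything here (the TernaryLinear layer, the absmean-style scaling, the training step) is an illustrative assumption, not the paper’s actual code:

```python
# Sketch of quantization-aware training with ternary {-1, 0, 1} weights.
# Hypothetical illustration, not the BitNet b1.58 implementation: the layer
# name, scaling rule, and threshold below are assumptions for the example.
import torch
import torch.nn as nn
import torch.nn.functional as F

def ternarize(w: torch.Tensor) -> torch.Tensor:
    # Scale by the mean absolute value, then round each weight to -1, 0, or +1.
    scale = w.abs().mean().clamp(min=1e-8)
    return (w / scale).round().clamp(-1, 1) * scale

class TernaryLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Full-precision "master" weights are what the optimizer actually updates.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Straight-through estimator: the forward pass sees ternarized weights,
        # but the gradient flows to the full-precision weights as if no
        # quantization had happened.
        w_q = w + (ternarize(w) - w).detach()
        return F.linear(x, w_q)

# One ordinary full-precision training step; only the weights used in the
# forward pass are ternary, so the trained model can later be exported as a
# ternary network (keeping only ternarize(layer.weight) and the scale).
layer = TernaryLinear(16, 4)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x, target = torch.randn(8, 16), torch.randn(8, 4)
loss = F.mse_loss(layer(x), target)
loss.backward()
opt.step()
```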
I don’t see them mention anything about training efficiency anywhere, so I don’t think it is really legit 1.58-bit training in a meaningful sense.
Training doesn’t become more efficient: gradients and activations are still full precision, and I’m guessing there is a full-precision copy of the weights maintained during training (in addition to the quantized weights used for forward passes). The advantage is that this method of training produces a quantized model with the same quality as a non-quantized model (unlike post-training quantization, which makes models worse). Additionally, the {-1, 0, 1} quantization means you need much less multiplication circuitry for inference, so the potential for inference chips is not just less memory, but also less energy and fewer transistors, significantly raising the practical ceiling for local (on-device) inference.
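A toy illustration of the “much less multiplication circuitry” point: once weights are restricted to {-1, 0, +1}, a dot product needs no multiplications at all, only additions, subtractions, and skips. The function below is just a sketch of the arithmetic, not how an accelerator would actually be built:

```python
# Toy sketch of why ternary weights remove the need for multipliers at
# inference time: with weights in {-1, 0, +1}, a matrix-vector product
# reduces to adding, subtracting, or skipping activations.
import numpy as np

def ternary_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            w = W[i, j]
            if w == 1:
                out[i] += x[j]      # +1 weight: plain addition
            elif w == -1:
                out[i] -= x[j]      # -1 weight: plain subtraction
            # 0 weight: skip entirely, no work at all
    return out

W = np.array([[1, 0, -1], [0, 1, 1]], dtype=np.int8)
x = np.array([0.5, -2.0, 3.0], dtype=np.float32)
print(ternary_matvec(W, x))           # matches the ordinary matmul below
print(W.astype(np.float32) @ x)
```

In hardware the saving comes from replacing multiply-accumulate units with adders and from skipping zero weights entirely, which is the “less energy and fewer transistors” point above.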
It’s apparently not a novel idea; quantization-aware training was explored before there were transformers:
P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, D. Modha (2016). Deep neural networks are robust to weight binarization and other non-linear distortions.