If I understand correctly (I very well might not), a "one bit LLM" has to be trained as a "one bit LLM" in order to then run inference on it as a "one bit LLM". I.e., this isn't just a new quantization scheme you can apply to an existing model after the fact.
So I think training and inference are tied together here, meaning: if this replicates and works, we will probably see new hardware for both stages.
I don't see them mention anything about training efficiency anywhere, so I don't think it is really legit 1.58-bit training in any meaningful sense.
Training doesn't become more efficient: gradients and activations are still full precision, and I'm guessing a full-precision copy of the weights is maintained during training (in addition to the quantized weights used for forward passes). The advantage is that this method of training produces a quantized model with the same quality as a non-quantized model (unlike post-training quantization, which degrades quality). Additionally, the {-1, 0, 1} quantization means you need much less multiplication circuitry for inference, so the potential of inference chips isn't just lower memory use but also less energy and fewer transistors, which significantly raises the practical ceiling for local (on-device) inference.
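A minimal sketch of what that training setup might look like (my reading of the general quantization-aware-training recipe, not the paper's actual code): keep a full-precision copy of the weights for the optimizer, ternarize them on the fly for the forward pass, and use a straight-through estimator so gradients flow back to the full-precision copy. The per-tensor abs-mean scale and the TernaryLinear name are my own assumptions for illustration.

    # Quantization-aware training sketch with ternary {-1, 0, +1} weights.
    # Assumptions: per-tensor abs-mean scaling, straight-through estimator (STE);
    # illustrative only, not the paper's implementation.
    import torch
    import torch.nn as nn

    def ternarize(w: torch.Tensor) -> torch.Tensor:
        # Scale weights, round to the nearest of {-1, 0, +1}, then rescale.
        scale = w.abs().mean()
        q = torch.clamp(torch.round(w / (scale + 1e-8)), -1, 1)
        return q * scale

    class TernaryLinear(nn.Module):
        def __init__(self, in_features: int, out_features: int):
            super().__init__()
            # Full-precision "shadow" weights: these are what the optimizer updates.
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            w_q = ternarize(self.weight)
            # STE: the forward pass sees the ternary weights, the backward pass
            # treats quantization as the identity, so gradients update self.weight.
            w_ste = self.weight + (w_q - self.weight).detach()
            return x @ w_ste.t()

At inference time only the ternary weights (plus one scale per tensor) need to be stored, and every dot product reduces to additions and subtractions, which is where the hardware savings would come from.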
It's apparently not a novel idea; quantization-aware training was explored before there were transformers:
P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, D. Modha (2016). Deep neural networks are robust to weight binarization and other non-linear distortions.