The paper is not about post-training quantization; it's quantization-aware training (this is discussed more clearly in the original BitNet paper). The weight representation is ternary {-1, 0, 1} from the start: the network learns to cope with that constraint throughout pre-training instead of being subjected to the brain damage of quantizing after training.
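A minimal NumPy sketch of the idea, using the absmean scaling described in the BitNet b1.58 paper (the exact training-time machinery, e.g. the straight-through estimator for gradients, is omitted here):

```python
import numpy as np

def ternary_quantize(W, eps=1e-8):
    """Map a real-valued weight matrix to {-1, 0, 1} times a scale.

    Sketch of absmean quantization: divide by the mean absolute
    weight, then round and clip to the ternary grid. In QAT this runs
    in the forward pass while gradients flow to the latent full-
    precision weights via a straight-through estimator.
    """
    gamma = np.mean(np.abs(W)) + eps          # per-tensor scale
    Wq = np.clip(np.round(W / gamma), -1, 1)  # ternary values
    return Wq, gamma

W = np.array([[0.9, -0.05, 0.4],
              [-1.2, 0.02, 0.6]])
Wq, gamma = ternary_quantize(W)
# Wq is [[1, 0, 1], [-1, 0, 1]]; small weights collapse to 0,
# large ones saturate at +/-1, and gamma carries the magnitude.
```

The point is that the network never sees anything but this ternary view of its weights during pre-training, so it has the entire training run to route around the constraint.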
Compare this with
BD Rouhani et al. (Oct 2023) Microscaling Data Formats for Deep Learning
where the Microscaling block number format is used to train a transformer at essentially 4 bits per weight, achieving the same perplexity as with 32-bit floating-point weights (see Figure 4 on page 7). If perplexity doesn't change for quantization-aware training when going down to 4 bits, it's not too shocking that it doesn't change significantly at 1.6 bits either.
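For intuition, here is a simplified sketch of the Microscaling idea: a block of values shares one power-of-two scale, and each element is stored in a few bits. (The real MX formats use floating-point element encodings like E2M1 and a specific block size; this toy version uses a symmetric integer grid instead, so it illustrates the block-scaling structure, not the exact spec.)

```python
import numpy as np

def mx_quantize_block(x, elem_bits=4):
    """Fake-quantize one block with a shared power-of-two scale.

    Toy version of block scaling: pick a power-of-two scale so the
    largest element fits the elem_bits-wide symmetric integer grid,
    then round every element to that grid and dequantize.
    """
    qmax = 2 ** (elem_bits - 1) - 1  # e.g. +/-7 for 4-bit elements
    scale = 2.0 ** np.ceil(np.log2(np.max(np.abs(x)) / qmax + 1e-30))
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                 # dequantized block

rng = np.random.default_rng(0)
x = rng.standard_normal(32).astype(np.float32)  # one 32-element block
xq = mx_quantize_block(x)
# Every element of xq now lies on a 15-point grid shared by the block.
```

Because the scale is per-block rather than per-tensor, outliers in one block don't destroy the resolution available to the rest of the weights, which is a large part of why training survives at ~4 bits per element.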