The paper is not about post-training quantization; it's quantization-aware training (this is discussed more clearly in the original BitNet paper). The weight representation is ternary {-1, 0, 1} from the start: the network learns to cope with that constraint throughout pre-training instead of being subjected to the brain damage of quantizing after training.
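A minimal NumPy sketch of the idea, using the absmean scaling described in the BitNet b1.58 paper (the exact training-time machinery, e.g. the straight-through estimator for gradients, is omitted here):

```python
import numpy as np

def ternary_quantize(W, eps=1e-8):
    """Map a real-valued weight matrix to {-1, 0, 1} times a scale.

    Sketch of absmean quantization: divide by the mean absolute
    weight, then round and clip to the ternary grid. In QAT this runs
    in the forward pass while gradients flow to the latent full-
    precision weights via a straight-through estimator.
    """
    gamma = np.mean(np.abs(W)) + eps          # per-tensor scale
    Wq = np.clip(np.round(W / gamma), -1, 1)  # ternary values
    return Wq, gamma

W = np.array([[0.9, -0.05, 0.4],
              [-1.2, 0.02, 0.6]])
Wq, gamma = ternary_quantize(W)
# Wq is [[1, 0, 1], [-1, 0, 1]]; small weights collapse to 0,
# large ones saturate at +/-1, and gamma carries the magnitude.
```

The point is that the network never sees anything but this ternary view of its weights during pre-training, so it has the entire training run to route around the constraint.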
Compare this with
BD Rouhani et al. (Oct 2023) Microscaling Data Formats for Deep Learning
where the Microscaling block number format is used to train a transformer at essentially 4 bits per weight, achieving the same perplexity as with 32-bit floating-point weights (see Figure 4 on page 7). If perplexity doesn't change for quantization-aware training when going down to 4 bits, it's not too shocking that it doesn't change significantly at 1.6 bits either.
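For intuition, here is a simplified sketch of the Microscaling idea: a block of values shares one power-of-two scale, and each element is stored in a few bits. (The real MX formats use floating-point element encodings like E2M1 and a specific block size; this toy version uses a symmetric integer grid instead, so it illustrates the block-scaling structure, not the exact spec.)

```python
import numpy as np

def mx_quantize_block(x, elem_bits=4):
    """Fake-quantize one block with a shared power-of-two scale.

    Toy version of block scaling: pick a power-of-two scale so the
    largest element fits the elem_bits-wide symmetric integer grid,
    then round every element to that grid and dequantize.
    """
    qmax = 2 ** (elem_bits - 1) - 1  # e.g. +/-7 for 4-bit elements
    scale = 2.0 ** np.ceil(np.log2(np.max(np.abs(x)) / qmax + 1e-30))
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                 # dequantized block

rng = np.random.default_rng(0)
x = rng.standard_normal(32).astype(np.float32)  # one 32-element block
xq = mx_quantize_block(x)
# Every element of xq now lies on a 15-point grid shared by the block.
```

Because the scale is per-block rather than per-tensor, outliers in one block don't destroy the resolution available to the rest of the weights, which is a large part of why training survives at ~4 bits per element.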