Are there any signs to be found in public that anyone is training 10B+ LLMs in a precision that is not 16 bits? There are experiments that are specifically about precision on smaller LLMs, but they don’t seem to get adopted in practice for larger models, despite the obvious advantage of getting to 2x the compute.
DeepSeek-V3 is one example: its technical report describes training with an FP8 mixed-precision framework. SemiAnalysis has also claimed that most frontier labs use FP8.
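For context on what moving below 16 bits actually trades away, here is a minimal sketch comparing the dynamic range and relative precision of the common training formats, computed from their (exponent bits, mantissa bits) layouts under the usual IEEE-style conventions. The format parameters are standard; the helper name is just for illustration.

```python
# Rough comparison of floating-point formats relevant to low-precision
# training: largest normal value and relative precision (machine epsilon).
# Each format is described by (exponent bits, mantissa bits), assuming an
# IEEE-style layout with exponent bias 2^(e-1) - 1.

def fmt_stats(exp_bits, man_bits):
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias   # all-ones exponent reserved for Inf/NaN
    max_normal = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    eps = 2.0 ** -man_bits                 # spacing between 1.0 and the next value
    return max_normal, eps

for name, e, m in [("FP16", 5, 10), ("BF16", 8, 7),
                   ("FP8 E5M2", 5, 2), ("FP8 E4M3", 4, 3)]:
    mx, eps = fmt_stats(e, m)
    print(f"{name:9s} max ~ {mx:.4g}  eps = {eps:g}")

# Caveat: the OCP FP8 E4M3 variant used in practice reclaims most of the
# all-ones-exponent encodings for finite values (keeping a single NaN),
# so its real maximum is 448, larger than this generic formula suggests.
```

The takeaway is that FP8 halves the mantissa relative to FP16, so per-tensor scaling (as in DeepSeek-V3's recipe) is needed to keep activations and gradients inside the representable range.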