So $1 buys 7e17 useful FLOPs, or inference with 75-120B[1] active parameters for 1 million tokens.
Is this right? My impression was that the 6ND (or 9.6ND) estimate was for training, not inference. E.g. the original scaling law paper states:
C ≈ 6NBS – an estimate of the total non-embedding training compute, where B is the batch size, and S is the number of training steps (ie parameter updates).
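(In that notation $BS$ is just the total number of training tokens, i.e. the $D$ above, so the two forms agree:)

$$C \approx 6NBS = 6ND, \qquad D = BS.$$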
Yes, my mistake, thank you. It should be roughly 2ND when not computing gradients: the forward pass costs about 2 FLOPs per parameter per token, and the backward pass accounts for the remaining ~4ND in the training estimate. I’ll track down the details shortly.
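For what it's worth, a quick back-of-the-envelope sketch in Python, using only the figures quoted above (7e17 useful FLOPs per dollar, 1 million tokens, and the 6ND / 9.6ND / 2ND coefficients); everything else is illustrative:

```python
# Back-of-the-envelope check using the numbers from this thread:
# 7e17 "useful" FLOPs per dollar, 1 million generated tokens,
# cost coefficients of 6N / 9.6N (training-style) vs 2N (forward pass only).

FLOPS_PER_DOLLAR = 7e17   # useful FLOPs bought for $1 (figure from the quoted comment)
TOKENS = 1e6              # D: number of tokens processed

def max_active_params(flops_budget: float, tokens: float, flops_per_param_per_token: float) -> float:
    """Largest N such that flops_per_param_per_token * N * tokens fits in the budget."""
    return flops_budget / (flops_per_param_per_token * tokens)

for label, coeff in [("6ND (training estimate)", 6.0),
                     ("9.6ND", 9.6),
                     ("2ND (no gradients)", 2.0)]:
    n = max_active_params(FLOPS_PER_DOLLAR, TOKENS, coeff)
    print(f"{label}: ~{n / 1e9:.0f}B active parameters for $1")

# 9.6ND -> ~73B and 6ND -> ~117B (the 75-120B range quoted above);
# 2ND   -> ~350B once gradients aren't being computed.
```

Swapping the 6ND coefficient for 2ND triples the affordable active-parameter count at the same price.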