Sully likes Claude Haiku 3.5 but notes that it's in a weird spot after the price increase: it costs a lot more than other small models.
The price is $1 per million input tokens (which are compute bound, so easier to cost out than output tokens), while Llama-3-405B costs $3.50. At $2 per H100-hour we buy 3600 seconds of 1e15 FLOP/s at say 40% utilization, i.e. $1.4e-18 per useful FLOP. So $1 buys about 7e17 useful FLOPs, enough for inference with 75-120B[1] active parameters over 1 million tokens. That's with zero margin and perfect batching, so the real figure should be smaller.
Edit: 6ND is wrong here; it counts the backward-pass (gradient) computation, which isn't done during inference. The corrected estimate would suggest the model could be even larger, but anchoring to open-weights API providers says otherwise and still points to about 100B.
[1] The standard compute estimate for a dense transformer is 6ND (N is the number of active parameters, D the number of tokens); a recent Tencent paper estimates about 9.6ND for a MoE model (see Section 2.3.1). With the same calculation I get 420B for $3.50 of Llama-3-405B (using 6ND, since it's dense), so that checks out.
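The arithmetic above can be sketched in a few lines (a minimal check using the comment's assumed numbers: $2/H100-hour, 1e15 FLOP/s peak, 40% utilization, and the 6ND / 9.6ND per-token compute estimates):

```python
# Back-of-envelope: how many active parameters does $1 per 1M input tokens buy?
# All constants are the comment's assumptions, not measured values.
peak_flops = 1e15          # assumed H100 throughput, FLOP/s
utilization = 0.4          # assumed fraction of peak actually achieved
dollars_per_hour = 2.0     # assumed H100 rental price

useful_flops_per_dollar = peak_flops * utilization * 3600 / dollars_per_hour
# = 7.2e17 useful FLOP per dollar (~$1.4e-18 per useful FLOP)

tokens = 1e6               # $1 of Haiku input buys 1M tokens
for label, k in [("dense (6ND)", 6.0), ("MoE (9.6ND)", 9.6)]:
    n_params = useful_flops_per_dollar / (k * tokens)
    print(f"{label}: ~{n_params / 1e9:.0f}B active parameters")
# dense (6ND): ~120B active parameters
# MoE (9.6ND): ~75B active parameters
```

This reproduces the 75-120B range quoted in the comment; with nonzero provider margin and imperfect batching the implied model would be smaller.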
So $1 buys 7e17 useful FLOPs, or inference with 75-120B[1] active parameters for 1 million tokens.
Is this right? My impression was that the 6ND (or 9.6ND) estimate was for training, not inference. E.g., the original scaling-law paper states:
C ~ 6 NBS – an estimate of the total non-embedding training compute, where B is the batch size, and S is the number of training steps (ie parameter updates).
Yes, my mistake, thank you. Should be 2ND or something when not computing gradients. I’ll track down the details shortly.
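Redoing the estimate with the forward-pass-only 2ND (same assumed $2/H100-hour, 1e15 FLOP/s, and 40% utilization as above) shows how much the bound loosens:

```python
# Inference uses only the forward pass: C ≈ 2ND, not the 6ND training estimate
# (6ND = 2ND forward + 4ND backward). Constants are the thread's assumptions.
useful_flops_per_dollar = 1e15 * 0.4 * 3600 / 2.0   # = 7.2e17, as before
tokens = 1e6                                         # $1 buys 1M input tokens

n_params = useful_flops_per_dollar / (2 * tokens)
print(f"2ND bound: ~{n_params / 1e9:.0f}B active parameters")
# 2ND bound: ~360B active parameters
```

So the zero-margin upper bound roughly triples, which is why the corrected estimate alone would permit a larger model, and the ~100B figure rests on the comparison with open-weights API pricing instead.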