there’s a reasonable chance that it won’t make sense for the companies to release the ‘true’ level-5 models because of inference expense and speed
Not really. Llama-3-405b goes for $3-5 per million output tokens with good speed, and it's Chinchilla-optimal for 4e25 FLOPs (at 40 tokens/parameter, higher than Chinchilla's 20, which is also consistent with findings in Imbue's CARBS). At 1e27 FLOPs (feasible compute with 100K H100s when training in FP8 for 6 months), we are only 25 times up from this in compute, which is about 5 times up in model size (square root of the compute increase) and maybe 2 times up in model depth (square root of the model size increase).
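As a sanity check on that arithmetic, here is a minimal sketch, assuming the usual compute ≈ 6 · params · tokens approximation and the square-root scalings described above (parameters with compute, depth with parameters):

```python
# Back-of-envelope check of the scaling arithmetic above.
# Assumptions: train compute ~ 6 * params * tokens; params scale as
# sqrt(compute); depth scales as sqrt(params).

params_405b = 405e9                    # Llama-3-405b parameter count
tokens_per_param = 40                  # ~40 tokens/parameter, as cited above
train_tokens = params_405b * tokens_per_param      # ~1.6e13 tokens
compute_405b = 6 * params_405b * train_tokens      # ~3.9e25 FLOPs, i.e. ~4e25

compute_next = 1e27                    # 100K H100s, FP8, ~6 months
compute_ratio = compute_next / compute_405b        # ~25x compute
size_ratio = compute_ratio ** 0.5                  # ~5x parameters (~2T dense)
depth_ratio = size_ratio ** 0.5                    # ~2.2x depth

print(f"405b train compute ~ {compute_405b:.1e} FLOPs")
print(f"compute x{compute_ratio:.0f}, params x{size_ratio:.1f} "
      f"(~{params_405b * size_ratio / 1e12:.1f}T), depth x{depth_ratio:.1f}")
```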
So a dense model at this scale should cost about $15-50 per million tokens (Claude 3 Opus goes for $75 per million output tokens) and get maybe 2-3 times slower, so there is still some room for margin even at reasonable prices. With the more effective choice of training a MoE model (which is smarter at the same training compute cost, but harder to set up and needs a larger user base to serve efficiently), the inference cost might get somewhat higher, but it can still stay within last year's precedent. So it doesn't even need to be game-changingly better to be worth the price, just notably better. Also, next year's Blackwell is 2x faster and can do FP4 inference another 2x faster on top of that (which Hopper can't), though that's more relevant for input tokens.
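A rough sketch of the price and speed estimate, assuming per-token inference cost scales roughly linearly with parameter count and per-token latency roughly with depth; the extra factor of 2 on the high end is my own hedge for lower serving efficiency at this size, to match the $15-50 range above:

```python
# Illustrative serving-cost estimate for the dense case.
# Assumptions: price per output token scales ~linearly with parameter count;
# per-token latency scales ~with depth (sequential layers); the 2x on the
# high end is a hedge for worse serving efficiency, not a measured number.

llama_price_per_mtok = (3.0, 5.0)   # $/M output tokens, as quoted above
size_ratio = 5.0                    # ~5x parameters at 1e27 FLOPs (see above)
depth_ratio = 2.2                   # ~2.2x deeper

cost_low = llama_price_per_mtok[0] * size_ratio         # ~$15/M tokens
cost_high = llama_price_per_mtok[1] * size_ratio * 2    # ~$50/M with slack

print(f"estimated price: ${cost_low:.0f}-{cost_high:.0f} per M output tokens, "
      f"roughly {depth_ratio:.0f}-3x slower per token")
```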