The amount of inference compute isn’t baked in at pretraining time, so there is no tradeoff.
This doesn’t make sense to me.
In a subscription-based model, for example, companies would want to provide users the strongest completions for the least amount of compute.
If they estimate that customers will use 1 quadrillion tokens in total before the release of their next model, they have to decide how much of their compute to dedicate to training versus inference. As the parameters change (subscription price, anticipated users, fixed costs for a training run, etc.), you’d expect the optimal ratio to change as well.
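To make the kind of ratio I have in mind concrete, here is a rough sketch. It assumes the common rules of thumb of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token; the function name and the model size and training-token figures are hypothetical, chosen only for illustration. Only the 1 quadrillion served tokens comes from the example above.

```python
def compute_split(params, train_tokens, served_tokens):
    """Rough training vs. inference FLOPs for one dense model.

    Assumes ~6 FLOPs per parameter per training token and
    ~2 FLOPs per parameter per served token (rules of thumb, not exact).
    """
    train_flops = 6 * params * train_tokens
    inference_flops = 2 * params * served_tokens
    train_share = train_flops / (train_flops + inference_flops)
    return train_flops, inference_flops, train_share


if __name__ == "__main__":
    # Hypothetical model: 70B parameters, 15T training tokens,
    # serving 1 quadrillion tokens over its lifetime.
    train, infer, share = compute_split(70e9, 15e12, 1e15)
    print(f"training FLOPs:  {train:.2e}")
    print(f"inference FLOPs: {infer:.2e}")
    print(f"training share:  {share:.1%}")
```

Under these assumptions the training share depends only on the ratio of training tokens to served tokens, so changing the anticipated usage shifts the split directly. This ignores everything else in the list above (price, fixed costs, utilization), so it is a starting point rather than a model of the actual decision.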
Test-time compute on a single trace comes with a recommendation to cap reasoning tokens at 25K, so there might be 1-2 orders of magnitude more headroom there with longer context lengths. They are still not offering repeated sampling filtered by consensus or a reward model. If o1 proves sufficiently popular given its price, they might offer even more expensive options.
There are open-weights Llama 3 models, and using them doesn’t involve paying for pretraining. The compute used in frontier models is determined by the size of the largest cluster with the latest AI accelerators that hyperscaler money can buy, subject to the time it takes the engineers to get used to the next level of scale, not by any tradeoff with the cost of inference. Currently that’s about 100K H100s. This is the sense in which there is no tradeoff.
If somehow each model needed to be pretrained for a specific inference setup with specific inference costs, and for it alone, then there could have been a tradeoff, but there is no such correspondence. The same model that’s used in a complicated, costly, inference-heavy technique can also be used for the cheapest inference its number of active parameters allows. If progress slows down in a few years and it becomes technologically feasible to do pretraining runs that cost over $50bn, it will make sense to consider the shape of the resulting equilibrium and the largest scale of pretraining it endorses, but that’s a very different world.
Thanks, this is a really good find!