There are open-weights Llama 3 models; using them doesn’t involve paying for pretraining. The compute used in frontier models is determined by the size of the largest cluster of the latest AI accelerators that hyperscaler money can buy, subject to the time it takes engineers to get used to the next level of scale, not by any tradeoff with the cost of inference. Currently that’s about 100K H100s. This is the sense in which there is no tradeoff.
If each model somehow needed to be pretrained for one specific inference setup, with its specific inference costs, and for that setup alone, there could have been a tradeoff, but there is no such correspondence. The same model that is used in a complicated, costly, inference-heavy technique can also be used for the cheapest inference its number of active parameters allows. If progress slows down in a few years and it becomes technologically feasible to do pretraining runs that cost over $50bn, it will make sense to consider the shape of the resulting equilibrium and the largest scale of pretraining it endorses, but that’s a very different world.
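To make the scales concrete, here is a back-of-envelope sketch of what a run on a 100K-H100 cluster amounts to. All specific numbers (per-GPU throughput, utilization, run length, $/GPU-hour) are illustrative assumptions, not figures from the text:

```python
# Back-of-envelope estimate of frontier pretraining compute and cost.
# Every constant below is an assumption chosen for illustration.

H100_PEAK_FLOPS = 0.989e15   # assumed dense BF16 peak, ~989 TFLOP/s per GPU
NUM_GPUS = 100_000           # "about 100K H100s"
MFU = 0.4                    # assumed model FLOPs utilization
RUN_DAYS = 100               # assumed length of the pretraining run
COST_PER_GPU_HOUR = 2.0      # assumed all-in dollar cost per GPU-hour

total_flops = NUM_GPUS * H100_PEAK_FLOPS * MFU * RUN_DAYS * 86_400
total_cost = NUM_GPUS * COST_PER_GPU_HOUR * RUN_DAYS * 24

print(f"compute: {total_flops:.1e} FLOPs")   # ~3.4e26
print(f"cost: ${total_cost / 1e9:.2f}B")     # ~$0.48B
```

Under these assumptions a current frontier run lands on the order of hundreds of millions of dollars, roughly two orders of magnitude below the hypothetical $50bn runs mentioned above, which is why that regime is described as a very different world.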