Jumping in late just to say one thing very directly: I believe you are correct to be skeptical of the framing that inference compute introduces a “new scaling law”. Yes, we now have two ways of using more compute to get better performance – at training time or at inference time. But (as you’re presumably thinking) training compute can be amortized across all occasions when the model is used, while inference compute cannot, which means it won’t be worthwhile to go very far down the road of scaling inference compute.
We will continue to increase inference compute, for problems that are difficult enough to call for it, and more so as efficiency gains reduce the cost. But given the log-linear nature of the scaling law, and the inability to amortize, I don’t think we’ll see the many-order-of-magnitude journey that we’ve seen for training compute.
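To make the amortization point concrete, here is a toy back-of-envelope comparison (a minimal sketch: the training cost, per-query cost, and query count are invented, and it assumes a 10x of compute on either side buys a comparable quality bump, which is itself a strong assumption):

```python
# Toy comparison of where an extra 10x of compute is cheaper to spend:
# once at training time (amortized over all queries) or on every query at
# inference time. All numbers are illustrative, not from any real model.

TRAIN_COST = 1e7    # hypothetical cost of the baseline training run, in dollars
INFER_COST = 0.01   # hypothetical cost of one baseline query, in dollars
QUERIES = 1e10      # queries served over the model's lifetime

# Option A: 10x the training compute once; per-query cost is unchanged.
cost_a = 10 * TRAIN_COST + QUERIES * INFER_COST

# Option B: keep the baseline training run, but spend 10x compute on every query.
cost_b = TRAIN_COST + QUERIES * 10 * INFER_COST

print(f"10x training compute, amortized: ${cost_a:,.0f}")   # $200,000,000
print(f"10x inference compute, per query: ${cost_b:,.0f}")  # $1,010,000,000
# With these made-up numbers the amortized option is roughly 5x cheaper,
# and the gap widens as QUERIES grows.
```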
As others have said, what we should presumably expect from o4, o5, etc. is that they’ll make better use of a given amount of compute (and/or be able to throw compute at a broader range of problems), not that they’ll primarily be about pushing farther up that log-linear graph.
Of course in the domain of natural intelligence, it is sometimes worth having a person go off and spend a full day on a problem, or even having a large team spend several years on a high-level problem. In other words, it is sometimes worth spending lots of inference-time compute on a single high-level task. I have not tried to wrap my head around how that relates to scaling of inference-time compute. Is the relationship between the performance of a team on a task, and the number of person-days the team has to spend, log-linear???
But (as you’re presumably thinking) training compute can be amortized across all occasions when the model is used, while inference compute cannot, which means it won’t be worthwhile to go very far down the road of scaling inference compute.
Inference compute is amortized across future inference when trained upon, and the three-way exchange rates in the scaling laws between training compute, runtime compute, and model size are critical. See AlphaZero for a good example.
As always, if you can read only 1 thing about inference scaling, make it “Scaling Scaling Laws with Board Games”, Jones 2021.
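For a sense of what such an exchange rate looks like, here is a minimal toy model in the spirit of Jones 2021, which found performance (Elo) roughly linear in the log of both training compute and test-time compute; the coefficients below are made up for illustration, not fitted to anything:

```python
import math

# Toy model in the spirit of Jones 2021: performance is roughly linear in the
# log of training compute and of test-time compute, so a fixed performance
# target defines an exchange rate between the two. Coefficients are invented.

B_TRAIN = 300.0  # hypothetical Elo gained per 10x of training compute
B_TEST = 150.0   # hypothetical Elo gained per 10x of test-time search

def elo(train_flops: float, test_flops: float) -> float:
    """Toy performance model: linear in the log of each compute budget."""
    return B_TRAIN * math.log10(train_flops) + B_TEST * math.log10(test_flops)

def extra_train_to_offset_less_test(test_cut_factor: float) -> float:
    """Multiplier on training compute that keeps toy Elo constant when
    test-time compute is cut by `test_cut_factor`."""
    # Solve B_TRAIN * log10(k_train) = B_TEST * log10(test_cut_factor).
    return 10 ** (B_TEST / B_TRAIN * math.log10(test_cut_factor))

print(elo(1e24, 1e12))                         # toy "Elo" for one split of the budget
print(extra_train_to_offset_less_test(10.0))   # ~3.16x more training compute offsets 10x less search
print(extra_train_to_offset_less_test(100.0))  # ~10x offsets 100x less search
```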
Inference compute is amortized across future inference when trained upon
And it’s not just a sensible theory. This has already happened, in Hugging Face’s attempted replication of o1, where the reward model was larger, used test-time compute (TTC), and had process supervision, while the smaller main model had none of those expensive properties.
And also with DeepSeek-V3, where the expensive TTC model (R1) was used to train the cheaper conventional LLM (DeepSeek-V3 itself).
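A minimal sketch of that pattern, with hypothetical stand-ins for the small generator and the larger reward model (this is not the actual Hugging Face or DeepSeek code, just the shape of the idea):

```python
import random

# A small, cheap generator proposes many candidates; a larger reward model
# (the expensive part) scores them and picks the winner. `small_generate`
# and `big_reward` are hypothetical placeholders.

def small_generate(prompt: str, n: int) -> list[str]:
    """Placeholder for N samples from a small policy model."""
    return [f"{prompt} -> candidate {i}" for i in range(n)]

def big_reward(prompt: str, candidate: str) -> float:
    """Placeholder for a larger (process) reward model's score."""
    return random.random()  # a real PRM would score the reasoning steps here

def best_of_n(prompt: str, n: int = 16) -> str:
    """Spend test-time compute as search: generate N, keep the highest-scoring one."""
    candidates = small_generate(prompt, n)
    return max(candidates, key=lambda c: big_reward(prompt, c))

print(best_of_n("Prove that 7 is prime.", n=8))
```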
One way to frame it is that test-time compute is actually label-search compute: you are searching for better labels/rewards, and then training on them. Repeat as needed. This is obviously easier if you know what “better” means.
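A rough sketch of that loop, assuming placeholder functions for the search, reward, and fine-tuning steps; none of this is any lab's actual pipeline:

```python
import random

# "Test-time compute as label-search compute": spend heavy search compute to
# find high-reward outputs, train the cheap model on them, and repeat.
# Everything below is a hypothetical stand-in.

def search_with_heavy_compute(model, prompt: str, n: int) -> list[str]:
    """Placeholder for the expensive step: many samples, long chains of thought, or tree search."""
    return [f"{prompt} :: attempt {i}" for i in range(n)]

def reward(prompt: str, candidate: str) -> float:
    """Placeholder verifier/reward model; 'better' must be measurable for the loop to work."""
    return random.random()

def finetune(model, labeled: list[tuple[str, str]]):
    """Placeholder for training on the searched-for labels (the amortization step)."""
    return model  # a real implementation would update the weights here

def search_then_train(model, prompts: list[str], rounds: int = 3,
                      n_samples: int = 64, keep_top: float = 0.1):
    for _ in range(rounds):
        labeled = []
        for prompt in prompts:
            candidates = search_with_heavy_compute(model, prompt, n_samples)
            scored = sorted(candidates, key=lambda c: reward(prompt, c), reverse=True)
            keep = max(1, int(keep_top * len(scored)))
            labeled.extend((prompt, c) for c in scored[:keep])
        model = finetune(model, labeled)  # the search compute is now baked into the weights
    return model

toy_model = object()  # stand-in for a small model
toy_model = search_then_train(toy_model, ["Prove that 7 is prime."], rounds=1, n_samples=8)
```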
Test-time compute is applied to solving a particular problem, so it is very worthwhile to scale: you get better and better at solving an extremely hard problem by spending compute on that problem specifically. For some problems, no amount of pretraining with only modest test-time compute would be able to match an effort that starts with the problem and proceeds from there with a serious compute budget.
Yes, test-time compute can be worthwhile to scale. My argument is that it is less worthwhile than scaling training compute. We should expect to see scaling of test-time compute, but (I suggest) we shouldn’t expect this scaling to go as far as it has for training compute, and we should expect it to be employed sparingly.
The main reason I think this is worth bringing up is that people have been talking about test-time compute as “the new scaling law”, with the implication that it will pick up right where scaling of training compute left off: just keep turning the dial and you’ll keep getting better results. I think the idea that there is no wall, that everything is going to continue just as it was except the compute scaling now happens on the inference side, is exaggerated.
There are many things that can’t be done at all right now. Some of them can become possible through scaling, and it’s unclear whether it’s scaling of pretraining or scaling of test-time compute that gets them first, at any price, because scaling is not just the amount of resources but also the tech being ready to apply them. In this sense there is some equivalence.