Questions for people who know more:
Am I understanding right that inference-time compute scaling is useful for coding, math, and other things that are machine-checkable, but not for writing, basic science, and other things that aren’t machine-checkable? Will it ever have implications for these things?
Am I understanding right that this is all just clever ways of having it come up with many different answers or subanswers or preanswers, then picking the good ones to expand upon? Why should this be good for eg proving difficult math theorems, where many humans using many different approaches have failed, so it doesn’t seem like it’s as simple as trying a hundred times, or even trying using a hundred different strategies?
What do people mean when they say that o1 and o3 have “opened up new scaling laws” and that inference-time compute will be really exciting? Doesn’t “scaling inference compute” just mean “spending more money and waiting longer on each prompt”? Why do we expect this to scale? Does inference compute scaling mean that o3 will use ten supercomputers for one hour per prompt, o4 will use a hundred supercomputers for ten hours per prompt, and o5 will use a thousand supercomputers for a hundred hours per prompt? Since they already have all the supercomputers (for training scaling) why does it take time and progress to get to the higher inference-compute levels? What is o3 doing that you couldn’t do by running o1 on more computers for longer?
The basic guess regarding how o3’s training loop works is that it generates a bunch of chains of thought (or, rather, a branching tree), then uses some learned meta-heuristic to pick the best chain of thought and output it.
As part of that, it also learns a meta-heuristic for which chains of thought to generate to begin with. (I. e., it continually makes judgement calls regarding which trains of thought to pursue, rather than e. g. generating all combinatorially possible combinations of letters.)
It would indeed work best in domains that allow machine verification, because then there’s an easily computed ground-truth RL signal for training the meta-heuristic. Run each CoT through a proof verifier/an array of unit tests, then assign reward based on that. The learned meta-heuristics can then just internalize that machine verifier. (I. e., they’d basically copy the proof-verifier into the meta-heuristics. Then (a) once a spread of CoTs is generated, it can easily prune those that involve mathematically invalid steps, and (b) the LLM would become ever-more-unlikely to generate a CoT that involves mathematically invalid steps to begin with.)
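As a minimal sketch of what that machine-checkable reward could look like for programming tasks (assuming each CoT ends in a candidate Python function; the `solve` name and the test format here are made up for illustration):

```python
# A minimal sketch of a machine-checkable reward for programming tasks.
# Assumes each chain of thought ends in candidate Python source defining a
# function `solve`; the reward is the fraction of unit tests it passes.
# (Illustrative only: real pipelines sandbox the execution.)

def reward_from_tests(candidate_source: str, test_cases: list[tuple]) -> float:
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # run the candidate solution
        solve = namespace["solve"]
    except Exception:
        return 0.0  # unparseable or broken code gets zero reward
    passed = 0
    for args, expected in test_cases:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures
    return passed / len(test_cases)

# Score a spread of sampled CoTs and keep the best one.
candidates = ["def solve(x): return x * 2", "def solve(x): return x + 2"]
tests = [((3,), 6), ((0,), 0)]
best = max(candidates, key=lambda c: reward_from_tests(c, tests))
```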
However, arguably, the capability gains could transfer to domains outside math/programming.
There are two main possibilities here:
You can jury-rig “machine verification” for “soft domains” by having an LLM inspect the spread of ideas it generated (e. g., 100 business plans), then pick the best one, using the LLM’s learned intuition as the reward function. (See e. g. how Constitutional AI works, compared to RLHF; a minimal sketch of this is shown just after this list.)
You can hope that the meta-heuristics, after being trained on math/programming, learn some general-purpose “taste”, an ability to tell which CoTs are better or worse, in a way that automatically generalizes to “soft” domains (perhaps with some additional fine-tuning using the previous idea).
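To make the first possibility concrete, here is a minimal sketch of best-of-N selection with the LLM grading its own candidates; the `llm` callable and the prompts are hypothetical stand-ins rather than any real API:

```python
# Sketch of best-of-N selection in a "soft" domain, using the model's own
# judgement as the reward signal (Constitutional-AI-style self-grading).
# `llm(prompt) -> str` is a hypothetical completion function, not a real API.

def best_of_n(llm, task: str, n: int = 8) -> str:
    candidates = [llm(f"Task: {task}\nWrite one candidate answer.") for _ in range(n)]
    scores = []
    for cand in candidates:
        verdict = llm(
            f"Task: {task}\nCandidate answer:\n{cand}\n"
            "Rate this answer from 1 (poor) to 10 (excellent). Reply with a number only."
        )
        try:
            scores.append(float(verdict.strip()))
        except ValueError:
            scores.append(0.0)  # treat unparseable verdicts as worst-case
    return candidates[scores.index(max(scores))]
```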
That said, empirically, if we compare o1-full to Claude Sonnet 3.5.1, it doesn’t seem that the former dominates the latter in “soft” domains as dramatically as it does at math. So the transfer, if it happens at all, isn’t everything AI researchers could hope for.
Also, there’s another subtle point here:
o1’s public version doesn’t seem to actually generate trees of thought in response to user queries and then prune them. It just deterministically picks the best train of thought to pursue, as judged by the learned meta-heuristic (the part of it that guides which trees to generate; see the previous point regarding how it doesn’t just generate all possible combinations of letters, but makes judgement calls about that as well).
By contrast, o3 definitely generates that tree (else it couldn’t have spent thousands-of-dollars’ worth of compute on individual tasks, due to the context-window limitations).
Am I understanding right that this is all just clever ways of having it come up with many different answers or subanswers or preanswers, then picking the good ones to expand upon?
The best guess based on the publicly available information is that yes, this is the case.
Why should this be good for eg proving difficult math theorems, where many humans using many different approaches have failed, so it doesn’t seem like it’s as simple as trying a hundred times, or even trying using a hundred different strategies?
Which strategies you’re trying matters. It indeed wouldn’t do much good if you just picked completely random steps/generated totally random messages. But if you’ve trained some heuristic for picking the best-seeming strategies from the strategy space, and this heuristic has superhuman research taste...
What do people mean when they say that o1 and o3 have “opened up new scaling laws” and that inference-time compute will be really exciting?
That, for a given LLM steered by a given meta-heuristic, performance on benchmarks steadily improves with the length of the CoTs / the breadth of the ToTs generated.
Why do we expect this to scale?
Straight lines on graphs go brr? Same as with the pre-training laws. We see a simple pattern, we assume it extrapolates.
What is o3 doing that you couldn’t do by running o1 on more computers for longer?
I’m not sure. It’s possible that a given meta-heuristic can only keep the LLM on-track for a fixed length of CoT / for a fixed breadth of ToT. You would then need to learn how to train better meta-heuristics to squeeze out more performance.
A possible explanation is that you need “more refined” tastes to pick between a broader range of CoTs. E. g., suppose that the quality of CoTs is on a 0-100 scale. Suppose you’ve generated a spread of CoTs, and the top 5 of them have the “ground-truth quality” of 99.6, 99.4, 98, 97, 96. Suppose your meta-heuristic is of the form Q + e, where Q is the ground-truth quality and e is some approximation-error term. If e is on the order of 0.5, then the model can’t distinguish between the top 2 guesses, and picks one at random. If e is on the order of 0.05, however, it reliably picks the best guess of those five. This can scale: then, depending on how “coarse” your model’s tastes are, it can pick out the best guess among 10^4, 10^5 guesses, etc.
(I. e., then the inference-time scaling law isn’t just “train any good-enough meta-heuristic and pour compute”, it’s “we can train increasingly better meta-heuristics, and the more compute they can usefully consume at inference-time, the better the performance”.)
(Also: notably, the issue with the transfer-of-performance might be that how “refined” the meta-heuristic’s taste is depends on the domain. E. g., for math, the error term might be 10^-5, for programming 10^-4, and for “soft” domains, 10^-1.)
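If it helps to see the Q + e story quantitatively, here is a toy Monte Carlo (all the distributions are made up for illustration): with a coarse error term, extra guesses stop helping; with a fine one, the best-of-N pick keeps improving as N grows.

```python
# Toy simulation of the Q + e selection model: the "pruning" heuristic sees
# noisy scores Q + e and picks the argmax; we track the true quality of its pick.
import random

def avg_quality_of_pick(n_guesses: int, noise: float, trials: int = 500) -> float:
    total = 0.0
    for _ in range(trials):
        qualities = [random.gauss(90, 5) for _ in range(n_guesses)]  # ground-truth Q
        scores = [q + random.gauss(0, noise) for q in qualities]     # what the heuristic sees
        total += qualities[scores.index(max(scores))]
    return total / trials

for noise in (5.0, 0.5, 0.05):
    picks = [round(avg_quality_of_pick(n, noise), 2) for n in (10, 100, 1000)]
    print(f"error stddev {noise}: avg quality of pick for N=10, 100, 1000 -> {picks}")
# Coarse taste (large noise): extra guesses stop helping.
# Refined taste (small noise): the best-of-N pick keeps improving as N grows.
```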
Does inference compute scaling mean that o3 will use ten supercomputers for one hour per prompt, o4 will use a hundred supercomputers for ten hours per prompt, and o5 will use a thousand supercomputers for a hundred hours per prompt?
Not necessarily. The strength of the LLM being steered, and the quality of the meta-heuristics doing the steering, matters. GPT-5 could plausibly outperform o3-full for much less inference-time compute, by needing shorter CoTs. “o3.5”, using the same LLM but equipped with a better-trained meta-level heuristic, could likewise outperform o3 by having better judgement regarding which trains of thought to pursue (roughly, the best guess of o3.5 among 10 guesses would be as good as the best guess of o3 among 100 guesses).
And then if my guess regarding different meta-heuristics only being able to make use of no more than a fixed quantity of compute is right, then yes, o[3+n] models would also be able to usefully consume more raw compute.
Edit: I. e., there are basically three variables at play here:
How many guesses it needs to find a guess with a ground-truth quality above some threshold. (How refined the “steering” meta-heuristic is. What is the ground-truth quality of the best guess in 100 guesses it generated? How much is the probability distribution over guesses skewed towards the high-quality guesses?)
How refined the tastes of the “pruning” meta-heuristic are. (I. e., the size of the error e in the toy Q + e model above. Mediates the number of guesses among which it can pick the actual best one, assuming that they’re drawn from a fixed distribution.)
How long the high-quality CoTs are. (E. g., recall how much useless work/backtracking o1’s publicly shown CoTs seem to do, and how much more efficient it’d be if the base LLM were smart enough to just instantly output the correct answer, on pure instinct.)
Improving on (1) and (3) increases the efficiency of the compute used by the models. Improving on (2) lets models productively use more compute.
And notably, capabilities could grow either from improving on (2), in a straightforward manner, or from improving (1) and (3). For example, suppose that there’s a “taste overhang”, in that o3’s tastes are refined enough to reliably pick the best guess out of 10^9 guesses (drawn from a fixed distribution), but it is only economical to let it generate 10^5 guesses. Then improving on (1) (skewing the distribution towards the high-quality guesses) and (3) (making guesses cheaper) would not only reduce the costs, but also increase the quality of the ultimately-picked guesses.
(My intuition is that there’s no taste overhang, though; and also that the tastes indeed get increasingly less refined the farther you move from the machine-verifiable domains.)
This is good speculation, but I don’t think you need to speculate so much. Papers and replication attempts can provide lots of empirical data points from which to speculate.
You should check out some of the related papers:
H4 (Hugging Face) uses a process-supervision reward model, with MCTS, in an attempt to replicate o1
DeepSeek uses R1 to train DeepSeek v3
Overall, I see people using process supervision to make a reward model that is one step better than the SoTA. Then they apply TTC (test-time compute) to the reward model, while using it to train/distil a cheaper model. The TTC expense is a one-off cost, since it’s used to distil into a cheaper model.
There are some papers about the future of this trend:
Meta uses reasoning tokens to allow models to reason in a latent space (the hidden state, yuck). OpenAI insiders have said that o3 does not work like this, but o4 might. {I would hope they chose a much better latent space than the hidden state. Something interpretable, that’s not just designed to be de-embedded into output tokens.}
Meta throws out tokenisation in favour of grouping predictable bytes
I can see other methods used here instead of process supervision. Process supervision extracts additional supervision from easy-to-verify domains. But diffusion does something very similar for domains where we can apply noise, like code:
CodeFusion shows diffusion with code
Meta has an LLM+diffusion paper, and so does Apple
Some older background papers might be useful for reference:
[OpenAI’s process supervision paper](https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/), “Let’s Verify Step by Step”
DeepMind’s TTC scaling laws
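For anyone who hasn’t read the process-supervision work mentioned above: the core idea is to score each intermediate step of a CoT rather than only the final answer. A minimal sketch, with a hypothetical `step_scorer` standing in for the learned process reward model:

```python
# Sketch of process supervision: score every intermediate step of a chain of
# thought and aggregate, instead of scoring only the final answer.
# `step_scorer(step) -> float in [0, 1]` stands in for a learned process reward model.

def process_reward(cot_steps: list[str], step_scorer) -> float:
    # One common aggregation is the minimum step score: a single bad step sinks the chain.
    return min(step_scorer(step) for step in cot_steps)

def pick_best_cot(candidate_cots: list[list[str]], step_scorer) -> list[str]:
    return max(candidate_cots, key=lambda cot: process_reward(cot, step_scorer))
```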
However, arguably, the capability gains could transfer to domains outside math/programming.
It’s more than arguable: we can look at the o3 announcement, where IIRC it shows around 30% of the gain on non-code benchmarks. Less, but still substantial.
P.S. I think it’s worth noting that Meta has some amazing papers here, but they are also the most open source lab. It seems likely that other labs are also sitting on capabilities advancements that they do not allow researchers to publish.
P.P.S. I also liked the alignment paper that came out with o3, since applying RLHF at multiple stages, and with process supervision, seems useful. Its alignment seems to generalise better OOD (Table 3). It also gives some clues about how o3 works, giving examples of CoT data.
@Scott Alexander, correction to the above: there are rumors that, like o1, o3 doesn’t generate runtime trees of thought either, and that they spent thousands-of-dollars’ worth of compute on single tasks by (1) having it generate a thousand separate CoTs, (2) outputting the answer the model produced most frequently. I. e., the “pruning meta-heuristic” I speculated about might just be the (manually-implemented) majority vote.
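If that rumor is right, the aggregation step is just self-consistency-style majority voting over final answers, something like this sketch (`sample_answer` is a hypothetical function that samples one CoT and extracts only its final answer):

```python
# Sketch of majority voting ("self-consistency") over sampled chains of thought.
# `sample_answer(task) -> str` is a hypothetical function that samples one CoT
# and extracts only its final answer.
from collections import Counter

def majority_vote(sample_answer, task: str, n: int = 1024) -> str:
    answers = [sample_answer(task) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # the most frequent final answer wins
```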
I think the guy in the quotes might be misinterpreting OpenAI researchers’ statements, but it’s possible.
In which case:
We have to slightly reinterpret the reason for having the model try a thousand times. Rather than outputting the correct answer if at least one try is correct, it outputs the correct answer if, in N tries, it produces the correct answer more frequently than incorrect ones. The fact that they had to set N = 1024 for best performance on ARC-AGI still suggests there’s a large amount of brute-forcing involved.
Since it implies that, at N = 100, the correct answer isn’t yet more frequent than the incorrect ones. So on the problems which o3 got wrong in the N = 6 regime but got right in the N = 1024 regime, the probability of any given CoT producing the correct answer is quite low.
This has similar implications for the FrontierMath performance, if the interpretation of the dark-blue vs. light-blue bars is that dark-blue is for N = 1 or N = 6, and light-blue is for N = bignumber.
We have to throw out everything about the “pruning” meta-heuristics; only the “steering” meta-heuristics exist. In this case, the transfer-of-performance problem would be that the “steering” heuristics only become better for math/programming; that RL only skews the distribution over CoTs towards the high-quality ones for problems in those domains. (The metaphorical “taste” then still exists, but only within CoTs.)
(I now somewhat regret introducing the “steering vs. pruning meta-heuristic” terminology.)
Again, I think this isn’t really confirmed, but I can very much see it.
Jumping in late just to say one thing very directly: I believe you are correct to be skeptical of the framing that inference compute introduces a “new scaling law”. Yes, we now have two ways of using more compute to get better performance – at training time or at inference time. But (as you’re presumably thinking) training compute can be amortized across all occasions when the model is used, while inference compute cannot, which means it won’t be worthwhile to go very far down the road of scaling inference compute.
We will continue to increase inference compute, for problems that are difficult enough to call for it, and more so as efficiency gains reduce the cost. But given the log-linear nature of the scaling law, and the inability to amortize, I don’t think we’ll see the many-order-of-magnitude journey that we’ve seen for training compute.
As others have said, what we should presumably expect from o4, o5, etc. is that they’ll make better use of a given amount of compute (and/or be able to throw compute at a broader range of problems), not that they’ll primarily be about pushing farther up that log-linear graph.
Of course in the domain of natural intelligence, it is sometimes worth having a person go off and spend a full day on a problem, or even have a large team spend several years on a high-level problem. In other words, to spend lots of inference-time compute on a single high-level task. I have not tried to wrap my head around how that relates to scaling of inference-time compute. Is the relationship between the performance of a team on a task, and the number of person-days the team has to spend, log-linear???
But (as you’re presumably thinking) training compute can be amortized across all occasions when the model is used, while inference compute cannot, which means it won’t be worthwhile to go very far down the road of scaling inference compute.
Inference compute is amortized across future inference when trained upon, and the three-way scaling-law exchange rates between training compute vs. runtime compute vs. model size are critical. See AlphaZero for a good example.
As always, if you can read only one thing about inference scaling, make it “Scaling Scaling Laws with Board Games”, Jones 2021.
Inference compute is amortized across future inference when trained upon
And it’s not just a sensible theory. This has already happened, in Hugging Face’s attempted replication of o1, where the reward model was larger, used TTC, and was trained with process supervision, while the smaller main model had none of those expensive properties.
And also at DeepSeek, where the expensive TTC model (R1) was used to train a cheaper conventional LLM (DeepSeek v3).
One way to frame it: test-time compute is actually label-search compute. You are searching for better labels/rewards, and then training on them. Repeat as needed. This is obviously easier if you know what “better” means.
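A minimal sketch of that label-search framing, with every function (`expensive_ttc_solve`, `verify`, the later fine-tuning step) a hypothetical stand-in: spend compute searching for verified solutions, then distil them into the cheap model.

```python
# Sketch of "test-time compute as label-search compute": spend compute searching
# for verified (problem, solution) labels, then distil them into a cheaper model.
# `expensive_ttc_solve`, `verify`, and the fine-tuning step are hypothetical stand-ins.

def build_distillation_set(problems, expensive_ttc_solve, verify, n_samples: int = 64):
    dataset = []
    for problem in problems:
        candidates = [expensive_ttc_solve(problem) for _ in range(n_samples)]
        good = [c for c in candidates if verify(problem, c)]  # keep only verified solutions
        if good:
            dataset.append((problem, good[0]))  # one verified label per problem
    return dataset

# cheap_model = finetune(cheap_model, build_distillation_set(problems, solver, verify))
# ...repeat as needed, with the distilled model as the new starting point.
```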
Test time compute is applied to solving a particular problem, so it’s very worthwhile to scale, getting better and better at solving an extremely hard problem by spending compute on this problem specifically. For some problems, no amount of pretraining with only modest test-time compute would be able to match an effort that starts with the problem and proceeds from there with a serious compute budget.
Yes, test time compute can be worthwhile to scale. My argument is that it is less worthwhile than scaling training compute. We should expect to see scaling of test time compute, but (I suggest) we shouldn’t expect this scaling to go as far as it has for training compute, and we should expect it to be employed sparingly.
The main reason I think this is worth bringing up is that people have been talking about test-time compute as “the new scaling law”, with the implication that it will pick up right where scaling of training compute left off, just keep turning the dial and you’ll keep getting better results. I think the idea that there is no wall, everything is going to continue just as it was except now the compute scaling happens on the inference side, is exaggerated.
There are many things that can’t be done at all right now. Some of them can become possible through scaling, and it’s unclear if it’s scaling of pretraining or scaling of test-time compute that gets them first, at any price, because scaling is not just amount of resources, but also the tech being ready to apply them. In this sense there is some equivalence.
Am I understanding right that inference-time compute scaling is useful for coding, math, and other things that are machine-checkable, but not for writing, basic science, and other things that aren’t machine-checkable?
I think it would be very surprising if it wasn’t useful at all—a human who spends time rewriting and revising their essay is making it better by spending more compute. When I do creative writing with LLMs, their outputs seem to be improved if we spend some time brainstorming the details of the content beforehand, with them then being able to tap into the details we’ve been thinking about.
It’s certainly going to be harder to train without machine-checkable criteria. But I’d be surprised if it was impossible: you can always do things like training a model to predict how much a human rater would like literary outputs, and gradually improving the rater models. Probably people are focusing on things like programming first, both because it’s easier and because there’s money in it.
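As a sketch of what “training a model to predict how much a human rater would like the output” usually looks like in practice, here is the standard pairwise-preference (Bradley-Terry) loss; the linear “reward model” over text embeddings is just a placeholder, not any lab’s actual setup.

```python
# Sketch of training a rater/reward model from human preferences between two
# outputs: push the score of the preferred output above the rejected one
# (the standard Bradley-Terry pairwise loss).
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_emb: torch.Tensor, rejected_emb: torch.Tensor) -> torch.Tensor:
    r_chosen = reward_model(chosen_emb)      # scalar score for the output the rater preferred
    r_rejected = reward_model(rejected_emb)  # scalar score for the other output
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with a linear "reward model" over precomputed text embeddings:
reward_model = torch.nn.Linear(768, 1)
loss = preference_loss(reward_model, torch.randn(4, 768), torch.randn(4, 768))
loss.backward()
```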
What is o3 doing that you couldn’t do by running o1 on more computers for longer?
Unclear, but at the $20-per-test setting on ARC-AGI it only uses 6 reasoning traces and still gets much better results than o1, so it’s not just about throwing $4,000 at the problem. Possibly it’s based on GPT-4.5 or trained on more tests.
What do people mean when they say that o1 and o3 have “opened up new scaling laws” and that inference-time compute will be really exciting?
The standard scaling law people talk about is for pretraining, shown in the Kaplan and Hoffmann (Chinchilla) papers.
It has also been the case that various post-training (i.e., fine-tuning) techniques improve performance (though I don’t think there is as clean a scaling law there; I’m unsure). See, e.g., this paper, which I just found via googling “fine-tuning scaling laws”. See also the Tülu 3 paper, Figure 4.
We have also already seen scaling law-type trends for inference compute, e.g., this paper:
The o1 blog post points out that they are observing two scaling trends: predictable scaling w.r.t. post-training (RL) compute, and predictable scaling w.r.t. inference compute, shown in the post’s two plots.
The paragraph before those plots says: “We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.” That is, the left plot is about post-training compute.
Following from the plot on the left, the o1 paradigm gives us models that are better for a fixed inference-compute budget (which is basically what it means to train a model for longer, or to train a better model of the same size using better algorithms; the method is new but not the trend). Following from the plot on the right, performance seems to scale well with the inference-compute budget. I’m not sure there’s sufficient public data to compare that right-hand plot against other inference-compute scaling methods, but my guess is the returns are better.
What is o3 doing that you couldn’t do by running o1 on more computers for longer?
I mean, if you replace “o1” in this sentence with “monkeys typing Shakespeare with ground-truth verification,” it’s true, right? But o3 is actually a smarter mind in some sense, so it takes [presumably much] less inference compute to get similar performance. For instance, see the o3-mini graph in the announcement: the performance-per-dollar frontier is pushed up by the o3-mini models.
It would be somewhat interesting to know how much it would cost for o1 to reach o3 performance here, but my guess is it’s a huge amount and practically impossible. That is, there are some performance levels that are practically unobtainable for o1, the same way the monkeys won’t actually complete Shakespeare.
Hope that clears things up some!
Maybe a dumb question, but those log-scale graphs have uneven ticks on the x-axis. Is there a reason they structured it like that beyond trying to draw a straight line? I suspect there is a good reason and it’s not dishonesty, but this does look like something one would do if you wanted to exaggerate the slope.
I believe this is standard/acceptable for presenting log-axis data, but I’m not sure; the graphs in the Kaplan paper do the same thing, for example.
It is certainly frustrating that they don’t label the x-axis. Here’s a quick conversation where I asked GPT-4o to explain. You are correct that a quick look at this graph (if you don’t notice the log scale) would imply (highly surprising and very strong) linear scaling trends. Scaling laws are generally very sub-linear, in particular often following a power law. I don’t think they were trying to mislead; this is a domain where log-scaling axes is super common and doesn’t invalidate the results in any way.
Ah wait, I was reading it wrong. I thought each tick was an order of magnitude; that looks to be standard notation for a log scale. Mischief managed.
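For reference, the effect is easy to reproduce: a logarithmic (strongly sub-linear) relationship comes out as a straight line once the x-axis is log-scaled, which is what those compute plots are doing. A toy example with made-up numbers:

```python
# A logarithmic (strongly sub-linear) relationship comes out as a straight line
# once the x-axis is log-scaled, which is what those compute plots are doing.
import numpy as np
import matplotlib.pyplot as plt

compute = np.logspace(0, 5, 50)          # 1 to 100,000 units of compute
accuracy = 10 * np.log10(compute) + 20   # toy sub-linear "scaling law"

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(8, 3))
ax_lin.plot(compute, accuracy)
ax_lin.set_title("linear x-axis: visibly sub-linear")
ax_log.plot(compute, accuracy)
ax_log.set_xscale("log")
ax_log.set_title("log x-axis: straight line")
plt.show()
```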
I do not have a gauge for how much I’m actually bringing to this convo, so you should weigh my opinion lightly, however:
I believe your third point kinda nails it. There are models for gains from collective intelligence (groups of agents collaborating) and the benefits of collaboration bottleneck hard on your ability to verify which outputs from the collective are the best, and even then the dropoff happens pretty quick the more agents collaborate.
10 people collaborating with no communication issues and accurate discrimination between good and bad ideas are better than a lone person on some tasks, 100 even more so.
You do not see jumps like that moving from 1,000 to 1,000,000 unless you set unrealistic variables.
I think inference time probably works in a similar way: dependent on discrimination between right and wrong answers and steeply falling off as inference time increases
My understanding is that o3 is similar to o1, but probably with some specialization to make long chains of thought stay coherent? The cost per token from leaks I’ve seen is the same as o1’s; it came out very quickly after o1, and o1 was bizarrely better at math and coding than 4o.
Apologies if this was no help, responding with the best intentions