Questions for people who know more:

Am I understanding right that inference-time compute scaling is useful for coding, math, and other things that are machine-checkable, but not for writing, basic science, and other things that aren't machine-checkable? Will it ever have implications for these things?
Am I understanding right that this is all just clever ways of having it come up with many different answers or subanswers or preanswers, then picking the good ones to expand upon? Why should this be good for eg proving difficult math theorems, where many humans using many different approaches have failed, so it doesn’t seem like it’s as simple as trying a hundred times, or even trying using a hundred different strategies?
What do people mean when they say that o1 and o3 have “opened up new scaling laws” and that inference-time compute will be really exciting? Doesn’t “scaling inference compute” just mean “spending more money and waiting longer on each prompt”? Why do we expect this to scale? Does inference compute scaling mean that o3 will use ten supercomputers for one hour per prompt, o4 will use a hundred supercomputers for ten hours per prompt, and o5 will use a thousand supercomputers for a hundred hours per prompt? Since they already have all the supercomputers (for training scaling) why does it take time and progress to get to the higher inference-compute levels? What is o3 doing that you couldn’t do by running o1 on more computers for longer?
The basic guess regarding how o3's training loop works is that it generates a bunch of chains of thought (or, rather, a branching tree), then uses some learned meta-heuristic to pick the best chain of thought and output it.
As part of that, it also learns a meta-heuristic for which chains of thought to generate to begin with. (I. e., it continually makes judgement calls regarding which trains of thought to pursue, rather than e. g. generating all combinatorially possible combinations of letters.)
It would indeed work best in domains that allow machine verification, because then there’s an easily computed ground-truth RL signal for training the meta-heuristic. Run each CoT through a proof verifier/an array of unit tests, then assign reward based on that. The learned meta-heuristics can then just internalize that machine verifier. (I. e., they’d basically copy the proof-verifier into the meta-heuristics. Then (a) once a spread of CoTs is generated, it can easily prune those that involve mathematically invalid steps, and (b) the LLM would become ever-more-unlikely to generate a CoT that involves mathematically invalid steps to begin with.)
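To make that concrete, here is a minimal best-of-N sketch in Python of the kind of loop being guessed at. Everything in it is a stand-in: llm_sample is whatever produces candidate CoTs, and machine_verifier is a toy proxy for a proof checker or unit-test harness, not a description of OpenAI's actual pipeline.

```python
# A toy best-of-N loop with a machine verifier as the reward signal.
# "llm_sample" is any function returning a candidate chain of thought;
# "machine_verifier" stands in for a proof checker / unit-test harness.

def machine_verifier(chain_of_thought, unit_tests):
    """Fraction of unit tests the candidate passes (1.0 = fully verified)."""
    results = [test(chain_of_thought) for test in unit_tests]
    return sum(results) / len(results)

def best_of_n(llm_sample, prompt, unit_tests, n=100):
    """Generate n candidate CoTs, score each with the verifier, return the best.
    In the hypothesized training loop, these rewards would also be used to
    train the meta-heuristic (and the generator) to prefer verifiable CoTs."""
    candidates = [llm_sample(prompt) for _ in range(n)]
    rewards = [machine_verifier(c, unit_tests) for c in candidates]
    best = max(range(n), key=lambda i: rewards[i])
    return candidates[best], rewards[best]

# Toy usage: the "LLM" emits arithmetic claims, the "unit test" checks them.
if __name__ == "__main__":
    import random
    def fake_llm(_prompt):
        a, b = random.randint(0, 9), random.randint(0, 9)
        return f"{a} + {b} = {random.choice([a + b, a + b + 1])}"
    def check(cot):
        lhs, rhs = cot.split("=")
        x, y = map(int, lhs.split("+"))
        return int(rhs) == x + y
    print(best_of_n(fake_llm, "add two numbers", [check], n=20))
```

The whole reason the machine-checkable setting is easy mode is that machine_verifier is cheap and exact; once you have that, the rest is ordinary RL bookkeeping.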
However, arguably, the capability gains could transfer to domains outside math/programming.
There are two main possibilities here:
You can jury-rig “machine verification” for “soft domains” by having an LLM inspect the spread of ideas it generated (e. g., 100 business plans), then pick the best one, using the LLM’s learned intuition as the reward function. (See e. g. how Constitutional AI works, compared to RLHF; a toy sketch of this follows the list below.)
You can hope that the meta-heuristics, after being trained on math/programming, learn some general-purpose “taste”, an ability to tell which CoTs are better or worse, in a way that automatically generalizes to “soft” domains (perhaps with some additional fine-tuning using the previous idea).
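As a sketch of the first possibility, here is the same best-of-n skeleton with the verifier swapped out for the LLM's own judgement. The rubric-and-grade format is an arbitrary choice for illustration, not how any lab necessarily does it.

```python
# Best-of-n where the "verifier" is the LLM's own learned intuition.
# "llm_complete" is any text-in/text-out call; the rubric-and-grade
# format below is invented purely for this sketch.

def llm_judge_score(llm_complete, candidate, rubric):
    """Ask the model to grade a candidate (e.g. a business plan) against a rubric."""
    reply = llm_complete(
        f"Rubric:\n{rubric}\n\nCandidate:\n{candidate}\n\n"
        "Rate the candidate from 0 to 100. Answer with only the number."
    )
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # an ungradeable reply is treated as a failed candidate

def pick_best_soft_domain(llm_sample, llm_complete, prompt, rubric, n=100):
    """Generate n candidates and keep the one the judge scores highest.
    The same scores could double as a (much noisier) RL reward signal."""
    candidates = [llm_sample(prompt) for _ in range(n)]
    scores = [llm_judge_score(llm_complete, c, rubric) for c in candidates]
    return max(zip(scores, candidates))[1]
```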
That said, empirically, if we compare o1-full to Claude Sonnet 3.5.1, it doesn’t seem that the former dominates the latter in “soft” domains as dramatically as it does at math. So the transfer, if it happens at all, isn’t everything AI researchers could hope for.
Also, there’s another subtle point here:
o1's public version doesn’t seem to actually generate trees of thought in response to user queries and then prune them. It just deterministically picks the best train of thought to pursue, as judged by the learned meta-heuristic (the part of it that’s guiding which trains of thought to generate; see the previous point regarding how it doesn’t just generate all possible combinations of letters, but makes judgement calls regarding that as well).
By contrast, o3 definitely generates that tree (else it couldn’t have spent thousands-of-dollars’ worth of compute on individual tasks, due to the context-window limitations).
Am I understanding right that this is all just clever ways of having it come up with many different answers or subanswers or preanswers, then picking the good ones to expand upon?
The best guess based on the publicly available information is that yes, this is the case.
Why should this be good for eg proving difficult math theorems, where many humans using many different approaches have failed, so it doesn’t seem like it’s as simple as trying a hundred times, or even trying using a hundred different strategies?
Which strategies you’re trying matters. It indeed wouldn’t do much good if you just picked completely random steps / generated totally random messages. But if you’ve trained some heuristic for picking the best-seeming strategies out of the strategy space, and this heuristic has superhuman research taste...
What do people mean when they say that o1 and o3 have “opened up new scaling laws” and that inference-time compute will be really exciting?
That, for a given LLM being steered by a given meta-heuristic, the performance on benchmarks steadily improves with the length of the CoTs / the breadth of the ToTs generated.
Why do we expect this to scale?
Straight lines on graphs go brr? Same as with the pre-training laws. We see a simple pattern, we assume it extrapolates.
What is o3 doing that you couldn’t do by running o1 on more computers for longer?
I’m not sure. It’s possible that a given meta-heuristic can only keep the LLM on-track for a fixed length of CoT / for a fixed breadth of ToT. You would then need to learn how to train better meta-heuristics to squeeze out more performance.
A possible explanation is that you need “more refined” tastes to pick between a broader range of CoTs. E. g., suppose that the quality of CoTs is on a 0-100 scale. Suppose you’ve generated a spread of CoTs, and the top 5 of them have the “ground-truth quality” of 99.6, 99.4, 98, 97, 96. Suppose your meta-heuristic is of the form Q + e, where Q is the ground-truth quality and e is some approximation-error term. If e is on the order of 0.5, then the model can’t distinguish between the top 2 guesses, and picks one at random. If e is on the order of 0.05, however, it reliably picks the best guess of those five. This can scale: then, depending on how “coarse” your model’s tastes are, it can pick out the best guess among 10^4, 10^5 guesses, etc.
(I. e., then the inference-time scaling law isn’t just “train any good-enough meta-heuristic and pour compute”, it’s “we can train increasingly better meta-heuristics, and the more compute they can usefully consume at inference-time, the better the performance”.)
(Also: notably, the issue with the transfer-of-performance might be that how “refined” the meta-heuristic’s taste is depends on the domain. E. g., for math, the error term might be 10^-5, for programming 10^-4, and for “soft” domains, 10^-1.)
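A quick numerical illustration of that Q + e story (all the distributions below are arbitrary choices for the toy): the finer the judge's error term, the larger the pool of guesses from which it can still reliably pull out the genuinely best one.

```python
import random

def p_picks_true_best(n_guesses, judge_noise, trials=1000):
    """How often the judge's top pick (ranked by Q + e, with
    e ~ Normal(0, judge_noise)) is the genuinely best guess."""
    hits = 0
    for _ in range(trials):
        qualities = [random.uniform(0, 100) for _ in range(n_guesses)]
        picked = max(qualities, key=lambda q: q + random.gauss(0, judge_noise))
        hits += picked == max(qualities)
    return hits / trials

# The finer the judge's error term, the more guesses it can discriminate among:
# a coarse judge starts picking near-best guesses at random as n grows,
# while a refined judge keeps finding the actual best one.
for noise in (1.0, 0.1, 0.01):
    print(noise, [round(p_picks_true_best(n, noise), 2) for n in (10, 100, 1000)])
```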
Does inference compute scaling mean that o3 will use ten supercomputers for one hour per prompt, o4 will use a hundred supercomputers for ten hours per prompt, and o5 will use a thousand supercomputers for a hundred hours per prompt?
Not necessarily. The strength of the LLM being steered, and the quality of the meta-heuristics doing the steering, matter. GPT-5 could plausibly outperform o3-full for much less inference-time compute, by needing shorter CoTs. “o3.5”, using the same LLM but equipped with a better-trained meta-level heuristic, could likewise outperform o3 by having better judgement regarding which trains of thought to pursue (roughly, the best guess of o3.5 among 10 guesses would be as good as the best guess of o3 among 100 guesses).
And if my guess is right that a given meta-heuristic can only make use of a fixed quantity of compute, then yes, o[3+n] models would also be able to usefully consume more raw compute.
Edit: I. e., there are basically three variables at play here:
How many guesses it needs to find a guess with a ground-truth quality above some threshold. (How refined the “steering” meta-heuristic is. What is the ground-truth quality of the best guess in 100 guesses it generated? How much is the probability distribution over guesses skewed towards the high-quality guesses?)
How refined the tastes of the “pruning” meta-heuristic are. (I. e., the size of the error e in the toy Q + e model above. Mediates the number of guesses among which it can pick the actual best one, assuming that they’re drawn from a fixed distribution.)
How long the high-quality CoTs are. (E. g., recall how much useless work/backtracking o1's publicly shown CoTs seem to do, and how much more efficient it’d be if the base LLM were smart enough to just instantly output the correct answer, on pure instinct.)
Improving on (1) and (3) increases the efficiency of the compute used by the models. Improving on (2) lets models productively use more compute.
And notably, capabilities could grow either from improving on (2), in a straightforward manner, or from improving (1) and (3). For example, suppose that there’s a “taste overhang”, in that o3's tastes are refined enough to reliably pick the best guess out of 10^9 guesses (drawn from a fixed distribution), but it is only economical to let it generate 10^5 guesses. Then improving on (1) (skewing the distribution towards the high-quality guesses) and (3) (making guesses cheaper) would not only reduce the costs, but also increase the quality of the ultimately-picked guesses.
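Continuing the same toy model, here is roughly what a “taste overhang” would look like in simulation: with the affordable number of guesses held fixed, refining the judge further buys nothing, while skewing the generator towards better guesses (variable (1)) still raises the quality of the pick. The skew parameter is just an arbitrary way of modelling “the distribution of guesses gets better”.

```python
import random

def avg_picked_quality(n, judge_noise, skew=1.0, trials=2000):
    """Mean true quality of the judge's pick; skew > 1 tilts the generator
    towards high-quality guesses (i.e. improving variable (1) above)."""
    total = 0.0
    for _ in range(trials):
        qualities = [100 * random.random() ** (1 / skew) for _ in range(n)]
        total += max(qualities, key=lambda q: q + random.gauss(0, judge_noise))
    return total / trials

# With only n=100 guesses affordable, a judge with error 0.1 is already
# near-perfect, so refining it further ("more taste") changes little...
print(avg_picked_quality(100, 0.1), avg_picked_quality(100, 0.001))
# ...while tilting the generator towards better guesses still helps.
print(avg_picked_quality(100, 0.1, skew=4.0))
```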
(My intuition is that there’s no taste overhang, though; and also that the tastes indeed get increasingly less refined the farther you move from the machine-verifiable domains.)
Am I understanding right that inference-time compute scaling is useful for coding, math, and other things that are machine-checkable, but not for writing, basic science, and other things that aren’t machine-checkable?
I think it would be very surprising if it wasn’t useful at all—a human who spends time rewriting and revising their essay is making it better by spending more compute. When I do creative writing with LLMs, their outputs seem to be improved if we spend some time brainstorming the details of the content beforehand, with them then being able to tap into the details we’ve been thinking about.
It’s certainly going to be harder to train without machine-checkable criteria. But I’d be surprised if it was impossible—you can always do things like training a model to predict how much a human rater would like literary outputs, and gradually improve the rater models. Probably people are focusing on things like programming first both because it’s easier and also because there’s money in it.
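One standard way to do the “train a model to predict how much a human rater would like it” step is a pairwise reward model trained on preference comparisons. Below is a minimal PyTorch sketch; the embedding-based architecture and all the hyperparameters are arbitrary illustrative choices, not anyone's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RaterModel(nn.Module):
    """Toy reward model: maps a precomputed text embedding to a scalar
    'how much would a human rater like this piece of writing?' score."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, embedding):
        return self.head(embedding).squeeze(-1)

def preference_loss(model, preferred, rejected):
    """Pairwise (Bradley-Terry style) objective: push the score of the output
    the human preferred above the score of the output they rejected."""
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()

# One toy training step on random tensors standing in for embedded essay pairs.
model = RaterModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
preferred, rejected = torch.randn(32, 768), torch.randn(32, 768)
loss = preference_loss(model, preferred, rejected)
opt.zero_grad()
loss.backward()
opt.step()
```

The trained scores then play the same role the unit-test pass rate plays in the machine-checkable setting, just with a noisier ceiling set by how good the rater model is.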
What is o3 doing that you couldn’t do by running o1 on more computers for longer?
Unclear, but at the $20-per-task setting on ARC-AGI it uses only 6 reasoning traces and still gets much better results than o1, so it’s not just about throwing $4,000 at the problem. Possibly it’s based on GPT-4.5 or trained on more tests.