What do people mean when they say that o1 and o3 have “opened up new scaling laws” and that inference-time compute will be really exciting?
The standard scaling law people talk about is for pretraining, shown in the Kaplan and Hoffmann (Chinchilla) papers.
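For reference, the Chinchilla paper fits pretraining loss with a parametric form of roughly this shape, where N is the parameter count and D the number of training tokens (only the functional form is shown here; the fitted constants are in the paper):

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$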
It is also the case that various post-training (i.e., finetuning) techniques improve performance, though I don’t think there is as clean a scaling law there (I’m unsure). See, e.g., this paper, which I just found via googling fine-tuning scaling laws. See also the Tülu 3 paper, Figure 4.
We have also already seen scaling law-type trends for inference compute, e.g., this paper:
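As a toy version of that kind of trend (not the linked paper’s actual method, just the generic repeated-sampling idea): if you draw k attempts per problem and a verifier can pick out a correct one, the expected solve rate climbs smoothly as k, and hence inference compute, grows; plotted against log(k) you get the familiar-looking curve. A minimal sketch with made-up per-problem success rates:

```python
import numpy as np

# Toy illustration (not the linked paper's setup): best-of-k sampling with a
# perfect verifier. The per-problem solve rates below are made up.
rng = np.random.default_rng(0)
p = rng.beta(0.5, 3.0, size=1000)           # hypothetical per-problem solve rates

for k in [1, 4, 16, 64, 256, 1024]:         # k = number of sampled attempts
    coverage = np.mean(1 - (1 - p) ** k)    # P(at least one attempt is correct)
    print(f"k={k:5d}  expected solve rate={coverage:.3f}")
```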
The o1 blog post points out that they are observing two scaling trends: predictable scaling w.r.t. post-training (RL) compute, and predictable scaling w.r.t. inference compute:
The paragraph before this image says: “We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.” That is, the left graph is about post-training (RL) compute and the right graph is about inference (test-time) compute.
Following from the graph on the left, the o1 paradigm gives us models that are better for a fixed inference compute budget (which is basically what it means to train a model for longer, or to train a better model of the same size by using better algorithms — the method is new but not the trend). Following from the graph on the right, performance seems to scale well with inference compute budget. I’m not sure there’s sufficient public data to compare that right-hand graph against other inference-compute scaling methods, but my guess is the returns are better.
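If that data did exist, one simple way to make the comparison would be to fit each method’s accuracy against log(inference compute) and compare the fits. A sketch with entirely made-up numbers:

```python
import numpy as np

# Hypothetical (made-up) accuracy vs. inference-compute points for two methods;
# the point is the comparison recipe, not the numbers.
compute = np.array([1e2, 1e3, 1e4, 1e5])        # arbitrary compute units
acc_a = np.array([0.30, 0.42, 0.55, 0.67])      # e.g., an o1-style model
acc_b = np.array([0.30, 0.36, 0.43, 0.49])      # e.g., plain best-of-n sampling

for name, acc in [("method A", acc_a), ("method B", acc_b)]:
    slope, intercept = np.polyfit(np.log10(compute), acc, 1)
    print(f"{name}: about {slope:.2f} accuracy gained per 10x more inference compute")
```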
What is o3 doing that you couldn’t do by running o1 on more computers for longer?
I mean, if you replace “o1” in this sentence with “monkeys typing Shakespeare with ground truth verification,” it’s true, right? But o3 is actually a smarter mind in some sense, so it takes [presumably much] less inference compute to get similar performance. For instance, see this graph about o3-mini:
The performance-per-dollar frontier is pushed up by the o3-mini models. It would be somewhat interesting to know how much it would cost for o1 to reach o3 performance here, but my guess is that it’s a huge amount and practically impossible. That is, there are some performance levels that are practically unobtainable for o1, the same way the monkeys won’t actually complete Shakespeare.
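To make that intuition concrete: if a model’s benchmark accuracy grows roughly linearly in log(cost) and saturates at some ceiling, then the cost to hit a target accuracy grows exponentially with the accuracy gap, and targets above the ceiling are unreachable at any budget. A sketch with invented numbers (none of these are real o1 figures):

```python
# Invented numbers for illustration only: suppose an older model's accuracy
# follows  acc ~ a + b * log10(cost)  up to a saturation ceiling.
a, b, ceiling = 0.40, 0.08, 0.72      # hypothetical fit and ceiling

def cost_to_reach(target_acc):
    if target_acc >= ceiling:
        return float("inf")           # the curve flattens out before the target
    return 10 ** ((target_acc - a) / b)

for target in [0.55, 0.65, 0.70, 0.75]:
    print(f"target accuracy {target:.2f} -> estimated cost {cost_to_reach(target):,.0f} (arbitrary units)")
```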
Maybe a dumb question, but those log-scale graphs have uneven ticks on the x-axis. Is there a reason they structured it like that beyond trying to draw a straight line? I suspect there is a good reason and it’s not dishonesty, but this does look like something one would do if you wanted to exaggerate the slope.
I believe this is standard/acceptable for presenting log-axis data, but I’m not sure. This is a graph from the Kaplan paper:
It is certainly frustrating that they don’t label the x-axis. Here’s a quick conversation where I asked GPT-4o to explain. You are correct that a quick look at this graph (where you don’t notice the log scale) would suggest (highly surprising and very strong) linear scaling trends. Scaling laws are generally very sub-linear, in particular often following a power law. I don’t think they were trying to mislead; rather, this is a domain where log-scaled axes are super common, and it doesn’t invalidate the results in any way.
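A quick numeric example of the same point: a trend like accuracy ≈ a + b * log10(compute) gains the same amount at every decade of compute, so it draws a straight line against a log-x axis while being extremely sub-linear in raw compute (the constants below are arbitrary):

```python
import math

# Each 10x of compute buys the same accuracy bump, so the curve is straight
# against log10(compute) but very sub-linear in raw compute. Constants are arbitrary.
a, b = 0.30, 0.10

for compute in [1, 10, 100, 1_000, 10_000]:
    acc = a + b * math.log10(compute)     # +0.10 accuracy per decade of compute
    print(f"compute={compute:>6}  accuracy={acc:.2f}")
```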
Hope that clears things up some!
Ah wait, I was reading it wrong. I thought each tick was an order of magnitude; that looks to be standard notation for log scale. Mischief managed.