Who has written up forecasts on how reasoning will scale?
I see people say that e.g. the marginal cost of training DeepSeek R1 over DeepSeek v3 was very little. And I see people say that reasoning capabilities will scale a lot further than they already have. So what’s the roadblock? Doesn’t seem to be compute, so it’s probably algorithmic.
As a non-technical person I don't really know how to model this (beyond a vague sense, from posts I've read here, that reasoning length will increase exponentially and that this will translate into significantly better problem-solving and more agency), but it seems pretty central to forming timelines. So, has anyone written anything informative about this?
An interesting and important question.
We have data on how problem-solving ability scales with reasoning time for a fixed model. This isn't quite your question, but it's related: IIRC, performance is roughly logarithmic in the amount of test-time compute.
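For a concrete functional form (illustrative only, not a fitted law from any particular paper): if C is the test-time compute spent on reasoning, the reported curves look roughly like

```latex
\text{score}(C) \;\approx\; a + b \log C
```

with task-dependent constants a and b, and with the curve necessarily flattening as the score approaches its ceiling.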
The important question is how far we can push the technique by which reasoning models are trained. They are trained by having them solve a problem with a chain of thought (CoT), then look at their own CoT and ask "how could I have thought that faster?" How far this technique can be pushed is unclear, at least to those of us outside the main AI labs.
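The comment above frames this as the model learning to "think faster"; one common concrete instantiation, and the one the R1-Zero remark further down refers to, is reinforcement learning against verifiable rewards. Here is a toy sketch of that loop, with every component a made-up stand-in (a real pipeline would run a policy-gradient step such as PPO or GRPO on an LLM's weights):

```python
# Toy sketch of "RL against verifiable rewards". Everything here is an
# illustrative stand-in, not any lab's actual code: the "policy" is a single
# success-probability number, and "update" just nudges it.
import random

def sample_rollout(policy, problem):
    """Stand-in for sampling a chain of thought plus a final answer."""
    got_it_right = random.random() < policy["p_correct"]
    answer = problem["answer"] if got_it_right else problem["answer"] + 1
    return {"trace": "<reasoning tokens would go here>", "answer": answer}

def verify(problem, rollout):
    """Verifiable reward: 1 if the final answer checks out, else 0."""
    return 1.0 if rollout["answer"] == problem["answer"] else 0.0

def update(policy, rewards, lr=0.05):
    """Crude stand-in for a policy-gradient step: the more rollouts the
    verifier accepted, the more we push the policy toward success."""
    mean_reward = sum(rewards) / len(rewards)
    policy["p_correct"] = min(1.0, policy["p_correct"] + lr * mean_reward)

# A pile of verifiable tasks (the scarce resource discussed below).
problems = [{"question": f"{i} + {i} = ?", "answer": 2 * i} for i in range(100)]
policy = {"p_correct": 0.2}

for step in range(200):
    batch = random.sample(problems, 8)
    rollouts = [sample_rollout(policy, p) for p in batch]
    rewards = [verify(p, r) for p, r in zip(batch, rollouts)]
    update(policy, rewards)

print("success rate after training:", round(policy["p_correct"], 2))
```

The bottleneck the thread worries about is the `problems` list: how far the supply of tasks with automatically checkable answers actually extends.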
The known scaling laws are unsatisfying from the point of view of someone who actually wants to know what will happen next. They can predict numbers like the score on a given benchmark, or residual perplexity. But they can't predict the emergence of new abilities like "can translate languages" or "can tell a joke" or "can take over the world."
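For reference, the kind of quantity these laws do predict is pretraining loss as a function of scale, e.g. the Chinchilla-style form (constants omitted; they are fit empirically):

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

where N is parameter count and D is training tokens. Nothing in an equation like this tells you at what loss "can tell a joke" shows up.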
By the way, I wouldn't put too much weight on any claims about the "marginal cost of training DeepSeek R1 over DeepSeek v3". DeepSeek has a track record of understating how much effort it took to do something. I'm not saying it's outright dishonesty (although it might be), but at minimum they don't count costs that other companies include, so their estimates come out looking much lower than everyone else's.
One non-technical forecast, related to GPT-4.5's announcement: https://x.com/davey_morse/status/1895563170405646458
Thanks. One thing that confuses me: if this is true, why do mini reasoning models often seem to outperform their full-size counterparts on certain tasks?
E.g., Grok 3 beta mini (Think) performed overall roughly the same as or better than Grok 3 beta (Think) on benchmarks[1], and I remember something similar with OpenAI's reasoning models.
[1] https://x.ai/blog/grok-3
Full Grok 3 only had a month for post-training, and keeping responses on general topics reasonable is a fiddly semi-manual process. They didn’t necessarily have the R1-Zero idea either, which might make long reasoning easier to scale automatically (as long as you have enough verifiable tasks, which is the thing that plausibly fails to scale very far).
Also, running long reasoning traces for a big model is more expensive and takes longer, so the default settings will tend to give smaller reasoning models more tokens to reason with, skewing the comparison.
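A back-of-the-envelope illustration of that skew, using made-up per-token prices (not xAI's or anyone's actual numbers): at the same spend per query, the cheaper model can afford an order of magnitude more reasoning tokens.

```python
# Hypothetical prices, purely to illustrate the token-budget asymmetry.
budget_per_query = 0.10        # dollars spent per answer (made up)
big_price = 15 / 1_000_000     # $ per output token, big model (made up)
mini_price = 1.5 / 1_000_000   # $ per output token, mini model (made up)

print("big model reasoning budget: ", int(budget_per_query / big_price), "tokens")
print("mini model reasoning budget:", int(budget_per_query / mini_price), "tokens")
```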