Noticing progress in long reasoning models like o3 creates a blind spot different from the one in popular reporting on how scaling of pretraining is stalling out. It can appear that long reasoning models reconcile the central claim that pretraining is stalling out with the observation that AI progress is moving fast. But the plausible success of reasoning models instead suggests that pretraining will continue scaling even further[1] than could have been expected before.
Training systems were already on track to go from 50 MW, training current models for up to 1e26 FLOPs, to 150 MW in late 2024, and then to 1 GW by the end of 2025, training models for up to 5e27 FLOPs in 2026, which is 250x the compute of the original GPT-4. But with o3, it now seems more plausible that $150bn training systems will be built in 2026-2027, training models for up to 5e28 FLOPs in 2027-2028, which is 500x the compute of the currently deployed 1e26 FLOPs models, or 2500x the compute of the original GPT-4.
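The multipliers above can be checked with a few lines of arithmetic. This is only a sketch of the numbers quoted in this post: the original GPT-4 figure is not stated directly, so it is back-derived here from the "250x" claim as an assumption, and the variable names are made up for illustration.

```python
# Compute budgets quoted in the text, in FLOPs.
CURRENT_MODELS = 1e26    # models trained on ~50 MW systems
MODELS_2026 = 5e27       # models trained on ~1 GW systems
MODELS_2027_2028 = 5e28  # models trained on ~$150bn systems

# Assumption: implied by "5e27 FLOPs is 250x the original GPT-4",
# not a figure given directly in the text.
GPT4_ORIGINAL = MODELS_2026 / 250  # about 2e25 FLOPs


def multiplier(target_flops: float, baseline_flops: float) -> float:
    """Return how many times larger target_flops is than baseline_flops."""
    return target_flops / baseline_flops


vs_current = multiplier(MODELS_2027_2028, CURRENT_MODELS)
vs_gpt4 = multiplier(MODELS_2027_2028, GPT4_ORIGINAL)

print(f"5e28 vs. current 1e26 models: {vs_current:.0f}x")
print(f"5e28 vs. original GPT-4:      {vs_gpt4:.0f}x")
```

Under that assumption the two ratios agree with the 500x and 2500x figures in the paragraph above.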
Scaling of pretraining is not stalling out, even without the new long reasoning paradigm. It might begin stalling out in 2026 at the earliest, but now more likely only in 2028. The issue is that the scale of training systems is not directly visible: there is a 1-2 year lag between decisions to build them and the observed resulting AI progress.
Reporting on how scaling is stalling out might have a point, in that returns on scale may be getting worse than expected. But if scale still keeps increasing despite that, there will be capabilities resulting from the additional scale. Scaling by 10x in compute might do very little, and this is compatible with scaling by 500x in compute bringing a qualitative change.