The evidence they might have (as opposed to evidence-agnostic motivation to feed the narrative) is scaling laws for verifiable task RL training, for which there are still no clues in public (that is, what happens if you try to use a whole lot more compute there). It might improve dramatically either with additional RL training or with a higher-quality pretrained base model underneath, even with only a feasible number of verifiable tasks.
OpenAI almost certainly has a reasoning model based on GPT-4.5 at this point (whether it's o3 or not), and Anthropic very likely has one based on a bigger base model than Sonnet-3.5. Grok 3 and Gemini 2.0 Pro are probably overtrained in order to be cheap to serve and feel weaker than GPT-4.5, and so far we've only seen a reasoning model for Grok 3. I think the labs mostly aren't deploying the 3e26-FLOPs reasoning models because Blackwell isn't ready yet, so those models would be too slow and expensive to serve, though that doesn't explain Google's behavior.
scaling laws for verifiable task RL training, for which there are still no clues in public
Why is that the case? While checking how it scales past the o3/GPT-4 level isn't possible, I'd expect people to have run experiments at the lower end of the scale (using smaller models or less RL training), fitted a line to the results, and then assumed it extrapolates. Is there reason to think that wouldn't work?
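To make that concrete, here is a minimal sketch of the fit-and-extrapolate approach described above. Every number is made up, and the power-law form is an illustrative assumption, not a known result:

```python
import numpy as np

# Toy illustration only: every number below is made up. Fit a power law
# score ~ a * compute^b to hypothetical small-scale RL runs, then read off
# a prediction at much larger compute.
compute = np.array([1e21, 3e21, 1e22, 3e22, 1e23])   # RL training FLOPs
score = np.array([12.0, 16.5, 22.0, 29.5, 40.0])     # benchmark score, %

# A straight line in log-log space: log(score) = b * log(compute) + log(a).
b, log_a = np.polyfit(np.log(compute), np.log(score), deg=1)

# Extrapolate two orders of magnitude past the largest experiment.
target = 1e25
predicted = np.exp(log_a) * target ** b
print(f"fitted exponent b = {b:.2f}, predicted score at 1e25 FLOPs = {predicted:.0f}%")
# Note the prediction can blow past 100%: the naive fit says nothing about
# where (or whether) the trend saturates.
```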
It needs verifiable tasks, which might have to be crafted manually. It's unknown what happens if you train too long on only a feasible number of tasks, even if they are progressively more difficult. When data can't be generated well, RL needs early stopping and has a bound on how far it can go; in this case that bound could depend on the scale of the base model, or on the number and quality of verifiable tasks.
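As a schematic illustration of that bound (not any lab's actual pipeline; `update_on` and `pass_rate` are hypothetical stand-ins for an RL update against task verifiers and a held-out evaluation), the training loop would have to stop once held-out performance on the finite task pool stalls:

```python
import random

# Schematic only. `policy.update_on` and `policy.pass_rate` are hypothetical
# stand-ins, not any real library's API.
def train_with_early_stopping(policy, train_tasks, heldout_tasks,
                              max_steps=10_000, batch_size=64, patience=5):
    best, since_best = 0.0, 0
    for _ in range(max_steps):
        batch = random.sample(train_tasks, k=min(batch_size, len(train_tasks)))
        policy.update_on(batch)                  # one RL step on the finite task pool
        score = policy.pass_rate(heldout_tasks)  # fraction of held-out tasks solved
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
        if since_best >= patience:  # held-out performance has stalled:
            break                   # more training mostly overfits the pool
    return best
```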
Depending on how it works, it might be impossible to usefully spend $100M on RL training at all, or scaling up pretraining might have a disproportionately large effect on the quality of the optimally trained reasoning model built on it, or approximately no effect at all. A quantitative understanding of this is crucial for forecasting the consequences of the 2025-2028 compute scale-up. AI companies likely have some of this understanding, but it's not public.
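For scale, a back-of-envelope calculation of what $100M of RL training would mean in raw compute, assuming an H100-class rental price of about $2 per GPU-hour, roughly 1e15 dense BF16 FLOP/s per GPU, and 30% utilization (all assumed figures, not reported ones):

```python
# Back-of-envelope only; every figure here is an assumption, not a reported
# number for any lab.
dollars = 100e6                # hypothetical RL training budget
price_per_gpu_hour = 2.0       # assumed H100-class rental price, $/GPU-hour
peak_flops_per_gpu = 1e15      # ~1e15 dense BF16 FLOP/s (order of magnitude)
utilization = 0.3              # assumed effective utilization for RL workloads

gpu_hours = dollars / price_per_gpu_hour
total_flops = gpu_hours * 3600 * peak_flops_per_gpu * utilization
print(f"{gpu_hours:.1e} GPU-hours, ~{total_flops:.1e} FLOPs")
# ~5e7 GPU-hours and ~5e25 FLOPs: the same ballpark as frontier pretraining
# runs, which is exactly the regime where RL scaling behavior is unknown.
```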