Jason Wei: “o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute. Way faster than pretraining paradigm of new model every 1-2 years”
Nat McAleese: “o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model is very, very impressive. (2/n)”
The prices leaked by the ARC-AGI people indicate $60 per million output tokens, which is also the current o1 pricing: 33M total tokens at a cost of $2,012.
Notably, the Codeforces graph with pricing puts o3 about 3x higher than o1 (though maybe it’s secretly a log scale), and the ARC-AGI graph has the cost of o3 at 10-20x that of o1-preview. Maybe this indicates it does a bunch more test-time reasoning. That’s corroborated by ARC-AGI’s reported average of 55k tokens per solution[1], which seems like a ton.
I think this evidence indicates this is likely the same base model as o1, though I’d only put it at around 65%, so not super confident.
[1] Edited to add, because the phrasing is odd: this is the data being used for the estimate, and the estimate is 33M tokens / (100 tasks × 6 samples per task) ≈ 55k tokens per sample. I called this a “solution” because I expect these are basically 6 independent attempts at answering the prompt, but somebody else might interpret things differently. The last column of the table is “Time/Task (mins)”.
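For clarity, the back-of-the-envelope estimate can be reproduced directly; a minimal sketch using the figures quoted above (33M total tokens, 100 tasks, 6 samples per task):

```python
# Token-per-sample estimate from the leaked ARC-AGI numbers.
total_tokens = 33_000_000   # 33M total output tokens (reported)
tasks = 100                 # tasks in the eval set (reported)
samples_per_task = 6        # independent attempts ("samples") per task (reported)

tokens_per_sample = total_tokens / (tasks * samples_per_task)
print(tokens_per_sample)    # 55000.0 -> ~55k tokens per sample
```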
GPT-4o costs $10 per 1M output tokens, so the price of $60 per 1M tokens is itself more than 6 times higher than it has to be, which means they can afford to sell a much more expensive model at the same price. It could also be a “GPT-4.5o-mini” or something: similar in size to GPT-4o but stronger, with knowledge distillation from a full GPT-4.5o, given that a new training system has probably been available for 6+ months now.
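As a quick sanity check on the pricing-headroom argument, here is the arithmetic spelled out; the list prices are taken from the discussion above, and the 6x figure is just a ratio of list prices, not a claim about actual serving costs:

```python
# Pricing headroom: how much more expensive a model could be served
# at o1's list price, relative to GPT-4o's list price.
o1_price = 60.0     # $/1M output tokens (o1, and the leaked o3 price)
gpt4o_price = 10.0  # $/1M output tokens (GPT-4o)

headroom = o1_price / gpt4o_price
print(headroom)     # 6.0 -> room for a ~6x more expensive model at the same price

# Cross-check the leaked total cost against the token count.
total_tokens_m = 33.0               # 33M output tokens (reported)
implied_cost = total_tokens_m * o1_price
print(implied_cost)                 # 1980.0 -> close to the reported $2,012
```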
Regarding whether this is a new base model, the quotes from Jason Wei and Nat McAleese above are the main evidence.