It’s hard to find numbers. Here’s what I’ve been able to gather (please let me know if you find better numbers than these!). I’m mostly focusing on FrontierMath.
Pixel counting on the ARC-AGI image, I’m getting $3,644 ± $10 per task.
FrontierMath doesn’t say how many questions they have (!!!). However, they have percent breakdowns by subfield, and those percents are given to the nearest 0.1%; using this, I can narrow the range down to 289-292 problems in the dataset. Previous models solve around 3 problems (4 problems in total were ever solved by any previous model, though the full o1 was not evaluated, only o1-preview was)
o3 solves 25.2% of FrontierMath. This could be 73⁄290. But it is also possible that some questions were removed from the dataset (e.g. because they’re publicly available). 25.2% could also be 72⁄286 or 71⁄282, for example.
The 280 to 290 problems means a rough ballpark for a 95% confidence interval for FrontierMath would be [20%, 30%]. It is pretty strange that the ML community STILL doesn’t put confidence intervals on their benchmarks. If you see a model achieve 30% on FrontierMath later, remember that its confidence interval would overlap with o3′s. (Edit: actually, the confidence interval calculation assumes all problems are sampled from the same pool, which is explicitly not the case for this dataset: some problems are harder than others. So it is hard to get a true confidence interval without rerunning the evaluation several times, which would cost many millions of dollars.)
Using the pricing for ARC-AGI, o3 cost around $1mm to evaluate on the 280-290 problems of FrontierMath. That’s around $3,644 per attempt, or roughly $14,500 per correct solution.
This is actually likely more expensive than hiring a domain-specific expert mathematician for each problem (they’d take at most few hours per problem if you find the right expert, except for the hardest problems which o3 also didn’t solve). Even without hiring a different domain expert per problem, I think if you gave me FrontierMath and told me “solve as many as you can, and you get $15k per correct answer” I’d probably spend like a year and solve a lot of them: if I match o3 within a year, I’d get $1mm, which is much higher than my mathematician salary. (o3 has an edge in speed, of course, but you could parallelize the hiring of mathematicians too.) I think this is the first model I’ve seen which gets paid more than I do!
This is actually likely more expensive than hiring a domain-specific expert mathematician for each problem
I don’t think anchoring to o3′s current cost-efficiency is a reasonable thing to do. Now that AI has the capability to solve these problems in-principle, buying this capability is probably going to get orders of magnitude cheaper within the next five minutes months, as they find various algorithmic shortcuts.
I would guess that OpenAI did this using a non-optimized model because they expected it to be net beneficial: that producing a headline-grabbing result now will attract more counterfactual investment than e. g. the $900k they’d save by running the benchmarks half a year later.
Edit: In fact, if, against these expectations, the implementation of o3′s trick can’t be made orders-of-magnitude cheaper (say, because a base model of a given size necessarily takes ~n tries/MCTS branches per a FrontierMath problem and you can’t get more efficient than one try per try), that would make me do a massive update against the “inference-time compute” paradigm.
It’s hard to find numbers. Here’s what I’ve been able to gather (please let me know if you find better numbers than these!). I’m mostly focusing on FrontierMath.
Pixel counting on the ARC-AGI image, I’m getting $3,644 ± $10 per task.
FrontierMath doesn’t say how many questions they have (!!!). However, they have percent breakdowns by subfield, and those percents are given to the nearest 0.1%; using this, I can narrow the range down to 289-292 problems in the dataset. Previous models solve around 3 problems (4 problems in total were ever solved by any previous model, though the full o1 was not evaluated, only o1-preview was)
o3 solves 25.2% of FrontierMath. This could be 73⁄290. But it is also possible that some questions were removed from the dataset (e.g. because they’re publicly available). 25.2% could also be 72⁄286 or 71⁄282, for example.
The 280 to 290 problems means a rough ballpark for a 95% confidence interval for FrontierMath would be [20%, 30%]. It is pretty strange that the ML community STILL doesn’t put confidence intervals on their benchmarks. If you see a model achieve 30% on FrontierMath later, remember that its confidence interval would overlap with o3′s. (Edit: actually, the confidence interval calculation assumes all problems are sampled from the same pool, which is explicitly not the case for this dataset: some problems are harder than others. So it is hard to get a true confidence interval without rerunning the evaluation several times, which would cost many millions of dollars.)
Using the pricing for ARC-AGI, o3 cost around $1mm to evaluate on the 280-290 problems of FrontierMath. That’s around $3,644 per attempt, or roughly $14,500 per correct solution.
This is actually likely more expensive than hiring a domain-specific expert mathematician for each problem (they’d take at most few hours per problem if you find the right expert, except for the hardest problems which o3 also didn’t solve). Even without hiring a different domain expert per problem, I think if you gave me FrontierMath and told me “solve as many as you can, and you get $15k per correct answer” I’d probably spend like a year and solve a lot of them: if I match o3 within a year, I’d get $1mm, which is much higher than my mathematician salary. (o3 has an edge in speed, of course, but you could parallelize the hiring of mathematicians too.) I think this is the first model I’ve seen which gets paid more than I do!
I don’t think anchoring to o3′s current cost-efficiency is a reasonable thing to do. Now that AI has the capability to solve these problems in-principle, buying this capability is probably going to get orders of magnitude cheaper within the next five
minutesmonths, as they find various algorithmic shortcuts.I would guess that OpenAI did this using a non-optimized model because they expected it to be net beneficial: that producing a headline-grabbing result now will attract more counterfactual investment than e. g. the $900k they’d save by running the benchmarks half a year later.
Edit: In fact, if, against these expectations, the implementation of o3′s trick can’t be made orders-of-magnitude cheaper (say, because a base model of a given size necessarily takes ~n tries/MCTS branches per a FrontierMath problem and you can’t get more efficient than one try per try), that would make me do a massive update against the “inference-time compute” paradigm.