Aggregating over independent reasoning traces is a well-known technique that helps somewhat but quickly plateaus. That is why o1/o3 are an important innovation: they use additional tokens much more efficiently and reach greater capability, as long as those tokens are within a single reasoning trace. Once a trace is done, more compute can only go to consensus or best-of-k aggregation over multiple traces, which is more wasteful in compute and quickly plateaus.
The $4000 high-resource config of o3 for ARC-AGI used 1024 traces of about 55K tokens each, the same length as in the low-resource config that runs 6 traces. Possibly longer reasoning traces don't work yet, otherwise a pour-money-on-the-problem option would have used longer traces. So a million-dollar config would just use about 250K reasoning traces of length 55K, which is probably only slightly better than what 1K traces already produce.
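The trace count for a million-dollar config can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes cost scales linearly with the number of traces and that trace length stays at 55K tokens, and the function name is made up for illustration.

```python
# Figures taken from the paragraph above.
HIGH_CONFIG_COST_USD = 4000
HIGH_CONFIG_TRACES = 1024
TOKENS_PER_TRACE = 55_000


def traces_for_budget(budget_usd: float) -> int:
    """Number of whole traces a budget buys, assuming a constant cost per trace."""
    cost_per_trace = HIGH_CONFIG_COST_USD / HIGH_CONFIG_TRACES  # about $3.9
    return int(budget_usd // cost_per_trace)


million_dollar_traces = traces_for_budget(1_000_000)
total_tokens = million_dollar_traces * TOKENS_PER_TRACE
print(million_dollar_traces, total_tokens)
```

Under these assumptions the budget buys roughly a quarter of a million traces, all of the same length as in the cheaper configs.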
In my extrapolation, going from $3,000 to $1,000,000 for one task would move one from 175th to 87th position on the CodeForces leaderboard, which doesn't seem like much.
o1-preview: $1.2 → 1258 Elo
o1: $3 → 1891 Elo
o3 low: $20 → 2300 Elo
o3 high: $3,000 → 2727 Elo
o4: $1,000,000 → ? (ChatGPT gives around 2900 Elo)
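One way to see the plateau in the figures listed above is to compute how much rating each additional 10x of spend buys between consecutive data points. This is just a sketch of that arithmetic, not the method behind the ~2900 estimate; it uses only the four measured points and makes no extrapolation of its own.

```python
import math

# (cost per task in USD, CodeForces rating), as listed above.
points = [
    ("o1-preview", 1.2, 1258),
    ("o1", 3.0, 1891),
    ("o3 low", 20.0, 2300),
    ("o3 high", 3000.0, 2727),
]

# Rating gained per 10x increase in cost between consecutive points.
gains_per_decade = []
for (_, cost_a, elo_a), (name_b, cost_b, elo_b) in zip(points, points[1:]):
    decades = math.log10(cost_b / cost_a)
    gain = (elo_b - elo_a) / decades
    gains_per_decade.append(gain)
    print(f"up to {name_b}: {gain:.0f} rating points per 10x cost")
```

The gain per 10x of spend shrinks at every step, which is consistent with the expectation that another ~2.5 orders of magnitude of spend would buy only a modest rating increase.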