Using $4K per task means a lot of inference in parallel, which wasn’t in o1. So that’s one possible source of improvement, maybe it’s running MCTS instead of individual long traces (including on low settings at $20 per task). And it might be built on the 100K H100s base model.
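As a rough sanity check on the parallelism claim, a back-of-envelope calculation helps. The pricing and trace length below are my own assumptions for illustration, not numbers from the post:

```python
# Back-of-envelope: why $4K per task implies heavy parallel sampling.
# Assumed numbers (illustrative, not from the post): $60 per million
# output tokens, ~100K reasoning tokens per individual long trace.
price_per_token = 60 / 1_000_000   # dollars per output token
tokens_per_trace = 100_000         # tokens in one long reasoning trace

cost_per_trace = tokens_per_trace * price_per_token  # dollars per trace
traces_at_4k = 4_000 / cost_per_trace  # traces affordable at $4K/task
traces_at_20 = 20 / cost_per_trace     # traces affordable at $20/task

print(f"one trace ≈ ${cost_per_trace:.0f}")
print(f"$4K buys ≈ {traces_at_4k:.0f} traces, $20 buys ≈ {traces_at_20:.0f}")
```

Under these assumptions a single long trace costs on the order of $6, so $4K per task buys hundreds of traces, which is only usable as some form of parallel sampling or search, while $20 per task buys only a handful.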
The scarier, less plausible option is that RL training scales, so it's mostly o1 trained with more compute, and $4K per task is an inefficient premium option on top rather than a higher setting of o3's source of power.
The obvious boring guess is best of n. Maybe you’re asserting that using $4,000 implies that they’re doing more than that.
Performance at $20 per task is already much better than o1's, so it can't be just best-of-n: you'd need many more attempts to improve that much, even with a very good verifier that recognizes a correct solution (plausible at $4K per task, but not at $20 per task). There are also various clever beam search options that don't make inference much more expensive, but in principle might give a boost at low cost (compared to not using them at all).
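The best-of-n argument can be made concrete with a toy model (this is an illustrative sketch, not a claim about o3's actual method): if each independent sample is correct with probability p and a perfect verifier accepts the first correct one, the solve rate is 1 − (1 − p)^n.

```python
# Toy model of best-of-n with a perfect verifier: each of n independent
# samples is correct with probability p; the task is solved if any
# sample is correct, so the solve rate is 1 - (1 - p)^n.
def solve_rate(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# Illustrative per-attempt success rate (assumed, not measured):
p = 0.05
low_budget = solve_rate(p, 3)     # ~3 samples affordable at $20/task
high_budget = solve_rate(p, 600)  # ~600 samples affordable at $4K/task

print(f"$20 budget:  {low_budget:.3f}")
print(f"$4K budget:  {high_budget:.3f}")
```

This is the point of the comment: three-ish samples barely move the solve rate, so a large jump at $20 per task has to come from stronger individual traces or smarter search, not from plain best-of-n.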
There’s still no word on the 100K H100s model, so that’s another possibility. Currently Claude 3.5 Sonnet seems to be better at System 1, while OpenAI o1 is better at System 2, and combining these advantages in o3 based on a yet-unannounced GPT-4.5o base model that’s better than Claude 3.5 Sonnet might be sufficient to explain the improvement. Without any public 100K H100s Chinchilla-optimal models, it’s hard to say how much that alone should help.
Anyone want to guess how capable Claude's System 2 model will be once it's polished? I expect it to be better than o3 by a small amount.
The ARC-AGI page (which I think has been updated) currently says: