The obvious boring guess is best-of-n. Maybe you're asserting that spending $4,000 per task implies they're doing more than that.
Performance at $20 per task is already much better than o1's, so it can't be just best-of-n: you'd need more attempts to improve that much, even with a very good verifier that notices a correct solution (that's plausible at $4K per task, but not at $20 per task). There are various clever beam search options that don't make inference much more expensive, and in principle they might give a boost at low cost (compared to not using them at all).
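To make the best-of-n arithmetic concrete, here is a rough sketch. It assumes attempts are independent, each succeeds with the same probability `p`, and the verifier is perfect (it always picks out a correct solution if one exists). The function names and the numbers in the usage lines are hypothetical, chosen only for illustration; they are not o1 or o3 measurements.

```python
import math


def best_of_n_success(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts is correct.

    Assumes a perfect verifier, so one correct attempt is enough.
    """
    return 1.0 - (1.0 - p) ** n


def attempts_needed(p: float, target: float) -> int:
    """Smallest n such that best-of-n reaches the target success rate."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must be strictly between 0 and 1")
    if target <= p:
        return 1
    # Solve 1 - (1 - p)^n >= target for n.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# Hypothetical per-attempt success rates, not real benchmark scores.
print(attempts_needed(0.25, 0.75))  # attempts to go from 25% to 75%
print(attempts_needed(0.10, 0.90))  # attempts to go from 10% to 90%
```

Since cost grows roughly linearly with the number of attempts, a fixed per-task budget caps n, which is why a large jump at low cost is hard to explain with best-of-n alone. An imperfect verifier would only make the required n larger.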
There's still no word on the 100K H100s model, so that's another possibility. Currently Claude 3.5 Sonnet seems to be better at System 1, while OpenAI o1 is better at System 2, and combining these advantages in o3, built on a yet-unannounced GPT-4.5o base model that's better than Claude 3.5 Sonnet, might be sufficient to explain the improvement. Without any public Chinchilla-optimal models trained on 100K H100s, it's hard to say how much that alone should help.
Anyone want to guess how capable Claude's System 2 level will be once it is polished? I expect it to be better than o3 by a small amount.