Nice article! I’m still somewhat concerned that o1’s performance increase can be partially attributed to the benchmarks (Blocksworld, ARC-AGI) having existed on the internet for a while, and thus having made their way into updated training corpora (which, of course, we don’t have access to). So an alternative hypothesis would simply be that o1 is still doing pattern matching, just with better and more relevant data to pattern-match against here. Still, I don’t think this can fully explain the observed increase in capabilities, so I agree with the high-level argument you present.
Great point! In the Blocksworld paper, they re-randomize the obfuscated version, change the prompt, etc. (‘randomized mystery blocksworld’). They do see a 30% accuracy dip when doing that, but o1-preview’s performance is still 50x that of the best previous model (and >200x that of GPT-4 and Sonnet-3.5). With ARC-AGI there’s no way to tell, though, since they don’t test o1-preview on the fully private held-out set of problems.