Great point! In the Blocksworld paper, they re-randomize the obfuscated version, change the prompt, etc. (‘randomized mystery blocksworld’). They do see a 30% accuracy dip when doing that, but o1-preview’s performance is still 50x that of the best previous model (and >200x that of GPT-4 and Sonnet-3.5). With ARC-AGI there’s no way to tell, though, since they don’t test o1-preview on the fully private held-out set of problems.