My predictions are looking pretty reasonable, maybe a bit underconfident in AI progress.
70% probability: A team of 3 top research ML engineers with fine-tuning access to GPT-4o (including SFT and RL), $10 million in compute, and 1 year of time could use GPT-4o to surpass typical naive MTurk performance at ARC-AGI on the test set while using less than $100 per problem at runtime (as denominated by GPT-4o API costs).
With $20 per task, it looks like o3 is matching MTurk performance on the semi-private set and solving it on the public set. This likely depended on other advances in RL and a bunch of other training, but probably much less than $10 million + 3 top ML researchers + 1 year was dedicated to ARC-AGI in particular.
I wasn’t expecting OpenAI to specifically try on ARC-AGI, so I wasn’t expecting this level of performance this fast (and I lost some mana due to this).
35% probability: Under the above conditions, 85% on the test set would be achieved. It’s unclear which humans perform at >=85% on the test set, though this is probably not that hard for smart humans.
Looks like o3 falls short of this, even with $100 per problem (o3 high compute is just barely over 85%, but uses 172x the compute of the low-compute configuration). Probably I should have been higher than 35%, depending on how we count transfer from other work on RL etc.
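To spell out the cost arithmetic: taking the ~$20/task figure for o3 low compute and the reported ~172x multiplier for the high-compute configuration (both figures as stated above; the exact pricing is an assumption), the high-compute config blows far past the $100/problem budget in the prediction:

```python
# Back-of-envelope cost check for the prediction's $100/problem budget.
# Assumed figures: ~$20/task for o3 low compute, ~172x compute for high compute.
low_cost_per_task = 20.0     # dollars per task, low-compute o3 (assumption)
compute_multiplier = 172     # high-compute vs. low-compute config
budget_per_problem = 100.0   # runtime budget from the original prediction

high_cost_per_task = low_cost_per_task * compute_multiplier
print(f"high-compute cost/task ~= ${high_cost_per_task:,.0f}")
print(f"over budget: {high_cost_per_task > budget_per_problem}")
```

So only configurations near the $20 tier fit under the $100/problem constraint, which is why the high-compute result (barely over 85%) doesn't satisfy the prediction as stated.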
80% probability: next-generation multimodal models (e.g. GPT-5) will be able to substantially advance performance on ARC-AGI.
Seems clearly true if we count o3 as a next-generation multimodal model. Idk how we should have counted o1, though I also think o1 arguably substantially advanced performance.