Last week there was some uncertainty about whether @RyanPGreenblatt’s ARC-AGI solution was really sota, because many other solutions did better on public eval and we didn’t have private test results. There is now a semi-private eval set; he’s at the top of this leaderboard.
Our guess is that Ryan’s technique beats other solutions despite performing worse on the public eval because other solutions are more overfit to the public eval. (But we don’t know how MindsAI’s solution (@Jcole75Cole), which is sota on Kaggle, performs on this eval set.)
This result doesn’t clarify everything, but it at least addresses concerns that Ryan’s solution is overfit because of contamination in the data OpenAI used to pretrain GPT-4o.
Thanks to the ARC team for helping with running Ryan’s submission, and to @Jcole75Cole and @MaxNadeau_ for helpful discussion, and thanks to the community as a whole for being chill during the uncertainty here.
And see https://x.com/bshlgrs/status/1806397587085468116 for some discussion.