Edit: The situation has evolved but is still somewhat confusing. There is now a leaderboard of scores on the public test set that Ryan is #1 on (see here). But this tweet from Jack Cole indicates that his (many-months-old) solution gets a higher score on the public test set than Ryan’s top score on that leaderboard. I’m not really sure what’s going on here. In particular:
Why isn’t Jack’s solution on the public leaderboard?
Is the semi-private test set the same as the old private set?
If not, is it equal in difficulty to the public test set, or the harder private test set?
Here it says “New high scores are accepted when the semi-private and public evaluation sets are in good agreement”. What does that mean?
One important caveat to the presentation of results in this post (and the discussion on Twitter) is that there are reasons to think this approach may not be SOTA, as it performs similarly to the prior best-performing approach when tested apples-to-apples, i.e. on the same problems.
There are three sets of ARC problems: the public training set, the public eval set, and the private test set.
Buck and Ryan got 71% on the first, 51% on the second, and [we don’t know] on the third.
The past SOTA got [we don’t know] on the first, 52% on the second, and 34% on the third.
Humans get 85% on the first, [we don’t know] on the second, and [we don’t know] on the third.
My two main deductions from this are:
It’s very misleading to compare human perf on the train set with AI perf on either of the test sets; the test sets seem way harder! Note that 71% is approaching 85%, so AIs seem not far from human perf when you compare apples-to-apples. So graphs from the ARC folks, like the one on this page showing little progress towards human-level perf, are not scientifically valid.
Buck and Ryan’s approach doesn’t exceed the past AI SOTA on the only apples-to-apples comparison we have so far. Unclear if it will beat it on the private test set.
Apparently, lots of people get better performance on the public test set than the private one, which is a little surprising given that if you read this page from the ARC folks, you’ll see the following:
“The public training set is significantly easier than the others (public evaluation and private evaluation set) since it contains many ‘curriculum’ type tasks intended to demonstrate Core Knowledge systems. It’s like a tutorial level.”
“The public evaluation sets and the private test sets are intended to be the same difficulty.”
Two explanations come to mind: maybe the public and private test sets are not IID, and/or maybe the past SOTA method overfit to the public set. Chollet claims it’s (accidentally) the latter here, but he doesn’t rule out the former. He says the tasks across the public eval set and the private test set are meant to be equally hard for a human, but he doesn’t say they’re divided in an IID manner.
I guess we’ll see how the results on the public leaderboard shake out.
I agree that there is a good chance that this solution is not actually SOTA, and that it is important to distinguish the three sets.
There’s a further distinction between 3 guesses per problem (which is allowed according to the original specification as Ryan notes), and 2 guesses per problem (which is currently what the leaderboard tracks [rules]).
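For concreteness, here is a minimal sketch of how the 2-vs-3 guess scoring rules differ, assuming one test output per task and exact-match grading (the grid representation and function names are illustrative, not the official harness):

```python
# Minimal sketch of ARC-style top-k scoring: a task counts as solved if any of
# the first k predicted output grids exactly matches the true output grid.
# Assumes one test output per task; real ARC tasks can have several.
Grid = list[list[int]]  # a grid is a 2D array of color indices

def solved(guesses: list[Grid], truth: Grid, k: int) -> bool:
    """True if any of the first k guesses is an exact match."""
    return any(guess == truth for guess in guesses[:k])

def score(all_guesses: list[list[Grid]], truths: list[Grid], k: int) -> float:
    """Fraction of tasks solved within k guesses."""
    hits = sum(solved(g, t, k) for g, t in zip(all_guesses, truths))
    return hits / len(truths)

truth = [[1, 0], [0, 1]]
guesses = [[[0, 0], [0, 0]], [[1, 1], [1, 1]], [[1, 0], [0, 1]]]  # 3rd guess right
print(solved(guesses, truth, k=2))  # False under the 2-guess rule
print(solved(guesses, truth, k=3))  # True under the 3-guess rule
```

Since a score can only drop when moving from 3 guesses to 2, numbers reported under the two rules aren’t directly comparable.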
Some additional comments / minor corrections:
“The past SOTA got [we don’t know] on the first, 52% on the second, and 34% on the third.”
AFAICT, the current SOTA-on-the-private-test-set with 3 submissions per problem is 37%, and that solution scores 54% on the public eval set.
The SOTA-on-the-public-eval-set is at least 60% (see thread).
“Apparently, lots of people get worse performance on the public test set than the private one”
I think this is a typo and you mean the opposite.
From looking into this a bit, it seems pretty clear that the public eval set and the private test set are not IID. They’re “intended” to be the “same” difficulty, but AFAICT this essentially just means that they both consist of problems that are feasible for humans to solve.
It’s not the case that a fixed set of eval/test problems were created and then randomly distributed between the public eval set and private test set. At your link, Chollet says “the [private] test set was created last” and the problems in it are “more unique and more diverse” than the public eval set. He confirms that here:
“This is *also* likely in part due to the fact that the eval set contains more ‘easy’ tasks. The eval set and test set were not calibrated for difficulty. So while all tasks across the board are feasible for humans, the tasks in the test set may be harder on average. This was not intentional, and is likely either a fluke (there are only 100 tasks in the test set) or due to the test set having been created last.”
Bottom line: I would expect Ryan’s solution to score significantly lower than 50% on the private test set.
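As a rough check on the not-IID / overfitting hypotheses, one can ask how surprising the prior SOTA’s 52% (public eval) vs. 34% (private test) gap would be if the two sets really were an IID split of one task pool and the solver were equally good at both. A minimal sketch, assuming the commonly cited set sizes (400 public eval tasks and 100 private test tasks; the 100 matches Chollet’s quote above):

```python
# Two-proportion z-test: under H0, both sets are IID draws from one pool, so a
# fixed solver has a single underlying success rate on both. The set sizes
# (400 and 100) are assumptions, not numbers taken from this thread.
import math

def two_prop_ztest(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p) for H0: both samples share one success rate."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p

z, p = two_prop_ztest(k1=208, n1=400, k2=34, n2=100)  # 52% of 400, 34% of 100
print(f"z = {z:.2f}, p = {p:.4f}")  # roughly z = 3.2, p = 0.001
```

A gap that large is hard to attribute to sampling noise alone, which is consistent with the private set being harder on average, the prior method being overfit to the public set, or both.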
From Buck’s tweet thread (@bshlgrs, linked in the comments below):
Last week there was some uncertainty about whether @RyanPGreenblatt’s ARC-AGI solution was really sota, because many other solutions did better on public eval and we didn’t have private test results. There is now a semi-private eval set; he’s at the top of this leaderboard.
Our guess is that Ryan’s technique beats other solutions despite performing worse at the public eval because other solutions are more overfit to public eval. (But we don’t know the performance of MindsAI’s solution (@Jcole75Cole), which is sota on Kaggle, on this eval set.)
This result doesn’t clarify everything, but at least addresses concerns that Ryan’s solution is overfit because of data contamination in the data OpenAI used to pretrain GPT-4o.
Thanks to the ARC team for helping with running Ryan’s submission, and to @Jcole75Cole and @MaxNadeau_ for helpful discussion, and thanks to the community as a whole for being chill during the uncertainty here.
(Expanding on a tweet)
I endorse this comment for the record.
I’m considering editing the blog post to clarify.
If I had known that prior work got a wildly different score on the public test set (comparable to the score I get), I wouldn’t have claimed SOTA.
(That said, as you note, it seems reasonably likely (though unclear) that this prior solution was overfit to the test set while my solution is not.)
I’m submitting to the private leaderboard (with fewer samples than used in this post). If results indicate that SOTA is unlikely, I’ll retract my claim.
I edited the post to add a clarifying note, and changed “this dataset” to “a similarly difficult dataset”.
Thanks, this is a helpful comment. Fixed the typo.
See https://x.com/GregKamradt/status/1806372523170533457 for a somewhat confusing update, and https://x.com/bshlgrs/status/1806397587085468116 for some discussion.
Thanks, edited my post to reference this (lmk if you understand what’s happening here better than I do).