I agree that there is a good chance that this solution is not actually SOTA, and that it is important to distinguish the three sets.
There’s a further distinction between 3 guesses per problem (which is allowed according to the original specification as Ryan notes), and 2 guesses per problem (which is currently what the leaderboard tracks [rules]).
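For concreteness, here is a minimal sketch of how a top-k guessing rule scores (function and variable names are mine, not from any official ARC harness): a problem counts as solved if any of the first k guesses matches the target grid exactly, so the same predictions can score differently under k=2 vs. k=3.

```python
from typing import List

Grid = List[List[int]]  # an ARC output grid, as a 2-D array of color indices

def solved_top_k(guesses: List[Grid], target: Grid, k: int) -> bool:
    """A problem counts as solved if any of the first k guesses matches the target exactly."""
    return any(guess == target for guess in guesses[:k])

def score(predictions: List[List[Grid]], targets: List[Grid], k: int) -> float:
    """Fraction of problems solved under the top-k rule."""
    hits = sum(solved_top_k(guesses, target, k)
               for guesses, target in zip(predictions, targets))
    return hits / len(targets)
```

The same set of predictions can produce a lower score under k=2 than under k=3, which is why headline numbers computed under different rules aren’t directly comparable.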
Some additional comments / minor corrections:
The past SOTA got [we don’t know] on the first, 52% on the second, and 34% on the third.
AFAICT, the current SOTA-on-the-private-test-set with 3 submissions per problem is 37%, and that solution scores 54% on the public eval set.
The SOTA-on-the-public-eval-set is at least 60% (see thread).
> Apparently, lots of people get worse performance on the public test set than the private one

I think this is a typo and you mean the opposite.
From looking into this a bit, it seems pretty clear that the public eval set and the private test set are not IID. They’re “intended” to be the “same” difficulty, but AFAICT this essentially just means that they both consist of problems that are feasible for humans to solve.
It’s not the case that a fixed set of eval/test problems were created and then randomly distributed between the public eval set and private test set. At your link, Chollet says “the [private] test set was created last” and the problems in it are “more unique and more diverse” than the public eval set. He confirms that here:
> This is *also* likely in part due to the fact that the eval set contains more “easy” tasks. The eval set and test set were not calibrated for difficulty. So while all tasks across the board are feasible for humans, the tasks in the test set may be harder on average. This was not intentional, and is likely either a fluke (there are only 100 tasks in the test set) or due to the test set having been created last.
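To illustrate the distinction (a hypothetical sketch; the task names are invented): under an i.i.d. split, one pool of problems is created and then randomly divided, so any drift in the authoring process is spread evenly across both sets. Authoring the sets in separate batches instead turns that drift into a systematic difference between them.

```python
import random

# (a) An i.i.d. split: author one pool of tasks, shuffle, then divide.
pool = [f"task_{i:03d}" for i in range(200)]
random.shuffle(pool)
public_eval, private_test = pool[:100], pool[100:]

# (b) What apparently happened: the private test set was authored later as a
# separate batch, so drift in the task-writing process (e.g. later tasks being
# "more unique and more diverse") becomes a systematic difference between sets.
public_eval = [f"batch1_task_{i:03d}" for i in range(100)]
private_test = [f"batch2_task_{i:03d}" for i in range(100)]
```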
Bottom line: I would expect Ryan’s solution to score significantly lower than 50% on the private test set.
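As a rough back-of-envelope (this assumes the public-to-private drop is multiplicative and transfers across very different solutions, which is a strong assumption):

```python
# Figures from above: the prior SOTA solution scored 54% on the public eval
# set and 37% on the private test set.
transfer_ratio = 37 / 54                      # ~0.69

# Illustrative input (an assumption, not a reported figure): a solution
# scoring about 50% on the public eval set.
public_eval_score = 50.0
estimated_private = transfer_ratio * public_eval_score
print(f"~{estimated_private:.0f}% on the private test set")  # ~34%
```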
Thanks, this is a helpful comment. Fixed the typo.