We do not in fact have strong evidence for this. There is no established human baseline for the ARC puzzles, smart humans or otherwise, just a claim that two people the designers asked to attempt them were able to solve them all. It seems entirely plausible to me that the best score on that leaderboard is pretty close to the human median.
Edit: I failed to mention that there is a baseline on the test set, which is different from the eval set used for the scoreboard and is, I believe, significantly easier.
Their website cites https://cims.nyu.edu/~brenden/papers/JohnsonEtAl2021CogSci.pdf as having found an average 84% success rate on the tested subset of puzzles.
It is worth noting that LLM-based approaches can perform reasonably well on the train set. For instance, my approach gets 72%.
The LLM-based approach works quite differently from how a human would normally solve the problem, and if you give LLMs “only one attempt”, or otherwise limit them to a qualitatively similar amount of reasoning as humans, I think they do considerably worse than humans. (Though to make this “only one attempt” baseline fair, you have to allow for the iteration that humans would normally do.)
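To make the “only one attempt” comparison concrete: one (hedged) way to score it is the standard pass@k estimator from Chen et al. 2021, applied to however many candidate solutions the LLM samples per puzzle. This is just an illustrative sketch with made-up sample counts, not the exact scoring my approach uses:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k attempts, drawn without replacement from n sampled
    attempts of which c are correct, solves the puzzle."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 64 sampled candidate programs per puzzle, 5 correct.
print(pass_at_k(n=64, c=5, k=1))   # "only one attempt" score, roughly 0.08
print(pass_at_k(n=64, c=5, k=64))  # score when every sample can be checked: 1.0
```

The gap between those two numbers is the thing I’m pointing at: restricting the model to one attempt, without analogously restricting the human, changes the comparison a lot.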
Yeah, I failed to mention this. Edited to clarify what I meant.
Thanks for finding a cite. I’ve definitely seen Chollet (on Twitter) give 85% as the success rate on the (easier) training set (and the paper picks problems from the training set as well).
There is important context here.
I also think this is plausible. Note that randomly selected examples from the public evaluation set are often considerably harder than the train set, for which there is a known MTurk baseline (an average of 84%).