Their website cites https://cims.nyu.edu/~brenden/papers/JohnsonEtAl2021CogSci.pdf as having found an average 84% success rate on the tested subset of puzzles.
It is worth noting that LLM-based approaches can perform reasonably well on the train set. For instance, my approach gets 72%.
The LLM-based approach works quite differently from how a human would normally solve the problem, and if you give LLMs “only one attempt”, or otherwise limit them to a qualitatively similar amount of reasoning as humans get, I think they do considerably worse than humans. (Though to make this “only one attempt” baseline fair, you’d have to allow for the iteration that humans would normally do.)
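To make the one-attempt-vs-many-attempts gap concrete, here is a minimal sketch using the standard unbiased pass@k estimator (Chen et al., 2021). The sample counts are made up for illustration, not numbers from my runs:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k attempts drawn from n total samples
    (c of which are correct) solves the puzzle."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than attempts: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical puzzle where 12 of 128 sampled solutions are correct:
print(pass_at_k(n=128, c=12, k=1))   # ~0.094 (one attempt)
print(pass_at_k(n=128, c=12, k=64))  # ~0.9999 (many attempts)
```

The point of the toy numbers: a per-sample success rate under 10% still yields near-certain success given enough attempts, which is why a single-attempt baseline and a many-attempt pipeline aren’t directly comparable.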
Yeah, I failed to mention this. Edited to clarify what I meant.
Thanks for finding a cite. I’ve definitely seen Chollet (on Twitter) give 85% as the success rate on the (easier) training set (and the paper picks problems from the training set as well).