All LLMs to date fail rather badly at classic problems of rearranging colored blocks.
It’s pretty unclear to me that the LLMs do much worse than humans at this task.
They establish the human baseline by picking one problem at random out of 600 and evaluating 50 humans on it. (Why only one problem!? It would be vastly more meaningful to check 5 problems with 10 humans assigned to each!) 78% of humans succeed.
(Human participants are from Prolific.)
On randomly selected problems, GPT-4 gets 35% right, and I’d bet this improves with better prompting.
So, GPT-4 is maybe 2x worse than humans, with huge error bars (and with what seems to be rapid improvement with scale).
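To put rough numbers on the error bars, here’s a minimal sketch computing 95% Wilson score intervals. The 39/50 is the stated human baseline; treating GPT-4’s 35% as 210/600 is my assumption that it was scored on the full problem set.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion (z=1.96 -> ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    halfwidth = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - halfwidth, center + halfwidth

# Human baseline: 78% of 50 Prolific participants = 39/50 successes.
print(wilson_interval(39, 50))    # ~ (0.65, 0.87)

# GPT-4: 35% right; n = 600 is an assumption (the full problem set).
print(wilson_interval(210, 600))  # ~ (0.31, 0.39)
```

Note these intervals only capture sampling noise over participants/attempts; since the humans all saw a single problem, problem-to-problem variance isn’t reflected at all, which is the bigger worry.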
Modulo the questions around obfuscation (which you raise in your other comment), I agree. Kambhampati emphasizes that they’re still much worse than humans. In my view the performance of the next gen of LLMs will tell us a lot about whether to take his arguments seriously.