Nicky Case points me to ‘Emergent Analogical Reasoning in Large Language Models’, from Webb et al, late 2022, which claims that GPT-3 does better than humans on a version of Raven’s Standard Progressive Matrices, often considered one of the best measures of non-verbal fluid intelligence. I somewhat roll to disbelieve, because that would seem to conflict with eg LLMs’ poor performance on ARC-AGI. There’s been some debate about the paper:
Response: Emergent analogical reasoning in large language models (08/23)
Evidence from counterfactual tasks supports emergent analogical reasoning in large language models (04/24)
(different thread of disagreement)
On Analogy-Making in Large Language Models (Melanie Mitchell, 01/23)
Response to “On Analogy-Making in Large Language Models” (03/23)
I find Mitchell’s point pretty strong here:
This is an impressive result, showing the zero-shot ability of GPT-3 to recognize patterns in its input (though it did have trouble with some of the patterns—more on its pattern-abstraction troubles in the section on letter-string analogies). However, I disagree with the authors that this task has “comparable problem structure and complexity as Raven’s Progressive Matrices”. The translation from a visual RPM problem into a digit matrix requires segmenting the figure into different objects, disentangling the attributes, and including only those objects and attributes that are relevant to solving the problem. That is, the translation itself (or the automated creation of digit matrices as done by the authors) does a lot of the hard work for the machine. In short, solving digit matrices does not equate to solving Raven’s problems.
The authors found that humans performed about as well on the digit matrix versions as on the original problems. But that is because humans are generally extremely good at the visual and cognitive processes of segmenting the figures, disentangling the attributes, and identifying the relevant attributes. These are abilities that are often the hardest for machines (see the “easy things are easy” fallacy I wrote about in 2021). Thus while the difficulty of RPM problems and digit matrices might be similar for humans, I don’t believe they are equally similar for machines.
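To make Mitchell’s point concrete, here is a toy sketch (my own construction, not an item from Webb et al) of what a Raven’s-style problem looks like once translated into a digit matrix: each cell arrives as a clean list of attribute values, so the perceptual work of segmenting objects and picking out the relevant attributes has already been done for the model.

```python
# Toy digit-matrix problem (my illustration, not from the paper): a 3x3
# matrix whose rows follow a constant-step progression; the final cell
# is blank and must be completed. Note how little is left to perceive:
# each cell is already a tidy list of the relevant attribute values.
problem = [
    [[1], [2], [3]],
    [[4], [5], [6]],
    [[7], [8], None],  # blank cell to fill in
]

def complete(matrix):
    """Infer the constant step from the first row and extend the last row."""
    step = matrix[0][1][0] - matrix[0][0][0]
    return [matrix[2][1][0] + step]

print(complete(problem))  # -> [9]
```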
I found the arguments in ‘Response: Emergent analogical reasoning in large language models’ somewhat weaker on the whole; in particular, I think rearranging the alphabet on the fly (section 7.1 & appendix 7.1) is fundamentally hard for LLMs to deal with, and so doesn’t cleanly measure general reasoning. Their argument that some of this material may be in the training data does seem reasonable to me.
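For context on what “rearranging the alphabet” means here, this is a rough sketch of the kind of counterfactual letter-string analogy at issue (my own toy construction with an arbitrary shuffle, not an item from either paper): the model is handed a permuted alphabet and has to compute “successor” under that new ordering, rather than recalling the a-to-z order it has seen throughout training.

```python
import random

# My toy construction (not an item from either paper): a Hofstadter-style
# letter-string analogy posed over a shuffled alphabet. The solver must
# apply "replace the last letter with its successor" under the new ordering.
random.seed(0)
permuted = list("abcdefghijklmnopqrstuvwxyz")
random.shuffle(permuted)

def successor(ch):
    """Next letter in the permuted ordering (wrapping around at the end)."""
    i = permuted.index(ch)
    return permuted[(i + 1) % len(permuted)]

source = permuted[:3]                         # analogue of "a b c"
target = source[:2] + [successor(source[2])]  # analogue of "a b d"
print("alphabet:", " ".join(permuted))
print(" ".join(source), "->", " ".join(target))
```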
Overall I’m left somewhat skeptical of the claims from the Webb et al paper, but it’s at least a bit of possible evidence on general reasoning.