How should I modify the problems I gave it? What would be the least impressive test which would convince you it is reasoning, and not memorizing? (Preferably something that doesn’t rely on, e.g., rhyming, since GPT-3 uses an obfuscating input encoding.)

I know there are benchmarks for NL reasoning, but I’m not re-finding them so easily... This looks like one: https://github.com/facebookresearch/clutrr/
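For a sense of what that benchmark tests: CLUTRR generates short stories about family relations and asks for the relation between two people, which you can only get by chaining the stated facts together. Here is a toy generator in that spirit (a sketch only, not CLUTRR’s actual code or data format; the names and chain length are made up):

```python
import random

# Toy CLUTRR-style generator (a sketch in the spirit of the benchmark, not
# its actual code or data format): a chain of stated kinship facts whose
# composition, not any single fact, answers the question.

NAMES = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank", "Grace", "Heidi"]

def make_puzzle(chain_length=3, rng=random):
    people = rng.sample(NAMES, chain_length + 1)
    facts = [f"{people[i]} is the parent of {people[i + 1]}."
             for i in range(chain_length)]
    rng.shuffle(facts)  # shuffle so the chain must be reassembled, not read off
    question = f"What relation is {people[0]} to {people[-1]}?"
    # Composing k parent links gives parent, grandparent, great-grandparent, ...
    answer = "parent" if chain_length == 1 else "great-" * (chain_length - 2) + "grandparent"
    return " ".join(facts) + " " + question, answer

story, expected = make_puzzle(chain_length=4)
print(story)
print("expected:", expected)
```

The point is that no single sentence in the story contains the answer; chain_length controls how many facts have to be composed.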
Anyway, my main issue is that you’re not defining what you mean by reasoning, even informally. What’s the difference between reasoning vs mere interpolation/extrapolation? A stab at a definition would make it a lot easier to differentiate.
One stab might be some kind of “semantic sensitivity”:
Some inputs are close in terms of edit distance, but very different semantically. One clue that a system can reason is that it can correctly respond to these small variations and explain the difference.
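To make this concrete, here is one way you could operationalize the test (a rough sketch; the prompt pairs and the ask_model callable are illustrative placeholders, not anything anyone here actually ran): keep pairs of prompts that are a few characters apart but have different correct answers, and check whether the model’s answers actually differ.

```python
# Minimal-pair probe for "semantic sensitivity" (a sketch; the prompt pairs
# and the ask_model callable are illustrative placeholders).

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, to confirm the pair really is 'close'."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Each pair: two near-identical prompts whose correct answers differ.
MINIMAL_PAIRS = [
    ("You put the bullet in the freezer. Is it safe to fire it from a gun now?",
     "You put the bullet in the furnace. Is it safe to fire it from a gun now?"),
    ("Alice is taller than Bob. Who is shorter?",
     "Alice is taller than Bob. Who is taller?"),
]

def probe(ask_model):
    """ask_model: any callable str -> str, e.g. a wrapper around an LM API."""
    for p1, p2 in MINIMAL_PAIRS:
        d = edit_distance(p1, p2)
        a1, a2 = ask_model(p1), ask_model(p2)
        flipped = a1.strip().lower() != a2.strip().lower()
        print(f"edit distance {d:>2} | answers differ: {flipped}")

# Example with a trivial stand-in "model" that ignores semantics entirely:
probe(lambda prompt: "yes")
```

A model that only pattern-matches on surface form should give near-identical answers to both members of a pair; a model tracking the semantics should flip its answer along with the prompt.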
This is part of why I tested similar situations with the bullet—I wanted to see whether small changes to the words would provoke a substantively different response.
I think another part of this is “sequential processing steps required”—you couldn’t just look up a fact or a definition somewhere to get the correct response.
This is still woefully incomplete, but hopefully this helps a bit.
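Concretely, here is a sketch of the “sequential processing steps required” idea (the operations and parameters are made up for illustration): the answer depends on a running value carried through several steps, so nothing you could look up supplies it directly.

```python
import random

# "Sequential processing steps required" made concrete (a sketch with
# made-up operations): a running total must be tracked through k steps,
# so no single memorized fact or definition yields the answer.

OPS = [
    ("add {n} to it", lambda x, n: x + n),
    ("subtract {n} from it", lambda x, n: x - n),
    ("double it, then add {n}", lambda x, n: 2 * x + n),
]

def make_problem(steps=4, rng=random):
    value = rng.randint(2, 9)
    lines = [f"Start with the number {value}."]
    for _ in range(steps):
        template, fn = rng.choice(OPS)
        n = rng.randint(1, 9)
        lines.append(template.format(n=n).capitalize() + ".")
        value = fn(value, n)  # track the ground-truth running value
    lines.append("What number do you have now?")
    return " ".join(lines), value

prompt, answer = make_problem(steps=5)
print(prompt)
print("expected:", answer)
```

Plotting accuracy against steps would show whether performance degrades as more sequential composition is required, which is the interesting curve for telling reasoning apart from lookup.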
I like the second suggestion a lot more than the first. To me, the first is getting more at “Does GPT convert to a semantic representation, or just go based off of syntax?” I already strongly suspect it does something more meaningful than “just syntax”—but whether it then reasons about it is another matter.