Are there bits of evidence against general reasoning ability in GPT-3? Any answers it gives that it would obviously not give if it had a shred of general reasoning ability?
In the post I gestured towards the first test I would want to do here—compare its performance on arithmetic to its performance on various “fake arithmetics” (a rough sketch follows the list below). If #2 is the mechanism for its arithmetic performance, then I’d expect fake arithmetic performance which
- is roughly comparable to real arithmetic performance (perhaps a bit worse, but not vastly so)
- is at least far above random guessing
- correlates more closely with the compressibility / complexity of the formal system than with its closeness to real arithmetic
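Here is a minimal sketch of the kind of comparison I have in mind, in Python. Everything in it is made up for illustration: `query_model` is a placeholder for however you actually sample a completion from the model, and the particular fake rules are arbitrary; the real version would use a larger family of systems chosen to span a range of complexities.

```python
import random

# Arbitrary operations invented for illustration, spanning a range of rule complexity.
FAKE_OPS = {
    "real_addition":     lambda a, b: a + b,            # baseline: ordinary arithmetic
    "sum_plus_one":      lambda a, b: a + b + 1,        # a near-miss of real addition
    "mod_10_sum":        lambda a, b: (a + b) % 10,     # simple, but clearly not addition
    "product_minus_sum": lambda a, b: a * b - (a + b),  # a more complex fake rule
}

def few_shot_prompt(op, n_examples=8, symbol="@"):
    """Worked examples of the operation, followed by one held-out query."""
    pairs = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(n_examples + 1)]
    lines = [f"{a} {symbol} {b} = {op(a, b)}" for a, b in pairs[:-1]]
    qa, qb = pairs[-1]
    lines.append(f"{qa} {symbol} {qb} =")
    return "\n".join(lines), str(op(qa, qb))

def accuracy(op, n_trials=200):
    correct = 0
    for _ in range(n_trials):
        prompt, answer = few_shot_prompt(op)
        completion = query_model(prompt)  # placeholder: sample a completion however you like
        correct += completion.strip().startswith(answer)
    return correct / n_trials
```

Comparing accuracy across these systems, against real arithmetic and against chance, is what would let you check the three expectations above.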
BTW, I want to reiterate that #2 is about non-linguistic general reasoning, the ability to few-shot learn generic formal systems with no relation to English. So the analogies and novel words results seem irrelevant here, although word scramble results may be relevant, as dmtea says.
----
There’s something else I keep wanting to say, because it’s had a large influence on my beliefs, but is hard to phrase in an objective-sounding way . . . I’ve had a lot of experience with GPT-2:
I was playing around with fine-tuning soon after 117M was released, and jumped to each of the three larger versions shortly after its release. I have done fine-tuning with at least 11 different text corpora I prepared myself.
All this energy for GPT-2 hobby work eventually converged into my tumblr bot, which uses a fine-tuned 1.5B with domain-specific encoding choices and a custom sampling strategy (“middle-p”), and generates 10-20 candidate samples per post, which are then scored by a separate BERT model trained to predict user engagement, with a sentiment model to constrain tone. It’s made over 5000 posts so far and continues to make 15+ per day.
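For concreteness, the overall shape of the bot is roughly the pipeline sketched below. This is only a structural sketch: the model names are public stand-ins (the fine-tuned 1.5B weights and the engagement / sentiment scorers aren’t something you can download), and I’ve substituted ordinary top-p sampling for “middle-p”.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# Stand-ins: "gpt2-xl" for the fine-tuned 1.5B, generic classifiers for the
# engagement and sentiment scorers.
tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
generator = GPT2LMHeadModel.from_pretrained("gpt2-xl")
engagement = pipeline("text-classification", model="bert-base-uncased")
sentiment = pipeline("sentiment-analysis")

def make_post(prompt, n_candidates=15):
    inputs = tok(prompt, return_tensors="pt")
    # Ordinary nucleus (top-p) sampling here, standing in for "middle-p".
    outputs = generator.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=200,
        num_return_sequences=n_candidates,
        pad_token_id=tok.eos_token_id,
    )
    # Each decoded candidate includes the prompt text.
    candidates = [tok.decode(o, skip_special_tokens=True) for o in outputs]

    # Filter candidates by tone, then keep the one the engagement scorer likes best.
    acceptable = [c for c in candidates if sentiment(c[:512])[0]["label"] != "NEGATIVE"]
    pool = acceptable or candidates
    return max(pool, key=lambda c: engagement(c[:512])[0]["score"])
```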
So, I think I have a certain intimate familiarity with GPT-2 -- what it “feels like” across the 4 released sizes and across numerous fine-tuning / sampling / etc. strategies on many corpora -- that can’t be acquired just by reading papers. And I think this makes me less impressed with arithmetic and other synthetic results than some people.
I regularly see my own GPT-2s do all sorts of cool tricks somewhat similar to these (in fact the biggest surprise here is how far you have to scale to get few-shot arithmetic!), and yet there are also difficult-to-summarize patterns of failure and ignorance which are remarkably resistant to scaling across the 117M-to-1.5B range. (Indeed, the qualitative difference across that range is far smaller than I had expected when only 117M was out.) GPT-2 feels like a very familiar “character” to me by now, and I saw that “character” persist across the staged release without qualitative jumps. I’m still waiting for evidence that convinces me 175B is a new “character” and not my old, goofy friend with another lovely makeover.
Thanks for the thoughtful reply. I definitely acknowledge and appreciate your experience. I agree the test you proposed would be worth doing and would provide some evidence. I think it would have to be designed carefully so that the model knows it is doing fake arithmetic rather than ordinary arithmetic. Maybe the prompt could be something like: “Consider the following made-up mathematical operation ‘@’: 3@7 = 8, 4@4 = 3, … [more examples]. What does 2@7 equal? Answer: 2@7 equals”. I also think that we shouldn’t expect GPT-3 to be able to do general formal reasoning at a level higher than, say, a fifth-grade human. After all, it’s been trained on a similar dataset (English, mostly non-math, but with a bit of ordinary arithmetic).
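To make that concrete, here’s one way such a prompt string could be built. The rule I use for “@” below (a@b = (a*b) mod 10) is just an arbitrary stand-in and isn’t meant to match the example values in my comment above.

```python
# Arbitrary stand-in rule for "@"; any formal rule the model hasn't seen would do.
def at_op(a, b):
    return (a * b) % 10

examples = [(3, 7), (4, 4), (2, 5), (6, 3), (8, 2), (5, 9)]
worked = ", ".join(f"{a}@{b} = {at_op(a, b)}" for a, b in examples)

# The framing explicitly flags the operation as made-up, so the model shouldn't
# just fall back on ordinary arithmetic.
prompt = (
    'Consider the following made-up mathematical operation "@". '
    f"{worked}. What does 2@7 equal? Answer: 2@7 equals"
)
print(prompt)
```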
Are you saying that GPT-3 has learned linguistic general reasoning but not non-linguistic general reasoning? I’m not sure there’s an important distinction there.
It doesn’t surprise me that you need to scale up a bunch to get this sort of stuff. After all, we are still well below the size of the human brain.
Side question about experience: Surveys seem to show that older AI scientists, who have been working in the field for longer, tend to think AGI is farther in the future—median 100+ years for scientists with 20+ years of experience, if I recall correctly. Do you think that this phenomenon represents a bias on the part of older scientists, younger scientists, both, or neither?
Also note that a significant number of humans would fail the kind of test you described (inducing the behavior of a novel mathematical operation from a relatively small number of examples), which is why similar tests of inductive reasoning show up quite often on IQ tests and the like. Failing at that kind of test doesn’t show a lack of general reasoning skills, unless we grant that a substantial fraction of humans lack general reasoning skills to at least some extent.