I’m very interested in this particular issue, so I would love to see the two of you say more about this. For my part, (1) as abergal said elsewhere in these comments, maybe there is some mundane explanation for the mathematical success (and math errors) besides assumption #2.
Assumption #2 does seem like a fairly parsimonious explanation of the data, no? It would explain the math success, the SAT question success, the “wug test” success, etc. (Whereas “it’s just regurgitating the memorized dataset” does not, and “OK so it’s learned some basic math and some basic word-smithing, but still doesn’t have a general reasoning ability” is unparsimonious.)
I think this comes down to priors: How hard do you think it is to develop general reasoning ability?
I think we tend to think of it as the hardest thing ever because that’s what you need to get AGI and AGI is far away. But this is a bias; Nature didn’t sit down and make a tech tree with increasingly useful stuff being increasingly difficult to get, and general reasoning therefore being the hardest to get because it is so useful. I think our prior for the difficulty of general reasoning ability should be similar to our prior for the difficulty of image recognition, or fine motor control, or chess playing, or Lord of the Rings fanfiction writing. Given that the evidence shows the GPT-3 architecture succeeds at two of the four examples above and hasn’t been applied to the other two, well, it shouldn’t be that crazy to think it might succeed at general reasoning too.
Are there bits of evidence against general reasoning ability in GPT-3? Any answers it gives that it would obviously not give if it had a shred of general reasoning ability?
In the post I gestured towards the first test I would want to do here—compare its performance on arithmetic to its performance on various “fake arithmetics” (a rough sketch of this comparison follows the list below). If #2 is the mechanism for its arithmetic performance, then I’d expect fake arithmetic performance which
is roughly comparable to real arithmetic performance (perhaps a bit worse but not vastly so)
is at least far above random guessing
more closely correlates with the compressibility / complexity of the formal system than with its closeness to real arithmetic
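To make that comparison concrete, here is a minimal sketch of such an evaluation, assuming a hypothetical complete(prompt) helper that returns the model’s continuation for a prompt (the helper, the choice of fake operation, and the prompt format are all placeholders, not anything from the paper):

```python
import random

def fake_op(a, b):
    # A made-up "fake arithmetic": same surface form as addition,
    # but a different (still simple) underlying rule.
    return (a + b + 3) % 100

def make_prompt(op, symbol, n_examples=8):
    """Build a few-shot prompt of example equations using `op`, plus one query."""
    lines = []
    for _ in range(n_examples):
        a, b = random.randint(0, 99), random.randint(0, 99)
        lines.append(f"{a} {symbol} {b} = {op(a, b)}")
    a, b = random.randint(0, 99), random.randint(0, 99)
    lines.append(f"{a} {symbol} {b} =")
    return "\n".join(lines), op(a, b)

def accuracy(op, symbol, n_trials=200):
    correct = 0
    for _ in range(n_trials):
        prompt, answer = make_prompt(op, symbol)
        # `complete` is a hypothetical stand-in for whatever interface exposes the model.
        if complete(prompt).strip() == str(answer):
            correct += 1
    return correct / n_trials

print("real addition:  ", accuracy(lambda a, b: a + b, "+"))
print("fake arithmetic:", accuracy(fake_op, "@"))
```

Running the same loop over several fake operations of varying complexity would also give the correlation described in the last item above.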
BTW, I want to reiterate that #2 is about non-linguistic general reasoning, the ability to few-shot learn generic formal systems with no relation to English. So the analogies and novel words results seem irrelevant here, although word scramble results may be relevant, as dmtea says.
----
There’s something else I keep wanting to say, because it’s had a large influence on my beliefs, but is hard to phrase in an objective-sounding way . . . I’ve had a lot of experience with GPT-2:
I was playing around with fine-tuning soon after 117M was released, and jumped to each of the three larger versions shortly after its release. I have done fine-tuning with at least 11 different text corpora I prepared myself.
All this energy for GPT-2 hobby work eventually converged into my tumblr bot, which uses a fine-tuned 1.5B with domain-specific encoding choices and a custom sampling strategy (“middle-p”), and generates 10-20 candidate samples per post which are then scored by a separate BERT model optimizing for user engagement and a sentiment model to constrain tone. It’s made over 5000 posts so far and continues to make 15+ per day.
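For readers who haven’t built something like this, here is a minimal sketch of the general generate-then-rerank pattern described above. It uses the small stock GPT-2, plain nucleus sampling, and an off-the-shelf sentiment pipeline purely as stand-ins; the bot’s actual fine-tuned 1.5B, “middle-p” sampling, and engagement/sentiment scorers are not reproduced here.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, pipeline

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
generator = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in for a fine-tuned 1.5B
scorer = pipeline("sentiment-analysis")              # stand-in for the bespoke scoring models

prompt = "Dear bot, what do you think about scaling laws?"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a batch of candidate continuations (the bot uses 10-20 per post).
outputs = generator.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,                 # ordinary nucleus sampling, not "middle-p"
    max_length=120,
    num_return_sequences=10,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Rerank the candidates with a separate scoring model and keep the best one.
scores = [s["score"] if s["label"] == "POSITIVE" else -s["score"]
          for s in scorer(candidates, truncation=True)]
best = candidates[max(range(len(candidates)), key=scores.__getitem__)]
print(best)
```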
So, I think I have a certain intimate familiarity with GPT-2—what it “feels like” across the 4 released sizes and across numerous fine-tuning / sampling / etc. strategies on many corpora—that can’t be acquired just by reading papers. And I think this makes me less impressed with arithmetic and other synthetic results than some people.
I regularly see my own GPT-2s do all sorts of cool tricks somewhat similar to these (in fact the biggest surprise here is how far you have to scale to get few-shot arithmetic!), and yet there are also difficult-to-summarize patterns of failure and ignorance which are remarkably resistant to scaling across the 117M-to-1.5B range. (Indeed, the qualitative difference across that range is far smaller than I had expected when only 117M was out.) GPT-2 feels like a very familiar “character” to me by now, and I saw that “character” persist across the staged release without qualitative jumps. I still wait for evidence that convinces me 175B is a new “character” and not my old, goofy friend with another lovely makeover.
Thanks for the thoughtful reply. I definitely acknowledge and appreciate your experience. I agree the test you proposed would be worth doing and would provide some evidence. I think it would have to be designed carefully so that the model knows it is doing fake arithmetic rather than ordinary arithmetic. Maybe the prompt could be something like: “Consider the following made-up mathematical operation ‘@’: 3@7 = 8, 4@4 = 3, … [more examples]. What does 2@7 equal? Answer: 2@7 equals ” I also think that we shouldn’t expect GPT-3 to be able to do general formal reasoning at a level higher than, say, a fifth-grade human. After all, it’s been trained on a similar dataset (English, mostly non-math, but with a bit of ordinary arithmetic).
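As a concrete illustration of that careful framing, the prompt could be assembled along these lines; the two example results are just the illustrative placeholders from the comment above, and a real test would use many examples drawn from one consistent made-up rule:

```python
# Hypothetical example values; a real test would generate many examples
# from a single consistent made-up rule.
examples = [((3, 7), 8), ((4, 4), 3)]
query = (2, 7)

prompt = 'Consider the following made-up mathematical operation "@".\n'
prompt += "".join(f"{a}@{b} = {r}\n" for (a, b), r in examples)
prompt += f"What does {query[0]}@{query[1]} equal?\nAnswer: {query[0]}@{query[1]} equals"
print(prompt)
```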
Are you saying that GPT-3 has learned linguistic general reasoning but not non-linguistic general reasoning? I’m not sure there’s an important distinction there.
It doesn’t surprise me that you need to scale up a bunch to get this sort of stuff. After all, we are still well below the size of the human brain.
Side question about experience: Surveys seem to show that older AI scientists, who have been working in the field for longer, tend to think AGI is farther in the future—median 100+ years for scientists with 20+ years of experience, if I recall correctly. Do you think that this phenomenon represents a bias on the part of older scientists, younger scientists, both, or neither?
Also note that a significant number of humans would fail the kind of test you described (inducing the behavior of a novel mathematical operation from a relatively small number of examples), which is why similar tests of inductive reasoning ability show up quite often on IQ tests and the like. It’s not the case that failing at that kind of test shows a lack of general reasoning skills, unless we permit that a substantial fraction of humans lack general reasoning skills to at least some extent.
The big jump in performance between the zero-shot and few-shot settings in arithmetic and other non-linguistic reasoning tasks [esp. 3D- & 3D+] is why I think it is almost certain #2 is true. Few-shot inference involves no further training [unlike fine-tuning], so the improvement in ‘pattern recognition’, so to speak, is happening entirely at inference. It follows that the underlying model has general reasoning abilities, i.e. the ability to detect and repeat arbitrary patterns of ever-increasing complexity that occur in its input (conditioning) data.
Interestingly, the model fails to fully learn 4D and 5D arithmetic, where its zero-shot scores were really low; however, few-shot inference does show improvement. I wonder whether problems of increasing complexity can also be solved by using increasing numbers of few-shot examples (say k=500 for 4D+), though of course this will run into the roadblock of context size very soon.
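A minimal sketch of that kind of k-sweep, again assuming a hypothetical complete(prompt) helper that queries the model, and using a crude token estimate to show where a 2048-token context window becomes the binding constraint:

```python
import random

def addition_prompt(digits, k):
    """k in-context examples of `digits`-digit addition, plus one query."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    lines = []
    for _ in range(k):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        lines.append(f"Q: What is {a} plus {b}? A: {a + b}")
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    lines.append(f"Q: What is {a} plus {b}? A:")
    return "\n".join(lines), a + b

for k in (10, 50, 100, 500):
    prompt, answer = addition_prompt(5, k)
    approx_tokens = len(prompt) // 4  # rough rule of thumb: ~4 characters per token
    if approx_tokens > 2048:
        print(f"k={k}: ~{approx_tokens} tokens, exceeds a 2048-token context window")
        continue
    # `complete` is a hypothetical stand-in for querying the model.
    print(f"k={k}: correct = {complete(prompt).strip() == str(answer)}")
```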
If an increasing number of few-shot examples allows it to correctly solve ever-harder problems, there is a strong case for scaling the Reformer, with its context window of 1 million tokens, to a GPT-3-like size.
It would be fascinating to probe how much of the general reasoning capability arises from the size of the transformer itself, and how much arises from training on a large volume of language data. Does language training implicitly impart the tools for all human symbolic reasoning?
A test anybody with 1024 GPUs for even a few minutes can perform is to load an untrained GPT-3-size model, train it for a few steps on a few hundred 3D, 4D, and 5D calculations, and then test its inference. It would help show whether these skills can be learnt absent a basis in language. It parallels a question about humans—can a human learn math without first learning language?
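At toy scale, that test might look something like the sketch below: a randomly initialized GPT-2-architecture model (standing in for an untrained GPT-3-size one) is trained for a few steps on arithmetic strings only, then probed at inference. The model size, step count, and character-level tokenization are all placeholder choices, not a claim about how the real experiment should be run.

```python
import random
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Character-level "tokenizer" over digits and operators only, so no language data is involved.
vocab = list("0123456789+=\n")
stoi = {c: i for i, c in enumerate(vocab)}
encode = lambda s: torch.tensor([[stoi[c] for c in s]])

def sample_problem(digits):
    a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    return f"{a}+{b}={a + b}\n"

# Tiny randomly initialized GPT-2-style model, a stand-in for "GPT-3 size".
config = GPT2Config(vocab_size=len(vocab), n_positions=64, n_embd=128, n_layer=4, n_head=4)
model = GPT2LMHeadModel(config)
optim = torch.optim.AdamW(model.parameters(), lr=3e-4)

# "Train it for a few steps on a few hundred 3D, 4D, and 5D calculations."
for step in range(100):
    batch = "".join(sample_problem(random.choice([3, 4, 5])) for _ in range(4))
    ids = encode(batch[:64])          # crude truncation; a real run would batch properly
    loss = model(ids, labels=ids).loss
    loss.backward()
    optim.step()
    optim.zero_grad()

# Probe at inference: can it complete an unseen calculation?
out = model.generate(encode("123+456="), max_length=20, do_sample=False, pad_token_id=0)
print("".join(vocab[int(i)] for i in out[0]))
```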
A success would indicate that general reasoning is an innate attribute of large transformers themselves; a failure, however, would not falsify general reasoning: it would instead imply that any general reasoning originates in language learning—which could explain why pre-trained models can perform arithmetic while untrained models can’t.
[Note: my use of “trained” and “untrained” refers to pre-training on CommonCrawl.]
GPT-3 made me update my prior for “scaling current techniques will get us to superintelligence” from probably not (<30%) to likely (>60%). The phase shifts in many tasks mentioned by dxu, and its ability to perform non-linguistic reasoning at inference, are the reasons for this shift. I had tried a number of ways to make GPT-2 perform basic arithmetic and always failed, which was responsible for my earlier prior.
My updated models predict that a model 1 to 2 orders of magnitude bigger will almost certainly be able to utilise calculus, trigonometry, and derivations in a human-like way to reach conclusions, given a few examples.
Essentially, I see no evidence against the proposition that language, math, and abstract reasoning are points along the same continuum—and this paper provides strong evidence that they are, with the difference being only one of complexity.
What makes you conclude this?