The big jump in performance between the zero-shot and few-shot settings on arithmetic and other non-linguistic reasoning tasks [esp. 3D- & 3D+] is why I think it is almost certain #2 is true. Few-shot inference involves no further training [unlike fine-tuning], so the improvement in ‘pattern recognition’, so to speak, is happening entirely at inference. It follows that the underlying model has general reasoning abilities, i.e. the ability to detect and repeat arbitrary patterns of ever-increasing complexity that occur in its input (conditioning) data.
Interestingly, the model fails to fully learn 4D and 5D arithmetic, where its zero-shot scores are very low, although few-shot inference does show improvement. I wonder whether problems of increasing complexity could also be solved with an increasing number of few-shot examples (say k=500 for 4D+), though of course this will run into the roadblock of context size very soon.
If an increasing number of few-shot examples lets it correctly solve ever-harder problems, there is a strong case for scaling the Reformer, with its context window of 1 million tokens, to a GPT-3-like size.
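To get a feel for when a large k for 4D+ would hit that roadblock, here is a minimal Python sketch that builds a k-shot addition prompt and estimates its size. The Q/A prompt format and the ~4 characters-per-token heuristic are my own assumptions, not taken from the paper, so treat the numbers as ballpark only:

```python
# Rough estimate of how large a k-shot arithmetic prompt gets.
# Assumptions: a hypothetical Q/A prompt format and a crude ~4 characters
# per token heuristic instead of a real BPE tokenizer, so these are
# ballpark figures only (digit strings tend to tokenize worse than this).
import random

def make_prompt(k, digits=4, seed=0):
    """Build a k-shot addition prompt followed by one unanswered query."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    shots = []
    for _ in range(k):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        shots.append(f"Q: What is {a} plus {b}?\nA: {a + b}\n")
    shots.append("Q: What is 4821 plus 7365?\nA:")
    return "".join(shots)

for k in (50, 500):
    prompt = make_prompt(k)
    est_tokens = len(prompt) // 4  # crude chars-per-token estimate
    print(f"k={k:4d}: ~{est_tokens} estimated tokens (GPT-3 context: 2048)")
```

Even with this generous estimate, k=500 blows well past GPT-3's 2048-token window, which is exactly what makes a long-context architecture like the Reformer interesting here.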
It would be fascinating to probe how much of this general reasoning capability arises from the size of the transformer itself, and how much arises from training on a large volume of language data. Does training on language implicitly impart the tools for all human symbolic reasoning?
A test that anybody with 1024 GPUs for even a few minutes could perform: load an untrained GPT-3-sized model, train it for a few steps on a few hundred 3D, 4D, and 5D calculations, and then test its inference (a scaled-down sketch follows below). This would help show whether these skills can be learnt absent a basis in language. It parallels a question about humans: can a human learn math without first learning language?
A success would indicate that general reasoning is an innate attribute of large transformers themselves. A failure, however, would not falsify general reasoning; it would imply that any general reasoning originates in language learning, which could explain why pre-trained models can perform arithmetic but untrained models can't.
[Note: my use of “trained” and “untrained” refers to pre-training on CommonCrawl.]
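For concreteness, here is what a scaled-down version of that probe might look like, assuming a HuggingFace transformers / PyTorch setup. Everything here is a placeholder for the real thing: a toy GPT-2-style config instead of a GPT-3-sized model, plain addition strings standing in for the full 3D/4D/5D mix, and an arbitrary prompt format and step count. The only pre-trained piece reused is the GPT-2 BPE vocabulary; the model weights are randomly initialised, i.e. “untrained” in the sense above.

```python
# Sketch of the proposed probe: a randomly initialised (no CommonCrawl
# pre-training) GPT-2-style model, briefly trained on a few hundred
# arithmetic strings, then asked to answer a held-out problem.
import random
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # reuses only the BPE vocab
tokenizer.pad_token = tokenizer.eos_token
config = GPT2Config(n_layer=4, n_head=4, n_embd=256)   # toy size, not GPT-3 scale
model = GPT2LMHeadModel(config)                        # random init: "untrained"

def example(digits, rng):
    """One addition problem rendered as a training string."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"Q: What is {a} plus {b}? A: {a + b}"

rng = random.Random(0)
train_texts = [example(d, rng) for d in (3, 4, 5) for _ in range(100)]  # a few hundred

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model.train()
for step in range(100):  # "a few steps"
    batch = tokenizer(rng.sample(train_texts, 8), return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Probe: does it complete an unseen 3D addition?
model.eval()
inputs = tokenizer("Q: What is 482 plus 736? A:", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8, do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The interesting variable is scale, which is precisely what a toy run like this cannot test; the sketch only pins down the procedure that would be run at GPT-3 size.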
GPT-3 made me update my prior for “scaling current techniques will get us to Superintelligence” from probably not (<30%) to likely (>60%). The phase shifts in many tasks mentioned by dxu, and the model's ability to perform non-linguistic reasoning at inference, are the reasons for this shift. I had tried a number of ways to make GPT-2 perform basic arithmetic and always failed, which was responsible for my earlier prior.
My updated models predict that a model one to two orders of magnitude bigger will almost certainly be able to utilise calculus, trigonometry, and derivations in a human-like way to reach conclusions, given a few examples.
Essentially, I see no evidence against the proposition that language, math, and abstract reasoning are points along the same continuum, and this paper provides strong evidence that they are; the difference between them is only one of complexity.