Have you looked at the NLP tasks they evaluated it on?
Yes. Nothing I’ve seen suggests GPT-2 would successfully solve simple formal problems like the one I mentioned in the grandparent (unless a very similar problem appears in the training data—e.g. the exact same problem but with different labels).
I don’t know why you would think that would be such a barrier. You don’t need Transformers at all to do analogical reasoning, and both the CoQA and SQuAD results suggest at least some ‘modest logic-related stuff’ is going on. If you put your exact sample into the public/small GPT-2 model, it’ll even generate syntactically correct list completions and additional lists which are somewhat more sorted than not.
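For reference, here is one minimal way to try this kind of check with the small public GPT-2, assuming the Hugging Face transformers library (a sketch, not the exact setup used above; the original prompt isn’t reproduced in this thread, so an illustrative sorting prompt stands in for it):

```python
# Sketch: feed a sorting-style prompt to the small public GPT-2 checkpoint via
# the Hugging Face `transformers` library and look at a few sampled completions.
# The prompt is illustrative; the exact example from the grandparent comment
# is not reproduced in this thread.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # the small public model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The list [9, 2, 1, 6, 8] sorted in increasing order is ["
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=True,                     # sampling, as in the usual GPT-2 demos
        top_k=40,
        num_return_sequences=5,
        pad_token_id=tokenizer.eos_token_id,
    )

for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```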
We might be interpreting “modest logic-related stuff” differently—I am thinking about simple formal problems like sorting a short list of integers.
I wouldn’t be surprised if GPT-2 (or its smaller version) is very capable at completing strings like “[1,2,” in a way that is merely syntactically correct. Publicly available text on the internet probably contains a lot of comma-separated number lists in brackets. The challenge is for the model to actually be able to sort numbers (when it was trained only to predict the next word in internet text).
However, after thinking about it more, I am now less confident that GPT-2 would fail to complete the sentence above with a correctly sorted list, because for any two small integers like 2 and 3 it is plausible that the training data contains more “2,3” strings than “3,2” strings.
Consider instead the following problem:
“The median number in the list [9,2,1,6,8] is ”
I’m pretty sure that GPT-2 would fail at least 1/5 of the time to complete such a sentence correctly (i.e. if we query it multiple times, each time with small random integers in the list).
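To make the proposed test concrete, here is a rough sketch of the protocol: generate many such prompts with small random integers, compute the true median with the standard library, and count how often the completion starts with it. The query_model function is a placeholder for however one actually queries GPT-2 (e.g. with the generation snippet earlier in the thread):

```python
# Rough sketch of the proposed test: build many median prompts with small
# random integers and measure how often the model's completion starts with
# the true median. `query_model` is a hypothetical placeholder for an actual
# GPT-2 query.
import random
import statistics

def make_problem(rng: random.Random, k: int = 5, lo: int = 1, hi: int = 9):
    nums = rng.sample(range(lo, hi + 1), k)   # k distinct small integers (odd k, so the median is one of them)
    prompt = f"The median number in the list [{','.join(map(str, nums))}] is "
    return prompt, statistics.median(nums)

def query_model(prompt: str) -> str:
    raise NotImplementedError("placeholder: return GPT-2's completion of the prompt")

def failure_rate(n_trials: int = 100, seed: int = 0) -> float:
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_trials):
        prompt, answer = make_problem(rng)
        completion = query_model(prompt).strip()
        if not completion.startswith(str(answer)):
            failures += 1
    return failures / n_trials
```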
GPT-2 works by deterministically computing the probability distribution over the next token and then sampling from it. It is plausible that the probability it assigns to “6” is no larger than 80%, but it’s simple enough to post-process every probability larger than 50% up to 100%. (This isn’t always done because, when completing a list prefix of length 4, it would then always produce an infinite list, since the probability of another “,” is more than 50%.)
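Concretely, one can inspect the next-token distribution directly and compare ordinary sampling with that post-processing rule, which amounts to committing to the modal token whenever it has more than 50% of the mass. A sketch, again assuming the Hugging Face transformers library:

```python
# Sketch: inspect GPT-2's next-token distribution for the median prompt, then
# compare ordinary sampling with the "round any probability above 50% up to
# 100%" rule, i.e. committing to the modal token whenever it has majority mass.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The median number in the list [9,2,1,6,8] is "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]        # scores for the next token
probs = torch.softmax(logits, dim=-1)

top_p, top_id = probs.max(dim=-1)
print(f"modal next token {tokenizer.decode([int(top_id)])!r} has probability {top_p.item():.2f}")

sampled = int(torch.multinomial(probs, num_samples=1))          # what sampling does
committed = int(top_id) if top_p.item() > 0.5 else sampled      # the proposed post-processing
print("sampled:  ", tokenizer.decode([sampled]))
print("committed:", tokenizer.decode([committed]))
```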
DeepMind has shown that Transformers trained on natural text descriptions of math problems can solve them at well above random: “Analysing Mathematical Reasoning Abilities of Neural Models”, Saxton et al 2019:

Mathematical reasoning—a core ability within human intelligence—presents some unique challenges as a domain: we do not come to understand and solve mathematical problems primarily on the back of experience and evidence, but on the basis of inferring, learning, and exploiting laws, axioms, and symbol manipulation rules. In this paper, we present a new challenge for the evaluation (and eventually the design) of neural architectures and similar systems, developing a task suite of mathematics problems involving sequential questions and answers in a free-form textual input/output format. The structured nature of the mathematics domain, covering arithmetic, algebra, probability and calculus, enables the construction of training and test splits designed to clearly illuminate the capabilities and failure-modes of different architectures, as well as evaluate their ability to compose and relate knowledge and learned processes. Having described the data generation process and its potential future expansions, we conduct a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and find notable differences in their ability to resolve mathematical problems and generalize their knowledge.
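To make the task format concrete: the suite is generated programmatically as free-form textual questions and answers, one module per topic. A toy illustration of that format (not DeepMind’s released generator, just a sketch in its spirit):

```python
# Toy illustration (not DeepMind's released generator) of the free-form
# question/answer format the task suite uses: each item is a short natural-text
# question plus a textual answer, produced programmatically per module
# (arithmetic, algebra, etc.).
import random

def arithmetic_item(rng: random.Random):
    a, b = rng.randint(-99, 99), rng.randint(1, 99)
    return f"What is {a} + {b}?", str(a + b)

def linear_equation_item(rng: random.Random):
    # Build a*w + b = c with an integer solution w.
    a, w, b = rng.randint(1, 9), rng.randint(-9, 9), rng.randint(0, 9)
    return f"Solve {a}*w + {b} = {a * w + b} for w.", str(w)

rng = random.Random(0)
for make_item in (arithmetic_item, linear_equation_item):
    question, answer = make_item(rng)
    print(f"Q: {question}\nA: {answer}")
```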
And this sounds like goal post moving:
unless a very similar problem appears in the training data—e.g. the exact same problem but with different labels
GPT-3 can do arithmetic with zero arithmetic training: https://arxiv.org/pdf/2005.14165.pdf#page=21

First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition, 98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on 5-digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves 29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves 21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness beyond just single operations. As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all other operations less than 10% of the time.
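Roughly, the few-shot setup amounts to putting K solved examples in the prompt, letting the model complete the final line, and scoring exact match on the answer. A sketch of that evaluation loop (a reconstruction of the general idea, not the paper’s harness; complete is a hypothetical stand-in for whatever model or API produces the completion):

```python
# Sketch of few-shot arithmetic evaluation (general idea only, not the paper's
# exact harness): K solved examples go in the prompt, the model completes the
# final line, and exact-match accuracy is recorded. `complete` is a
# hypothetical stand-in for whatever model/API produces text.
import random
from typing import Callable, Tuple

def few_shot_addition_prompt(rng: random.Random, k_shots: int = 8, digits: int = 2) -> Tuple[str, str]:
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    lines = []
    for _ in range(k_shots):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        lines.append(f"Q: What is {a} plus {b}? A: {a + b}")
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    lines.append(f"Q: What is {a} plus {b}? A:")
    return "\n".join(lines), str(a + b)

def accuracy(complete: Callable[[str], str], n_trials: int = 100, seed: int = 0) -> float:
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        prompt, answer = few_shot_addition_prompt(rng)
        completion = complete(prompt).strip()
        first_token = completion.split()[0].rstrip(".") if completion else ""
        correct += first_token == answer
    return correct / n_trials
```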
I’m failing to see any goal-post moving between me writing:
and then later writing (in reply to your comment quoting that sentence):
If I’m missing something I’d be grateful for a further explanation.
In what sense is being able to do addition or subtraction with different numbers, for example (which is what it means to learn addition or subtraction), not ‘the exact same problem but with different labels’?
Thank you for clarifying!
FWIW, when I wrote “the exact same problem but with different labels” I meant “the exact same problem but with different arbitrary names for entities”.
For example, I would consider the following two problems to be “the exact same problem but with different labels”:
“X+1=2 therefore X=”
“Y+1=2 therefore Y=”
But NOT the following two problems:
“1+1=”
“1+2=”
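One way to make this distinction mechanical: two problem strings are “the same up to labels” if they become identical after consistently renaming their arbitrary names, while changing a constant changes the problem. A rough sketch (treating every alphabetic token as a renameable label, which is cruder than a real parser but enough for these examples):

```python
# Sketch: two problem strings count as "the exact same problem but with
# different labels" if they become identical after consistently renaming
# alphabetic tokens (a crude stand-in for "arbitrary entity names");
# changing a constant does not.
import re

def canonicalize_labels(problem: str) -> str:
    mapping = {}
    def rename(match: re.Match) -> str:
        name = match.group(0)
        mapping.setdefault(name, f"v{len(mapping)}")   # first label -> v0, second -> v1, ...
        return mapping[name]
    return re.sub(r"[A-Za-z]+", rename, problem)

def same_up_to_labels(p1: str, p2: str) -> bool:
    return canonicalize_labels(p1) == canonicalize_labels(p2)

print(same_up_to_labels("X+1=2 therefore X=", "Y+1=2 therefore Y="))  # True
print(same_up_to_labels("1+1=", "1+2="))                              # False
```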