Mathematical reasoning—a core ability within human intelligence—presents some unique challenges as a domain: we do not come to understand and solve mathematical problems primarily on the back of experience and evidence, but on the basis of inferring, learning, and exploiting laws, axioms, and symbol manipulation rules. In this paper, we present a new challenge for the evaluation (and eventually the design) of neural architectures and similar systems, developing a task suite of mathematics problems involving sequential questions and answers in a free-form textual input/output format. The structured nature of the mathematics domain, covering arithmetic, algebra, probability and calculus, enables the construction of training and test splits designed to clearly illuminate the capabilities and failure-modes of different architectures, as well as evaluate their ability to compose and relate knowledge and learned processes. Having described the data generation process and its potential future expansions, we conduct a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and find notable differences in their ability to resolve mathematical problems and generalize their knowledge.
And this sounds like goal post moving:
unless a very similar problem appears in the training data—e.g. the exact same problem but with different labels
First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition, 98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on 5-digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves 29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves 21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness beyond just single operations. As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all other operations less than 10% of the time.
In what sense is being able to do addition or subtraction with different numbers, for example, which is what it means to learn addition or subtraction, not ‘the exact same problem but with different labels’?
DeepMind has shown that Transformers trained on natural text descriptions of math problems can solve them at well above random: “Analysing Mathematical Reasoning Abilities of Neural Models”, Saxton et al 2019:
GPT-3 can do arithmetic with zero arithmetic training: https://arxiv.org/pdf/2005.14165.pdf#page=21
I’m failing to see a goal-post-moving between me writing:
and then later writing (in reply to your comment quoting that sentence):
If I’m missing something I’d be grateful for a further explanation.
In what sense is being able to do addition or subtraction with different numbers, for example, which is what it means to learn addition or subtraction, not ‘the exact same problem but with different labels’?
Thank you for clarifying!
FWIW, when I wrote “the exact same problem but with different labels” I meant “the exact same problem but with different arbitrary names for entities”.
For example, I would consider the following two problems to be “the exact same problem but with different labels”:
“X+1=2 therefore X=”
“Y+1=2 therefore Y=”
But NOT the following two problems:
“1+1=”
“1+2=”
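To make the distinction concrete, here is a minimal sketch of one way to test whether two problems are "the exact same problem but with different labels". It is only an illustration: the function names (`canonicalize`, `same_up_to_labels`) are made up for this example, and it assumes that entity names are single uppercase letters, which happens to hold for the examples above but not in general.

```python
import re


def canonicalize(problem: str) -> str:
    """Rename each distinct single-letter uppercase name to V0, V1, ...
    in order of first appearance, leaving numbers and operators untouched.

    Assumption (for illustration only): entity names are single
    uppercase letters such as X or Y.
    """
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"V{len(mapping)}"
        return mapping[name]

    return re.sub(r"\b[A-Z]\b", rename, problem)


def same_up_to_labels(a: str, b: str) -> bool:
    """True if the two problems differ only by a consistent renaming of names."""
    return canonicalize(a) == canonicalize(b)


# Differ only in the arbitrary name of the unknown: same problem.
print(same_up_to_labels("X+1=2 therefore X=", "Y+1=2 therefore Y="))
# Differ in the numbers themselves: not the same problem.
print(same_up_to_labels("1+1=", "1+2="))
```

Under this reading, the first pair collapses to the same canonical string, while the second pair does not, because the numbers are part of the problem and not arbitrary names.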