This is a well-executed paper whose negative results do shake some of my faith in ChatGPT/LLMs/transformers.
I’m most intrigued by their negative result for GPT-4 prompted with a scratchpad. (B.5, figure 25.) This is something I would have definitely predicted would work. GPT-4 shows enough intelligence in general that I would expect it to be able to follow and mimic the step-by-step calculation abilities shown in the scratchpad, even if it were unable to figure out the result one- or few-shot (B.2, figure 15).
But what does this failure mean? I’m not sure I understand the authors’ conclusions: they state (3.2.3) that this “suggests that models are able to correctly perform single-step reasoning, potentially due to memorizing such single-step operations during training, but fail to plan and compose several of these steps for an overall correct reasoning.” I don’t see any evidence of that in the paper!
In particular, 3.2.3 and figure 7’s categorization of errors, as well as the theoretical results they discuss in section 4, give me the opposite impression. Basically they say that if you make a local error, it’ll propagate and screw you up. You can see, e.g., in figure 7’s five-shot GPT-4 example, how a single local error at graph layer 1 causes propagation error to start growing immediately. Later, more local errors kick in, but to me this is sort of understandable: once the calculation starts going off the rails, the model might not be in a good place to do even local reasoning.
I don’t see what any of this has to do with planning and composing! In particular I don’t see any measurement of something like “set up a totally wrong plan for multiplying numbers” or “fail to compose all the individual digit computation-steps into the final answer-concatenation step”. Such errors might exist, but the paper doesn’t give examples or any measurements of them. Its categorization of error types seems to assume that the model always produces a computation graph, which to me is pretty strong evidence of planning and composing abilities!
Stated another way: I suspect that if you eliminated all the local errors, accuracy would be good! So the question is: why is GPT-4 failing to multiply single-digit numbers sometimes, in the middle of these steps?
(It’s possible the answer lies in tokenization difficulties, but it seems unlikely.)
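To make the “eliminate the local errors” intuition a bit more concrete, here is a rough sketch of how per-step errors compound. It assumes (my assumption, not something the paper measures) that every primitive step fails independently with the same probability; the error rates and step counts are illustrative placeholders, not numbers from the paper.

```python
def p_all_steps_correct(per_step_error: float, num_steps: int) -> float:
    """Probability that every one of num_steps steps is correct,
    assuming each step fails independently with per_step_error."""
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return (1.0 - per_step_error) ** num_steps


if __name__ == "__main__":
    # Illustrative values only: hypothetical local error rates and step counts.
    for p in (0.001, 0.01, 0.05):
        for n in (10, 50, 100):
            print(f"local error {p:>5}: {n:>3} steps -> "
                  f"P(all correct) = {p_all_steps_correct(p, n):.3f}")
```

Under this toy model even a small local error rate pushes whole-problem accuracy down quickly as the number of steps grows, which fits the reading that local errors, rather than planning failures, could account for much of the drop.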
OK, now let’s look at it from another angle: how different is this from humans? What’s impressive to me about this result is that it is quite different. I was expecting to say something like, “oh, not every human will be able to get 100% accuracy on following the multiplication algorithm for 5-digit-by-5-digit numbers; it’s OK to expect some mistakes”. But GPT-4 fails to multiply 5-digit-by-5-digit numbers every time! Even with a scratchpad! Most educated humans would get better than zero accuracy.
So my overall takeaway is that local errors are still too prevalent in these sorts of tasks. Humans don’t always make at least one mistake in { one-digit multiplication, sum, mod 10, carry over, concatenation } when performing 5-digit by 5-digit multiplication, whereas GPT-4 supposedly does.
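For reference, here is a minimal sketch of the schoolbook algorithm broken into those primitive operations, with a scratchpad-style trace. This is my own illustration, not the paper’s prompt format or computation graph; the function name and trace wording are made up, and the final addition of partial products uses Python’s built-in sum rather than being decomposed digit by digit.

```python
def long_multiply(a: int, b: int):
    """Schoolbook multiplication of non-negative integers.

    Returns (product, scratchpad_lines, primitive_op_count).
    The op count covers one-digit multiplication, sum, mod 10, carry over,
    and concatenation; the final addition of partials is not decomposed.
    """
    if a < 0 or b < 0:
        raise ValueError("only non-negative integers are supported")
    a_digits = [int(d) for d in str(a)][::-1]  # least significant digit first
    b_digits = [int(d) for d in str(b)][::-1]
    scratchpad, partials, ops = [], [], 0

    for shift, db in enumerate(b_digits):
        carry, out_digits = 0, []
        for da in a_digits:
            prod = da * db            # one-digit multiplication
            total = prod + carry      # sum
            digit = total % 10        # mod 10
            new_carry = total // 10   # carry over
            ops += 4
            scratchpad.append(
                f"{da} x {db} + carry {carry} = {total}: "
                f"write {digit}, carry {new_carry}")
            out_digits.append(digit)
            carry = new_carry
        if carry:
            out_digits.append(carry)
        # concatenation of the row's digits, shifted by position of db
        partial = int("".join(str(d) for d in reversed(out_digits))) * 10 ** shift
        ops += 1
        partials.append(partial)
        scratchpad.append(f"partial product for digit {db}: {partial}")

    product = sum(partials)  # shortcut: not broken into digit-level steps
    scratchpad.append(f"sum of partials: {product}")
    return product, scratchpad, ops


if __name__ == "__main__":
    product, lines, ops = long_multiply(12345, 67891)
    print("\n".join(lines))
    print(f"product = {product}, primitive ops counted = {ops}")
```

By this rough count a 5-digit by 5-digit problem already involves over a hundred primitive operations before the final addition, so there are plenty of places for a single local slip.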
Am I understanding this correctly? Well, it’d be nice to reproduce their results to confirm. If they’re to be believed, I should be able to ask GPT-4 to do one of these multiplication tasks with a scratchpad, and always find an error in the middle. But when trying to reproduce their results, I ran into an issue of under-documented methodology (how did they compose the prompts?) and non-published data (what inaccurate things did the models actually say?). Filed on GitHub; we’ll see if they get back to me.
Regarding grokking, they attempt to test whether GPT-3 finetuned on these sorts of problems will exhibit grokking. However, I’m skeptical of this attempt: they trained for 60 epochs for zero-shot and 40 epochs with scratchpads, whereas the original grokking paper used between 3,571 and 50,000 epochs.
(I think epochs are probably the relevant measure here, rather than training steps. This paper does 420K and 30K steps whereas the original grokking paper does 100K steps, so if we were comparing steps, this paper’s training would seem reasonable. But “number of times you saw the whole data set” seems more relevant for grokking, in my uninformed opinion!)
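A quick back-of-the-envelope sketch of the steps-versus-epochs gap, using only the figures quoted above. The pairing of step counts with epoch counts (420K steps with 60 epochs, 30K steps with 40 epochs, and 100K steps at both ends of the grokking paper’s epoch range) is my assumption from the order in which the numbers appear, so treat the output as illustrative.

```python
def steps_per_epoch(total_steps: int, epochs: int) -> float:
    """Optimizer steps needed for one pass over the training set."""
    if epochs <= 0:
        raise ValueError("epochs must be positive")
    return total_steps / epochs


# (total_steps, epochs); pairings are assumed, see the note above.
runs = {
    "this paper, zero-shot": (420_000, 60),
    "this paper, scratchpad": (30_000, 40),
    "grokking paper, fewest epochs": (100_000, 3_571),
    "grokking paper, most epochs": (100_000, 50_000),
}

if __name__ == "__main__":
    for name, (steps, epochs) in runs.items():
        print(f"{name}: {steps_per_epoch(steps, epochs):,.0f} steps per epoch "
              f"over {epochs:,} epochs")
```

If those pairings are right, similar step budgets translate into very different numbers of passes over the data, which is why comparing steps alone can make the finetuning runs look more comparable to the grokking setup than they are.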
Has anyone actually seen LLMs (not just transformers) exhibit grokking? A quick search says no.