I believe a significant chunk of the issue with numbers is that the tokenization is bad (not per-digit), which is the same underlying cause of being bad at spelling. The model then has to memorize, from limited examples, which digits actually make up each number token. The xVal paper encodes numbers as literal numeric values, which helps. There’s also Teaching Arithmetic to Small Transformers, which I only partly remember, but one of the things they do is per-digit tokenization plus reversing the digit order of the answer (since that works better with left-to-right generation; rough sketch below).
(I don’t know if anyone has applied methods in this vein to a model larger than those relatively small ones; I think the second paper uses a ~124M-parameter model.)
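Just to illustrate the reversal point, here’s a toy sketch (my own made-up formatting, not the paper’s actual code): with the answer written least-significant-digit first, each output digit depends only on operand digits and the carry so far, which suits generating left to right.

```python
# Illustrative sketch only -- not code from "Teaching Arithmetic to Small
# Transformers"; the function names and format are hypothetical.

def per_digit_tokenize(s: str) -> list[str]:
    """Split a number into one token per character, instead of letting a
    BPE vocabulary chunk it into arbitrary pieces like '123' + '45'."""
    return list(s)

def addition_example_reversed(a: int, b: int) -> str:
    """Format a training example with the answer's digits reversed
    (least significant first). Generated left to right, each answer digit
    then depends only on operand digits and the running carry, with no
    need to 'look ahead' to digits not yet produced."""
    return f"{a}+{b}={str(a + b)[::-1]}"

print(per_digit_tokenize("12345"))         # ['1', '2', '3', '4', '5']
print(addition_example_reversed(357, 86))  # 357+86=344  (i.e. 443 reversed)
```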
Though I agree that there are a bunch of errors LLMs make that are hard for them to avoid, since they have no easy temporary scratchpad-like mechanism.
They can certainly use answer text as a scratchpad (even nonfunctional text that just gives more space for hidden activations to flow). But they don’t do so without explicit training. Actually, maybe they do: maybe RLHF incentivizes a verbose style precisely because it gives more room for thought. But I think even with “thinking step by step,” there are still plenty of issues.
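For concreteness, here’s a toy example of what an explicit scratchpad for addition could look like as generated text; the format is made up, not from any particular paper or model. The point is that the intermediate state (column sums and carries) lives in the output text rather than having to be held implicitly in the activations.

```python
# Toy illustration of a written-out "scratchpad" for addition; the format
# here is hypothetical, not a real model's or paper's scheme.

def addition_with_scratchpad(a: int, b: int) -> str:
    """Spell out the column-by-column work (digit sums and carries) before
    the final answer, so intermediate state appears in the generated text."""
    xs, ys = str(a)[::-1], str(b)[::-1]
    carry, steps, digits = 0, [], []
    for i in range(max(len(xs), len(ys))):
        da = int(xs[i]) if i < len(xs) else 0
        db = int(ys[i]) if i < len(ys) else 0
        total = da + db + carry
        digits.append(str(total % 10))
        steps.append(f"{da}+{db}+carry {carry} = {total}, write {total % 10}")
        carry = total // 10
    if carry:
        digits.append(str(carry))
    answer = "".join(reversed(digits))
    return f"{a}+{b}:\n" + "\n".join(steps) + f"\nanswer: {answer}"

print(addition_with_scratchpad(357, 86))
# 357+86:
# 7+6+carry 0 = 13, write 3
# 5+8+carry 1 = 14, write 4
# 3+0+carry 1 = 4, write 4
# answer: 443
```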
Tokenization is definitely a contributor. But that doesn’t really support the notion that there’s an underlying human-like cognitive algorithm behind the human-like text output. The point is that the way it adds numbers is very inhuman, even though it produces human-like output in the most common/easy cases.
I definitely agree that it doesn’t give reason to support a human-like algorithm; I was focusing on the part about adding numbers reliably.