In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis.
I disagree at least with the “almost completely” version of this claim:
Notice that the operation you want to learn is manifestly a convolution operation, i.e.
(1x∗1y)(t)=∑s1x(t−s)1y(s)=1x+y(t).
This also applies to the non-modular addition operation, but I think it’s pretty plausible that if you train on non-modular addition (to the point of ~perfect generalization), the network would learn an embedding that converts the “tokenized” representation back into the “magnitude” representation, and then simply adds them as normal.
Some evidence for this:
I’ve heard it claimed that an LLM represented the number of characters in a token as a magnitude, which was used for deciding whether a line of code was > 80 characters (useful for predicting line breaks).
This paper trains on non-modular addition and gets this result. (Note however that the paper has a highly unusual setting that isn’t representative of typical network training, and arguably the setup is such that you have to get this result: in particular the architecture enforces that the embeddings of the two inputs are added together, which wouldn’t work in a Fourier basis. I cite it as evidence that networks do learn magnitude representations when forced to do so.)
It seems like if you believe “when the operation you want to learn is a convolution operation, then you will learn the Fourier basis”, you should also believe that you’ll get a Fourier basis for non-modular addition on one-hot-encoded numbers, and currently my guess is that that’s false.
Fwiw, I agree that the algorithm is quite “mathematically natural” (indeed, one person came up with the algorithm independently, prompted by “how would you solve this problem if you were a Transformer”), though I feel like the “modular” part is pretty crucial for me (and the story I’d tell would be the one in Daniel’s comment).
Thanks for the comment Rohin, that’s interesting (though I haven’t looked at the paper you linked).
I’ll just record some confusion I had after reading your comment that stopped me replying initially: I was confused by the distinction between modular and non-modular because I kept thinking: If I add a bunch of numbers x and y and don’t do any modding, then it is equivalent to doing modular addition modulo some large number (i.e. at least as large as the largest sum you get). And otoh if I tell you I’m doing ‘addition modulo 113’, but I only ever use inputs that add up to 112 or less, then you never see the fact that I was secretly intending to do modular addition. And these thoughts sort of stopped me having anything more interesting to add.
I agree—the point is that if you train on addition examples without any modular wraparound (whether you think of that as regular addition or modular addition with a large prime, doesn’t super matter), then there is at least some evidence that you get a different representation than the one Nanda et al found.
I disagree at least with the “almost completely” version of this claim:
This also applies to the non-modular addition operation, but I think it’s pretty plausible that if you train on non-modular addition (to the point of ~perfect generalization), the network would learn an embedding that converts the “tokenized” representation back into the “magnitude” representation, and then simply adds them as normal.
Some evidence for this:
I’ve heard it claimed that an LLM represented the number of characters in a token as a magnitude, which was used for deciding whether a line of code was > 80 characters (useful for predicting line breaks).
This paper trains on non-modular addition and gets this result. (Note however that the paper has a highly unusual setting that isn’t representative of typical network training, and arguably the setup is such that you have to get this result: in particular the architecture enforces that the embeddings of the two inputs are added together, which wouldn’t work in a Fourier basis. I cite it as evidence that networks do learn magnitude representations when forced to do so.)
It seems like if you believe “when the operation you want to learn is a convolution operation, then you will learn the Fourier basis”, you should also believe that you’ll get a Fourier basis for non-modular addition on one-hot-encoded numbers, and currently my guess is that that’s false.
Fwiw, I agree that the algorithm is quite “mathematically natural” (indeed, one person came up with the algorithm independently, prompted by “how would you solve this problem if you were a Transformer”), though I feel like the “modular” part is pretty crucial for me (and the story I’d tell would be the one in Daniel’s comment).
Thanks for the comment Rohin, that’s interesting (though I haven’t looked at the paper you linked).
I’ll just record some confusion I had after reading your comment that stopped me replying initially: I was confused by the distinction between modular and non-modular because I kept thinking: If I add a bunch of numbers x and y and don’t do any modding, then it is equivalent to doing modular addition modulo some large number (i.e. at least as large as the largest sum you get). And otoh if I tell you I’m doing ‘addition modulo 113’, but I only ever use inputs that add up to 112 or less, then you never see the fact that I was secretly intending to do modular addition. And these thoughts sort of stopped me having anything more interesting to add.
I agree—the point is that if you train on addition examples without any modular wraparound (whether you think of that as regular addition or modular addition with a large prime, doesn’t super matter), then there is at least some evidence that you get a different representation than the one Nanda et al found.