Short Remark on the (subjective) mathematical ‘naturalness’ of the Nanda–Lieberum addition modulo 113 algorithm

These remarks are basically me just wanting to get my thoughts down after a Twitter exchange on this subject. I’ve not spent much time on this post and it’s certainly plausible that I’ve gotten things wrong.

In the ‘Key Takeaways’ section of the Modular Addition part of the well-known post ‘A Mechanistic Interpretability Analysis of Grokking’, Nanda and Lieberum write:

“The model is trained to map $x, y$ to $z \equiv x + y \pmod{113}$ (henceforth 113 is referred to as $p$)”

And

“This algorithm operates via using trig identities and Discrete Fourier Transforms to map $x, y \mapsto \cos(w(x+y)), \sin(w(x+y))$, and then extracting $x + y \pmod{p}$”
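As a quick numerical illustration of what that quoted algorithm amounts to, here is a toy numpy sketch of my own (not the trained network): it uses a single arbitrarily chosen frequency, whereas the actual network combines several, and the frequency index k = 17 and the inputs are just example values.

```python
import numpy as np

# Toy sketch of the quoted algorithm for a single frequency (the network combines several).
p = 113
k = 17                          # arbitrary frequency index; w is the corresponding "key frequency"
w = 2 * np.pi * k / p
x, y = 45, 92

# Trig identities recover cos/sin of w(x+y) from cos/sin of wx and wy separately:
cos_sum = np.cos(w * x) * np.cos(w * y) - np.sin(w * x) * np.sin(w * y)   # cos(w(x+y))
sin_sum = np.sin(w * x) * np.cos(w * y) + np.cos(w * x) * np.sin(w * y)   # sin(w(x+y))

# Score each candidate answer z by cos(w(x+y-z)); this is maximal exactly at z = x+y (mod p).
z = np.arange(p)
scores = cos_sum * np.cos(w * z) + sin_sum * np.sin(w * z)
print(int(np.argmax(scores)), (x + y) % p)   # 24 24
```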
But the casual reader should use caution! It is in fact the case that “Inputs $x, y$ are given as one-hot encoded vectors in $\mathbb{R}^p$”. This point is of course emphasized more in the full notebook (it has to be, that’s where the code is), and the arXiv paper that followed is also much clearer about this point. However, when giving brief takeaways from the work, especially when it comes to discussing how ‘natural’ the learned algorithm is, I would go as far as saying that it is actually misleading to suggest that the network is literally given $x$ and $y$ as inputs. It is not trained to ‘act’ on the numbers $x, y$ themselves.
When thinking seriously about why the network is doing the particular thing that it is doing at the mechanistic level, I would want to emphasize that one-hotting is already a significant transformation. You have moved away from having the number $x$ be represented by its own magnitude. You instead have a situation in which $x$ and $y$ now really live ‘in the domain’ (it’s almost like a dual point of view: the number $x$ is not the size of the signal, but the position at which the input signal is non-zero).
So, while I of course fully admit that I too am looking at it through my own subjective lens, one might say that (before the embedding happens) it is more mathematically natural to think that what the network is ‘seeing’ as input is something like the indicator functions $t \mapsto 1_x(t)$ and $t \mapsto 1_y(t)$. Here, $t$ is something like the ‘token variable’ in the sense that these are functions on the vocabulary. And if we essentially ignore the additional tokens for | and =, we can think that these are functions on the group $\mathbb{Z}/p\mathbb{Z}$ and that we would like the network to learn to produce the function $t \mapsto 1_{x+y}(t)$ at its output neurons.
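To make this ‘functions on $\mathbb{Z}/p\mathbb{Z}$’ picture concrete, here is a tiny numpy illustration of my own (purely illustrative; the inputs 45 and 92 are example values) showing what the inputs and the desired output look like under this reading.

```python
import numpy as np

p = 113

def indicator(n: int) -> np.ndarray:
    """The one-hot input for token n, read as the indicator function 1_n on Z/pZ."""
    v = np.zeros(p)
    v[n % p] = 1.0
    return v

x, y = 45, 92
# The network never sees the magnitudes 45 and 92: each number is encoded purely by
# the position at which its input function is non-zero.
f, g = indicator(x), indicator(y)
target = indicator(x + y)                  # the desired output function 1_{x+y}
print(int(np.argmax(f)), int(np.argmax(g)), int(np.argmax(target)))   # 45 92 24
```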
In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis.
Notice that the operation you want to learn is manifestly a convolution operation, i.e.
$$(1_x \ast 1_y)(t) = \sum_s 1_x(t-s)\, 1_y(s) = 1_{x+y}(t).$$
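One can check this identity directly; the following brute-force numpy verification (again illustrative, not from the original work) computes the cyclic convolution on $\mathbb{Z}/p\mathbb{Z}$ of two one-hot inputs and confirms it lands on the one-hot answer.

```python
import numpy as np

p = 113

def indicator(n):
    v = np.zeros(p)
    v[n % p] = 1.0
    return v

def cyclic_conv(f, g):
    # (f * g)(t) = sum_s f(t - s) g(s), with the argument t - s taken mod p
    return np.array([sum(f[(t - s) % p] * g[s] for s in range(p)) for t in range(p)])

x, y = 45, 92
out = cyclic_conv(indicator(x), indicator(y))
print(np.allclose(out, indicator(x + y)))   # True: 1_x * 1_y = 1_{x+y}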
And (as I distinctly remember being made to practically chant in an ‘Analysis of Boolean Functions’ class given by Tom Sanders) the Fourier Transform is the (essentially unique) change of basis that simultaneously diagonalizes all convolution operations. This is coming close to saying something like: There is one special basis that makes the operation you want to learn uniquely easy to do using matrix multiplications, and that basis is the Fourier basis.
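In the same illustrative spirit, the convolution theorem behind that statement can be checked numerically: the Discrete Fourier Transform sends cyclic convolution on $\mathbb{Z}/p\mathbb{Z}$ to pointwise multiplication, which is exactly what it means for every convolution operator to be diagonal in the Fourier basis. The sketch below uses random functions on $\mathbb{Z}/p\mathbb{Z}$ rather than anything from the original experiments.

```python
import numpy as np

p = 113
rng = np.random.default_rng(0)
f, g = rng.normal(size=p), rng.normal(size=p)

# Brute-force cyclic convolution on Z/pZ ...
conv = np.array([sum(f[(t - s) % p] * g[s] for s in range(p)) for t in range(p)])

# ... becomes pointwise multiplication after the Discrete Fourier Transform.
print(np.allclose(np.fft.fft(conv), np.fft.fft(f) * np.fft.fft(g)))   # True
```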