I don’t see how the mechanistic interpretability of grokking analysis is evidence against this.
At the start of training, the modular addition network is quickly evolving to get increasingly better training loss by overfitting on the training data. Every time it gets an answer in the training set right that it didn’t before, it has to have moved from one behavioural manifold in the loss landscape to another. It’s evolved a new tiny piece of circuitry, making it no longer the same algorithm it was a couple of batch updates ago.
Eventually, it reaches the zero loss manifold. This is a mostly connected subset of parameter space. I currently like to visualise it as a canyon landscape, though in truth it is much more high dimensional. It is made of many basins, some broad (high dimensional), some narrow (low dimensional), connected by paths, some straight, some winding.
> A path through the loss landscape visible in 3D doesn’t correspond to how and what the neural network is actually learning. Almost all of the changes to the loss are due to the increasingly good implementation of Algorithm 1; but apparently, the entire time, the gradient also points towards some faraway implementation of Algorithm 2.
In the broad basin picture, there aren’t just two algorithms here, but many. Every time the neural network constructs a new internal elementary piece of circuitry, that corresponds to moving from one basin in this canyon landscape to another. Between the point where the loss flatlines and the point where grokking happens, the network is moving through dozens of different basins or more. Eventually, it arrives at the largest, most high dimensional basin in the landscape, and there it stays.
> the entire time the neural network’s parameters visibly move down the wider basin
I think this might be the source of confusion here. Until grokking finishes, the network isn’t even in that basin yet. You can’t be in multiple basins simultaneously.
At the time the network is learning the pieces of what you refer to as algorithm 2, it is not yet in the basin of algorithm 2. Likewise, if you went into the finished network sitting in the basin of algorithm 2 and added some additional internal piece of circuitry into it by changing the parameters, that would take it out of the basin of algorithm 2 and into a different, narrower one. Because it’s not the same algorithm any more. It now has a higher effective parameter count, that is, a larger real log canonical threshold (RLCT).
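For context on why the real log canonical threshold reads as an effective parameter count: in singular learning theory it is the coefficient of the log n term in the asymptotic expansion of the Bayesian free energy, and for a regular model with d parameters it equals d/2. The notation below is the standard textbook one, not something taken from the grokking paper, so treat it as a reference sketch rather than part of the argument.

```latex
\[
F_n \;=\; n L_n(w_0) \;+\; \lambda \log n \;-\; (m - 1)\log\log n \;+\; O_p(1),
\qquad
\lambda = \tfrac{d}{2} \ \text{and} \ m = 1 \ \text{for regular models.}
\]
% F_n      : Bayesian free energy after n samples
% L_n(w_0) : empirical loss (negative log-likelihood per sample) at an optimal parameter w_0
% \lambda  : real log canonical threshold (RLCT)
% m        : multiplicity of the RLCT
```

At equal training loss, a solution with smaller λ has lower free energy, which is the sense in which the broad, high dimensional basins are the favoured ones and adding an extra piece of circuitry moves you to a narrower one.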
Points in the same basin correspond to the same algorithm. But it really does have to be the same algorithm. The definition is quite strict here. What you refer to as superpositions of algorithm 1 and algorithm 2 each sit in a different basin in parameter space. Basins are regions where every point maps to the same algorithm, and all of those superpositions are different algorithms.
> Doesn’t Figure 7, top left from the arXiv paper provide evidence against the “network is moving through dozens of different basins or more” picture?
… No?
I don’t see what part of the graphs would lead to that conclusion. As the paper says, there are memorization, circuit formation and cleanup phases. Everywhere along these lines in the three phases, the network is building up or removing pieces of internal circuitry. Every time an elementary piece of circuitry is added or removed, that corresponds to moving into a different basin (convex subset?).
Points in the same basin are related by internal symmetries. They correspond to the same algorithm, not just in the sense of having the same input-output behavior on the training data (all points on the zero loss manifold have that in common), but also in sharing common intermediate representations. If one solution has a piece of circuitry another doesn’t, they can’t be part of the same basin, because you can’t transform them into each other through internal symmetries.
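To make "related by internal symmetries" concrete, here is a minimal toy sketch. It assumes a two-layer ReLU network rather than the modular addition transformer, and it only covers two well-known symmetries (relabelling hidden units, and rescaling a unit's input weights by c > 0 while dividing its output weight by c). All function and variable names are made up for illustration. The point is that the transformed parameters give the same outputs and the same hidden representations up to relabelling and scale, which is the sense of "same algorithm" used above.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def forward(w1, w2, x):
    """Toy two-layer ReLU network: returns (hidden activations, scalar output)."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    out = sum(v * h for v, h in zip(w2, hidden))
    return hidden, out

def permute_hidden(w1, w2, perm):
    """Relabel hidden units: new unit i is old unit perm[i]."""
    return [w1[j] for j in perm], [w2[j] for j in perm]

def rescale_hidden(w1, w2, scales):
    """Scale unit i's input weights by c_i > 0 and its output weight by 1 / c_i."""
    new_w1 = [[c * w for w in row] for row, c in zip(w1, scales)]
    new_w2 = [v / c for v, c in zip(w2, scales)]
    return new_w1, new_w2

rng = random.Random(0)
n_in, n_hidden = 3, 4
w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
inputs = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(20)]

perm = [2, 0, 3, 1]            # arbitrary relabelling of the hidden units
scales = [0.5, 2.0, 3.0, 1.5]  # arbitrary positive rescalings

# Apply both symmetries: a different point in parameter space.
w1_sym, w2_sym = rescale_hidden(*permute_hidden(w1, w2, perm), scales)

max_output_gap = 0.0
max_hidden_gap = 0.0
for x in inputs:
    hidden, out = forward(w1, w2, x)
    hidden_sym, out_sym = forward(w1_sym, w2_sym, x)
    max_output_gap = max(max_output_gap, abs(out - out_sym))
    # Hidden representations match up to the same relabelling and scale.
    for i in range(n_hidden):
        expected = scales[i] * hidden[perm[i]]
        max_hidden_gap = max(max_hidden_gap, abs(hidden_sym[i] - expected))

print("largest output difference:", max_output_gap)
print("largest hidden difference:", max_hidden_gap)
```

Both gaps come out at floating-point noise level. By contrast, giving one of the two networks an extra hidden unit that actually fires is not something either transformation can produce, which is the case where the two solutions cannot be in the same basin.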
So the network is moving through different basins all along those graphs.