What do you make of the mechanistic mode connectivity, and linear connectivity papers then?
I didn’t read super carefully, but it seems like the former paper is saying that, for some definition of “mechanistically similar”:
Mechanistically dissimilar algorithms can be “mode connected”—that is, they are (approximate) local minima connected by a path of low loss (the paper proves this for their definition of “mechanistically similar”).
If two models aren’t linearly mode connected, then that means that they’re dissimilar (note that this is a conjecture, but I guess they probably find evidence for it).
I don’t think this is in much tension with the post?
My reading of the post is that two algorithms, with different generalization and structural properties, can lie in the same basin, and it uses evidence from our knowledge of the mechanisms behind grokking on synthetic data to make this point. But the above papers show that in more realistic settings, empirically, two models lie in the same basin (up to permutation symmetries) if and only if they have similar generalization and structural properties.
I think they only check whether two models lie in linearly connected bits of the same basin when the models have similar generalization properties? E.g. Figure 4 of Mechanistic Mode Connectivity is titled “Non-Linear Mode Connectivity of Mechanistically Dissimilar Models”, and the subtitle states that “quadratic paths can be easily identified to mode connect mechanistically dissimilar models[, and] linear paths cannot be identified, even after permutation”. Linear Connectivity Reveals Generalization Strategies seems to be focused on linear mode connectivity, rather than more general mode connectivity.
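To make the linear-connectivity check concrete: the standard test is to evaluate the loss along the straight line between two sets of weights and look for a barrier. Here is a minimal sketch, not taken from either paper; the `loss` function is a toy stand-in (a quartic with two symmetric minima) and all names are hypothetical.

```python
import numpy as np

def loss(params):
    # Toy stand-in loss: quartic with minima wherever every coordinate is +/-1,
    # so the straight line between opposite minima crosses a high-loss barrier.
    return np.sum((params**2 - 1.0)**2)

def barrier_along_linear_path(params_a, params_b, n_points=21):
    """Max loss along the straight line from params_a to params_b, minus the
    larger endpoint loss. A large positive value means the two solutions are
    NOT linearly mode connected (there is a loss barrier between them)."""
    alphas = np.linspace(0.0, 1.0, n_points)
    path_losses = [loss((1 - a) * params_a + a * params_b) for a in alphas]
    return max(path_losses) - max(loss(params_a), loss(params_b))

a = np.array([1.0, 1.0])
b = np.array([-1.0, -1.0])
print(barrier_along_linear_path(a, b))  # 2.0: barrier at the midpoint (0, 0)
```

The papers’ point is then that for mechanistically dissimilar models this barrier stays large even after permuting neurons, whereas a curved (e.g. quadratic) path can route around it.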
Mea culpa: AFAICT, the ‘proof’ in Mechanistic Mode Connectivity fails. It basically goes:
Prior work has shown that under overparametrization, all global loss minimizers are mode connected.
Therefore, mechanistically distinct global loss minimizers are also mode connected.
The problem is that prior work made the assumption that for a net of the right size, there’s only one loss minimizer up to permutation—aka there are no mechanistically distinct loss minimizers.
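For readers unfamiliar with the “up to permutation” caveat: relabeling the hidden units of a layer (and permuting the adjacent weight matrices to match) gives different parameters that compute exactly the same function, which is why distinct-looking minimizers can still count as “the same” solution. A quick sketch with a one-hidden-layer ReLU net (all weights made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hidden x input
b1 = rng.normal(size=4)        # hidden biases
W2 = rng.normal(size=(1, 4))   # output x hidden

def forward(x, W1, b1, W2):
    h = np.maximum(0.0, W1 @ x + b1)  # ReLU hidden layer
    return W2 @ h

# Permute the hidden units: rows of W1 and b1, columns of W2, consistently.
perm = rng.permutation(4)
x = rng.normal(size=3)
y_original = forward(x, W1, b1, W2)
y_permuted = forward(x, W1[perm], b1[perm], W2[:, perm])
print(np.allclose(y_original, y_permuted))  # True: same function, new params
```

So “only one minimizer up to permutation” is a genuinely strong assumption: it rules out exactly the mechanistically distinct minimizers the paper’s claim is about.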
[EDIT: the proof also cites Nguyen (2019) in support of its arguments. I haven’t checked the proof in Nguyen (2019), but if it holds up, it does substantiate the claim in Mechanistic Mode Connectivity—altho if I’m reading it correctly you need so much overparameterization that the neural net has a layer with as many hidden neurons as there are training data points.]
Update: I currently think that Nguyen (2019) proves the claim, but it actually requires a layer to have two hidden neurons per training example.
The second paper is just about linear connectivity, and does seem to suggest that linearly connected models run similar algorithms. But I guess I don’t expect neural net training to go in straight lines? (Altho I suppose momentum helps with this?)