TL;DR: I agree that the answer to the question above definitely isn’t always yes, because of your counterexample, but I think that moving forward on a similar research direction might be useful anyway.
One can imagine partitioning the parameter space into sets of points that arrive at the same basin, where each model in a basin has the same, locally optimal performance. Each basin is like a Rashomon set (relaxing the requirement from global to local minima, so that we get a partition of the space). Models which can compress the training data (and thus have free parameters) are generally more likely to be found by random selection and search: the free parameters mean that the dimensionality of such a set is higher, and hence it is exponentially more likely to be reached.
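To make the dimensionality argument concrete, here is a toy numerical sketch (my own illustration, idealizing a “flat minimum” as a coordinate subspace; none of this is from the post): the set of parameters within ε of a d-dimensional flat set of minima has volume on the order of ε^(D−d), so random initialization lands near higher-dimensional minima exponentially more often as d grows.

```python
import numpy as np

# Toy sketch (my own, not from the post): in a D-dimensional parameter space,
# the points within eps of a d-dimensional "flat" set of minima occupy volume
# ~ eps^(D - d), so random initialization hits higher-dimensional (more
# compressive, more free-parameter) minima exponentially more often.
rng = np.random.default_rng(0)
D, eps, n_samples = 6, 0.3, 1_000_000
samples = rng.uniform(-1.0, 1.0, size=(n_samples, D))

for d in range(0, D + 1, 2):
    # Idealize the minimizing set as the subspace where the last D - d
    # coordinates are zero; the distance to it is the norm of those coordinates.
    dist = np.linalg.norm(samples[:, d:], axis=1)
    print(f"dim of minimizing set d={d}: fraction of random inits within eps = {np.mean(dist < eps):.1e}")
```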
Thus, we can move within these high-dimensional regions of locally optimal loss, which could allow us to find models that are more interpretable (or more desirable along some other axis). This is the stated motivation in the article:
Ultimately, we hope that the study of equivalently optimal models would lead to advances in interpretability: for example, by producing models that are simultaneously optimal and interpretable.
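As a hedged sketch of what moving within such a region could look like in the simplest possible setting (my own toy example, not the article’s method): for an overparametrized linear model, the zero-training-loss set {w : Xw = y} is an affine subspace, so we can descend an interpretability proxy (here, an L1 sparsity penalty) and project back onto that subspace after every step, keeping the training loss optimal while the weights become sparser.

```python
import numpy as np

# Hedged toy sketch (mine, not the article's method): slide along the
# zero-training-loss set of an overparametrized linear model towards a
# sparser (arguably more interpretable) weight vector.
rng = np.random.default_rng(1)
n, d = 20, 50                                   # fewer data points than parameters
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[:3] = [2.0, -1.0, 0.5]   # sparse ground truth
y = X @ w_true                                  # noiseless, so zero loss is attainable

def project_to_zero_loss(w):
    # Orthogonal projection onto the affine subspace {w : Xw = y}.
    return w - X.T @ np.linalg.solve(X @ X.T, X @ w - y)

w = project_to_zero_loss(rng.normal(size=d))    # a random point on the zero-loss set
for _ in range(20_000):
    w = project_to_zero_loss(w - 1e-3 * np.sign(w))   # sparsify, then re-project

print("train MSE:", np.mean((X @ w - y) ** 2))             # stays ~0 throughout
print("weights with |w_i| > 0.05:", int(np.sum(np.abs(w) > 0.05)))
```

(The projection only has a closed form because the model is linear in w; for a real network one would presumably substitute a few steps of ordinary training for the projection.)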
This seems super relevant to alignment! The default path to AGI right now seems to me to be something like an LLM world model hooked up to some RL to make it more agenty, and I expect this kind of theory to apply to LLMs because of their large number of parameters. I’m hoping that this theory gets us better predictions about which Rashomon sets are found (this would look like a selection theorem), and the ability to move within a Rashomon set towards parameters that are better. Such a selection theorem seems likely because of the dimensionality argument above.
Edit: Adding a link to “Git Re-Basin: Merging Models modulo Permutation Symmetries,” a relevant paper that has recently been posted on arXiv.
Thank you so much, Thomas and Buck, for reading the post and for your insightful comments!
It is indeed true that some functions have two global minimizers that are not path-connected. Empirically, very overparametrized models which are trained on “non-artificial” datasets (“datasets from nature”?) seem to have a connected Rashomon manifold. It would definitely be helpful to know theoretically why this tends to happen, and when this wouldn’t happen.
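A degenerate but concrete illustration of the connected case (my own toy, linear in the parameters, so it says nothing about why genuinely nonlinear networks behave this way): for an overparametrized linear model the zero-loss set is affine, so two minimizers reached from different initializations are joined by a straight path that stays at zero loss.

```python
import numpy as np

# Toy check (mine): for a model linear in its parameters, the zero-loss set
# {w : Xw = y} is affine, so the straight-line path between two independently
# found minimizers stays at (numerically) zero loss. Nonlinear networks only
# show this empirically, and often only after aligning symmetries.
rng = np.random.default_rng(2)
n, d = 20, 50
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def fit_from(w0, lr=1e-2, steps=5_000):
    # Plain gradient descent on MSE; different inits end at different minimizers.
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / n
    return w

w_a, w_b = fit_from(rng.normal(size=d)), fit_from(rng.normal(size=d))
print("distance between the two minimizers:", np.linalg.norm(w_a - w_b))
for t in np.linspace(0.0, 1.0, 5):
    w_t = (1 - t) * w_a + t * w_b
    print(f"t={t:.2f}  train MSE along the path: {np.mean((X @ w_t - y) ** 2):.1e}")
```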
One heuristic argument for why two disconnected global minimizers might only happen with “artificial” datasets goes something like this. Given two quantities, one is generically larger than the other, unless there is a symmetry-based reason why they are actually secretly the same quantity. Under this heuristic, a non-overparametrized model’s loss landscape has a global minimum achieved by precisely one point, and potentially some suboptimal local minima as well. But overparametrizing the model means the suboptimal local minima are no longer local minima (perhaps by turning them into saddle points?), while the single global minimizer is “stretched out” into a whole submanifold. This “stretching out” is the symmetry; all optimal models on this submanifold are secretly the same.
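A minimal worked example of the “stretching out” (my own, not from the comment): the one-parameter loss L(w) = (w − 1)² has the single minimizer w = 1, but reparametrizing with two parameters so that L(a, b) = (a + b − 1)² stretches that point into the whole line a + b = 1, and the Hessian picks up an exactly flat direction along it.

```python
import numpy as np

# Worked example (mine): overparametrization "stretches" an isolated minimizer
# into a submanifold. L(a, b) = (a + b - 1)^2 is minimized on the line a + b = 1;
# its Hessian [[2, 2], [2, 2]] has eigenvalues {0, 4}, and the zero eigenvalue's
# eigenvector, proportional to (1, -1), is the free direction along that line.
H = np.array([[2.0, 2.0], [2.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(H)
print("Hessian eigenvalues:", eigvals)     # [0., 4.]
print("flat direction:", eigvecs[:, 0])    # ~ (1, -1) / sqrt(2)
```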
One situation where this heuristic fails is if there are other types of symmetry, like rotation. Then, applying the symmetry to a global minimizer can give you other global minimizers which are not connected to each other. In this case, “modding out by the symmetry” does not decrease the dimension; instead, taking the quotient by the symmetry group gives you a quotient space of the same dimension. I’m guessing these types of situations are more common in “artificial” datasets which have not modded out all the obvious symmetries yet.
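A matching minimal example of the failure mode (again my own toy, not from the comment): with the multiplicative reparametrization L(a, b) = (ab − 1)², the global minimizers form the hyperbola ab = 1, which has two connected components swapped by the discrete sign-flip symmetry (a, b) ↦ (−a, −b); any path between them must pass through a = 0, where the loss is 1, so the components really are disconnected, and quotienting by the symmetry identifies the two copies.

```python
import numpy as np

# Toy example (mine): a discrete symmetry producing disconnected global
# minimizers. L(a, b) = (a*b - 1)^2 is minimized on the hyperbola a*b = 1,
# whose two branches (both coordinates positive / both negative) are swapped
# by the sign flip (a, b) -> (-a, -b). Every continuous path from (1, 1) to
# (-1, -1) must cross a = 0, where the loss is 1.
loss = lambda a, b: (a * b - 1.0) ** 2
for t in np.linspace(0.0, 1.0, 5):
    a = b = (1 - t) * 1.0 + t * (-1.0)                     # straight path between the branches
    print(f"a = b = {a:+.2f}   loss = {loss(a, b):.2f}")   # peaks at 1 at the midpoint
```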