Thank you so much, Thomas and Buck, for reading the post and for your insightful comments!
It is indeed true that some functions have two global minimizers that are not path-connected. Empirically, very overparametrized models which are trained on “non-artificial” datasets (“datasets from nature”?) seem to have a connected Rashomon manifold. It would definitely be helpful to know theoretically why this tends to happen, and when this wouldn’t happen.
One heuristic argument for why two disconnected global minimizers should only show up in “artificial” datasets might go something like this. Given two quantities, one is generically larger than the other, unless there is a symmetry-based reason why they are actually secretly the same quantity. Under this heuristic, a non-overparametrized model’s loss landscape has a global minimum achieved by precisely one point, and potentially some suboptimal local minima as well. Overparametrizing the model makes the suboptimal local minima stop being local minima (by turning them into saddle points?), while the single global minimizer is “stretched out” into a whole submanifold. This “stretching out” is the symmetry: all optimal models on this submanifold are secretly the same.
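Here is a minimal toy illustration of the “stretching out” (my own example, not from the post): take the one-parameter loss l(w) = (w − 1)², whose global minimum is the single point w = 1, and overparametrize by writing w = a·b.

```python
# Toy example: overparametrizing l(w) = (w - 1)^2 via w = a * b.
# The overparametrized loss L(a, b) = (a*b - 1)^2 is globally
# minimized on the entire 1-dimensional submanifold {(a, b): a*b = 1},
# so the unique minimizer w = 1 has been "stretched out" into a curve.

def loss(a, b):
    return (a * b - 1.0) ** 2

# Every point on the hyperbola a*b = 1 achieves the global minimum:
for a in [0.5, 1.0, 2.0, -3.0]:
    b = 1.0 / a
    assert loss(a, b) == 0.0
```

(Amusingly, this particular manifold already has two path components, a > 0 and a < 0, related by the discrete sign-flip symmetry (a, b) ↦ (−a, −b) — a preview of the failure mode discussed next.)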
One situation where this heuristic fails is if there are other, discrete types of symmetry, like permuting hidden units or flipping signs. (A connected continuous symmetry, like rotation, keeps each orbit path-connected, so it cannot disconnect the minimizers on its own.) Then applying such a symmetry to a global minimizer gets you other global minimizers which are not connected to it. In this case, “modding out by the symmetry” does not decrease the dimension; rather, taking the quotient by the (discrete) symmetry group gives you a quotient space of the same dimension. I’m guessing these types of situations are more common in “artificial” datasets which have not yet modded out all the obvious symmetries.
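The permutation symmetry behind the Git Re-Basin paper linked below can be sketched concretely. This is my own toy example, not taken from the paper: a two-hidden-unit network computes exactly the same function after swapping its two units, so every minimizer has a distinct “mirror image” minimizer elsewhere in parameter space.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(params, x):
    # Tiny 1-input, 2-hidden-unit, 1-output network:
    # f(x) = v1*relu(w1*x) + v2*relu(w2*x)
    (w1, v1), (w2, v2) = params
    return v1 * relu(w1 * x) + v2 * relu(w2 * x)

theta = [(0.7, 1.3), (-2.0, 0.4)]    # an arbitrary parameter point
theta_swapped = [theta[1], theta[0]]  # permute the two hidden units

# The two parameter points are distinct, yet compute the same function,
# hence have identical loss on any dataset:
xs = np.linspace(-3.0, 3.0, 101)
assert np.allclose(net(theta, xs), net(theta_swapped, xs))
```

Modding out this discrete permutation symmetry identifies theta with theta_swapped without reducing the dimension of parameter space, matching the quotient-space picture above.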
Edit: Adding a link to “Git Re-Basin: Merging Models modulo Permutation Symmetries,” a relevant paper that has recently been posted on arXiv.