This is a nice write-up, thanks for sharing it.

It seems like it’s possible to design models that always have this property. For any model M, consider two copies of it, M’ and M″. We can construct a third model N composed of M’ and M″ plus a single extra weight p (with 0 ≤ p ≤ 1). Define N’s output by running the input through both M’ and M″, returning the result of M’ with probability p and the result of M″ otherwise. N achieves minimal loss if and only if one of the following holds:
M’ and M″ are both individually optimal and 0 < p < 1,
M’ is optimal and p = 1 (M″ arbitrary), or
M″ is optimal and p = 0 (M’ arbitrary).
Then to get a path between any two optimal weight settings, we just move to p = 0, modify M″ as desired, then move to p = 1 and modify M’ as desired, then set p as desired. I think there are a few more details to check; it probably depends on some properties of the loss function, for example that it depends only on the model’s output, not directly on the weights.
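A minimal sketch of the construction (the callable interface and class name here are hypothetical, just to make the idea concrete):

```python
import numpy as np

class Mixture:
    """N = (M', M'', p): run the input through one of two copies of a
    model, choosing M' with probability p and M'' otherwise."""

    def __init__(self, m_prime, m_double_prime, p):
        assert 0.0 <= p <= 1.0
        self.m_prime = m_prime
        self.m_double_prime = m_double_prime
        self.p = p
        self._rng = np.random.default_rng()

    def __call__(self, x):
        # Choose M' with probability p, otherwise M''.
        if self._rng.random() < self.p:
            return self.m_prime(x)
        return self.m_double_prime(x)

# Zero-loss path between two optimal weight settings: set p = 0 and
# modify m_double_prime freely, then set p = 1 and modify m_prime
# freely, then set p to any desired value.
```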
Unfortunately this doesn’t help us at all in this case! I don’t think it’s any more interpretable than M alone. So maybe this example isn’t useful.
Also, don’t submersions never have (local or global) minima? Their derivative is surjective, so it can never be zero. Pretty nice-looking loss functions end up not having manifolds for their minimizing sets, like x^2 y^2, whose minimizing set is the union of the two coordinate axes and hence not a manifold at the origin. I have a hard time reasoning about whether this is typical or atypical, though. I don’t even have an intuition for why the global minimizer isn’t (nearly always) just a point. Any explanation for that observation in practice?
Also, don’t submersions never have (local or global) minima?
I agree, and believe that the post should not have mentioned submersions.
Pretty nice-looking loss functions end up not having manifolds for their minimizing sets, like x^2 y^2. I have a hard time reasoning about whether this is typical or atypical though. I don’t even have an intuition for why the global minimizer isn’t (nearly always) just a point. Any explanation for that observation in practice?
I agree that the typical function has only one (or zero) global minimizer. But in the case of overparametrized smooth neural networks it is typical that zero loss can be achieved. Then the set of weights that achieve zero loss is typically a manifold, not a single point.
Some intuition: Consider linear regression with more parameters than data points. Then the typical outcome is a perfect fit of the data, and the set of solutions is a (possibly high-dimensional) linear manifold. We should expect the same for smooth nonlinear neural networks, because locally they are linear.
Note that the above does not hold when we, e.g., add an l2 penalty on the weights: then I expect there to be a single global minimizer in the typical case.
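A quick NumPy sketch of both points, on arbitrary illustrative data: an underdetermined linear regression has a whole affine subspace of zero-loss solutions, and adding an l2 penalty collapses it to a single point.

```python
import numpy as np
from scipy.linalg import null_space

# Underdetermined linear regression: 3 data points, 5 parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
y = rng.normal(size=3)

# lstsq returns one zero-loss solution (the minimum-norm one).
w0, *_ = np.linalg.lstsq(X, y, rcond=None)

# Adding any null-space vector preserves the perfect fit, so the set of
# zero-loss weights is a 2-dimensional affine subspace of R^5, not a point.
N = null_space(X)                                      # shape (5, 2)
w1 = w0 + N @ np.array([1.0, -2.0])
print(np.allclose(X @ w0, y), np.allclose(X @ w1, y))  # True True

# With an l2 penalty the objective becomes strictly convex, so the
# minimizer is unique: w* = (X^T X + lam I)^{-1} X^T y.
lam = 1e-2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
```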
Thank you so much for this suggestion, tgb and harfe! I completely agree, and this was entirely my error in our team’s collaborative post. The fact that the level sets of submersions are nice submanifolds has nothing to do with the level set of global minimizers.
I think we will be revising this post in the near future to reflect this and other errors.
(For example, the Hessian tells you which directions have zero second-order penalty to the loss, but it doesn’t necessarily tell you about higher-order penalties, which is something I forgot to mention. A direction that looks like zero-loss when looking at the Hessian may not actually be zero-loss if it applies, say, a fourth-order penalty to the loss. This could only be probed by the tensor of fourth derivatives. But I think a heuristic argument suggests that a zero-eigenvalue direction of the Hessian should almost always be an actual zero-loss direction. Let me know if you buy this!)
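A toy illustration of that failure mode (my own example, not from the post): the loss L(x, y) = x^2 + y^4 has a zero Hessian eigenvalue along y at its minimum, yet y is not a zero-loss direction.

```python
import numpy as np

# Toy loss with a deceptive flat direction: L(x, y) = x**2 + y**4.
def loss(w):
    x, y = w
    return x ** 2 + y ** 4

# The Hessian at the minimum (0, 0) is diag(2, 0): the y-direction has a
# zero eigenvalue, so to second order it looks like a zero-loss direction.
H = np.diag([2.0, 0.0])
print(np.linalg.eigvalsh(H))         # [0. 2.]

# But stepping along y still increases the loss, at fourth order:
print(loss(np.array([0.0, 0.1])))    # 0.0001 > 0
```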