Also, isn’t it the case that submersions never have (local or global) minima?
I agree, and believe that the post should not have mentioned submersions.
Pretty nice-looking loss functions end up not having manifolds as their minimizing sets, like x^2 y^2, whose zero set is the union of the two coordinate axes and so fails to be a manifold at the origin. I have a hard time reasoning about whether this is typical or atypical, though. I don’t even have an intuition for why the global minimizer isn’t (nearly always) just a point. Any explanation for that observation in practice?
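For what it’s worth, here is a quick sympy sketch of that x^2 y^2 example (just checking the degenerate point, nothing deeper):

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**2 * y**2  # zero set = union of the two coordinate axes

grad = [sp.diff(f, v) for v in (x, y)]
hess = sp.hessian(f, (x, y))

# At the origin, where the two axes cross, both the gradient and the
# Hessian vanish, so the submersion / regular-value criterion gives no
# information there -- consistent with the zero set failing to be a
# manifold at that point.
print([g.subs({x: 0, y: 0}) for g in grad])   # [0, 0]
print(hess.subs({x: 0, y: 0}))                # Matrix([[0, 0], [0, 0]])
```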
I agree that the typical function has only one (or zero) global minimizer. But in the case of overparametrized smooth neural networks, it can be typical that zero loss is achievable. Then the set of weights that achieve zero loss is typically a manifold, not a single point.
Some intuition: consider linear regression with more parameters than data points. Then the typical outcome is a perfect fit of the data, and the set of solutions is a (possibly high-dimensional) affine subspace. We should expect the same to be true for smooth nonlinear neural networks, because locally they are approximately linear.
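A minimal numpy sketch of that intuition, with made-up shapes (5 data points, 20 parameters) rather than anyone’s actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_data, n_params = 5, 20                  # more parameters than data points
X = rng.standard_normal((n_data, n_params))
y = rng.standard_normal(n_data)

w0 = np.linalg.pinv(X) @ y                # minimum-norm exact fit
print(np.allclose(X @ w0, y))             # True: zero training loss

# Any null-space direction of X gives another exact fit, so the zero-loss
# set is a (n_params - n_data)-dimensional affine subspace, not a point.
_, _, Vt = np.linalg.svd(X)
w1 = w0 + Vt[-1]                          # last right singular vector lies in the null space
print(np.allclose(X @ w1, y))             # True as well
```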
Note that the above does not hold when we add, e.g., an L2 penalty on the weights: then I expect there to be a single global minimizer in the typical case.
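And in the same toy setup, a sketch of how the penalty removes the degeneracy (ridge regression, again with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))          # same overparametrized setup as above
y = rng.standard_normal(5)

# With an L2 penalty, ||Xw - y||^2 + lam * ||w||^2 is strictly convex in w,
# so there is exactly one global minimizer, given by the ridge normal equations.
lam = 1e-2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(20), X.T @ y)
print(np.linalg.norm(X @ w_ridge - y))    # small but nonzero: exact fit traded for uniqueness
```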
Thank you so much for this suggestion, tgb and harfe! I completely agree, and this was entirely my error in our team’s collaborative post. The fact that the level sets of submersions are nice submanifolds has nothing to do with the level set of global minimizers.
I think we will be revising this post in the near future to reflect this and other errors.
(For example, the Hessian tells you which directions have zero second-order penalty to the loss, but it doesn’t necessarily tell you about higher-order penalties, which is something I forgot to mention. A direction that looks like zero-loss according to the Hessian may not actually be zero-loss if it incurs, say, a fourth-order penalty; that could only be probed by a tensor of fourth derivatives. But I think a heuristic argument suggests that a zero-eigenvalue direction of the Hessian should almost always be an actual zero-loss direction. Let me know if you buy this!)
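For concreteness, the kind of situation I have in mind looks like this toy function (not one of our actual losses):

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**2 + y**4   # toy loss with a quartic, rather than flat, direction

# At the minimizer (0, 0) the Hessian has a zero eigenvalue in the
# y-direction, so to second order y looks like a zero-loss direction...
print(sp.hessian(f, (x, y)).subs({x: 0, y: 0}))   # Matrix([[2, 0], [0, 0]])

# ...but the loss still rises along y, just at fourth order, so it is not
# actually a flat direction.
print(f.subs({x: 0, y: sp.Rational(1, 10)}))      # 1/10000
```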