Yep, regularization tends to break these symmetries.
I think the best way to think of this is that it causes the valleys to become curved, i.e., regularization helps the neural network navigate the loss landscape. In its absence, movement across these valleys relies on the stochasticity of SGD, and the distance covered by that kind of diffusion grows only slowly, with the square root of time.
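To make the "curved valley" picture concrete, here is a minimal toy sketch. It is my own illustration, not something from the post: the two-parameter model `a * b`, the target value 1, the weight-decay strength `lam`, and the helper names `grad` and `descend` are all assumptions chosen for simplicity. The unregularized loss `(a*b - 1)^2` has a flat zero-loss valley along `a*b = 1`; adding an L2 penalty tilts that valley so plain gradient descent slides along it toward the balanced point `a = b`.

```python
def grad(a, b, lam):
    """Gradient of the toy loss (a*b - 1)^2 + lam * (a^2 + b^2)."""
    r = a * b - 1.0  # residual of the data-fitting term
    return 2 * r * b + 2 * lam * a, 2 * r * a + 2 * lam * b


def descend(a, b, lam, lr=0.01, steps=50_000):
    """Plain (noise-free) gradient descent from (a, b)."""
    for _ in range(steps):
        ga, gb = grad(a, b, lam)
        a, b = a - lr * ga, b - lr * gb
    return a, b


# Start on the zero-loss valley a*b = 1, far from the balanced point.
a0, b0 = 4.0, 0.25

free = descend(a0, b0, lam=0.0)   # no regularization: gradient is zero, nothing moves
reg = descend(a0, b0, lam=0.01)   # L2 penalty: slides along the valley toward a = b

print("without regularization:", free)
print("with regularization:   ", reg)
```

Without the penalty the gradient vanishes everywhere on the valley, so in this noise-free setup the parameters never leave their starting point; only SGD noise could move them. With the penalty, both parameters end up near `sqrt(1 - lam)`, which is the balanced minimum of this particular toy loss.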
That said, regularization is only a convex change to the landscape that doesn’t change the important geometrical features. In its presence, we should still expect the singularities of the corresponding regularization-free landscape to have a major macroscopic effect.
There are also continuous zero-loss deformations in the loss landscape that are not affected by regularization because they aren’t a feature of the architecture but of the “truth”. (See the thread with tgb for a discussion of this, where we call these “Type B”.)