This is a toy example (I didn’t come up with it with any particular f in mind).
I think the important thing is that the distinction does not make much of a difference in practice. Both correspond to lower effective dimensionality (type A very explicitly, and type B less directly). Both are able to “trap” random motion. And it seems like both somehow help make the loss landscape more navigable.
If you’re interested in interpreting the energy landscape as a loss landscape, x and y would be the parameters, and a and b would be hyperparameters related to things like the learning rate and batch size.