A singularity here is defined as a point where the tangent is ill-defined. Is this just saying where the lines cross? In other words, that places where loss valleys intersect tend to generalize?
Not at all stupid! Yep, crossings are singularities, as are things like cusps and weirder things like tacnodes.
It’s not necessarily saying that these places tend to generalize. It’s that these singularities have a disproportionate impact on the overall tendency of models learning in that landscape to generalize. So these points can impact nearby (and even distant) points.
If true, what is a good intuition to have around loss valleys? Is it reasonable to think of each loss valley as, in a sense, its own heuristic function?
I still find the intuition difficult.
For example, if you have a dataset with height and weight and are trying to predict life expectancy, one heuristic might be that if weight/height > X then predict lower life expectancy. My intuition reading is that all sets of weights that implement this heuristic would correspond to one loss valley.
If we think about some other loss valley, maybe one that captures underweight people where weight/height < Z, then the place where these loss valleys intersect would correspond to a neural network that predicts lower life expectancy for both overweight and underweight people. Intuitively it makes sense that this would correspond to better model generalization, is that on the right track?
But to me it seems like these valleys would be additive, i.e. the place where they intersect should have lower loss than the basin of either valley on its own. This is because the crossing point should give good predictions for both overweight and underweight people, whereas either valley on its own gives good predictions for only one of those two groups. However, in the post the crossing points are depicted as having the same loss as either valley has on its own. Is this intentional, or do you think there ought to be a dip where valleys meet?
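(A toy numeric check of that additive intuition; the quadratic loss shapes are invented purely to give each per-group loss a valley:)

```python
# Hypothetical per-group losses over a 2-parameter model (a, b); the
# quadratic shapes are invented purely for illustration.
def L_over(a, b):   # valley along a = 0: fits the overweight group
    return a ** 2

def L_under(a, b):  # valley along b = 0: fits the underweight group
    return b ** 2

def total(a, b):    # dataset loss, a sum of the per-group losses
    return L_over(a, b) + L_under(a, b)

print(total(0.0, 1.0))  # 1.0 -> in the overweight valley only
print(total(1.0, 0.0))  # 1.0 -> in the underweight valley only
print(total(0.0, 0.0))  # 0.0 -> the intersection is strictly lower: a dip
```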
I like this example! If your model is lifespan(h,w) = f(w/h), then the w-h space is split into lines of constant lifespan (top-left figure). If you have a loss which compares predicted lifespan to true lifespan, this will be constant on those lines as well. The lower overweight and underweight lifespans will be two valleys that intersect at the origin. The loss landscape could, however, look very different, because it measures how good your prediction is, so there could be one loss valley, or two, or several.
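(A minimal sketch of that level-set claim; the dip shape chosen for f is invented:)

```python
import numpy as np

# If the model depends only on the ratio w/h, its prediction is constant
# along any ray w = c*h through the origin.
def f(r):
    # Invented shape: two dips, low lifespan at small and large ratios.
    return -np.exp(-(r - 0.5) ** 2 / 0.1) - np.exp(-(r - 2.0) ** 2 / 0.1)

h = np.linspace(0.1, 2.0, 50)
c = 1.3                                # one ray: w = c*h
ray_predictions = f((c * h) / h)       # w/h == c at every point on the ray
print(np.allclose(ray_predictions, f(c)))  # True: constant along the ray
```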
Suppose you have a different function g(w+h), which also has two valleys (top-right). Yes, if you add the two functions, the minima of the result will be at the intersections. But adding isn't actually representative of the kinds of operations we perform in networks.
For example, compare taking their min: now the valleys cross and form part of the same level sets. It depends very much on the kind of composition. The symmetries I mention can cooperate very well.
From top-left clockwise: f(w/h); g(w+h); f(w/h)+g(w+h); min(f(w/h),g(w+h)).
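(For anyone who wants to reproduce the panels, a minimal sketch; the particular dip shapes for f and g are invented, only the compositions matter:)

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented profiles, each with two dips ("valleys" in the w-h plane).
def f(r):   # depends only on the ratio w/h
    return -np.exp(-(r - 0.5) ** 2 / 0.05) - np.exp(-(r - 2.0) ** 2 / 0.5)

def g(s):   # depends only on the sum w+h
    return -np.exp(-(s - 1.0) ** 2 / 0.05) - np.exp(-(s - 3.0) ** 2 / 0.05)

h = np.linspace(0.1, 2.5, 300)
w = np.linspace(0.1, 2.5, 300)
H, W = np.meshgrid(h, w)

# Panel order matches the caption, "from top-left clockwise".
panels = {
    "f(w/h)": f(W / H),                                     # top-left
    "g(w+h)": g(W + H),                                     # top-right
    "min(f(w/h),g(w+h))": np.minimum(f(W / H), g(W + H)),   # bottom-left
    "f(w/h)+g(w+h)": f(W / H) + g(W + H),                   # bottom-right
}

fig, axes = plt.subplots(2, 2, figsize=(8, 8))
for ax, (title, Z) in zip(axes.flat, panels.items()):
    ax.contourf(H, W, Z, levels=30)
    ax.set_title(title)
    ax.set_xlabel("h")
    ax.set_ylabel("w")
plt.tight_layout()
plt.show()
```

The sum panel shows the additive case (minima at the intersections, the dip asked about above), while the min panel shows the valleys merging into shared level sets with no dip.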