I’m not very familiar with singularities, forgive some potentially stupid questions.
A singularity here is defined as where the tangent is ill-defined, is this just saying where the lines cross? In other words, that places where loss valleys intersect tend to generalize?
If true, what is a good intuition to have around loss valleys? Is it reasonable to think of loss valleys kind of as their own heuristic functions?
For example, if you have a dataset with height and weight and are trying to predict life expectancy, one heuristic might be that if weight/height > X then predict lower life expectancy. My intuition reading is that all sets of weights that implement this heuristic would correspond to one loss valley.
If we think about some other loss valley, maybe one that captures underweight people where weight/height < Z, then the place where these loss valleys intersect would correspond to a neural network that predicts lower life expectancy for both overweight and underweight people. Intuitively it makes sense that this would correspond to better model generalization, is that on the right track?
But to me it seems like these valleys would be additive, i.e. the place where they intersect should be lower loss than the basin of either valley on its own. This is because our crossing point should create good predictions for both overweight and underweight people, whereas either valley on its own should only create good predictions for one of those two sets. However, in the post the crossing points are depicted as having the same loss as either valley has on its own, is this intentional or do you think there ought to be a dip where valleys meet?
I’m not very familiar with singularities, forgive some potentially stupid questions.
A singularity here is defined as where the tangent is ill-defined, is this just saying where the lines cross? In other words, that places where loss valleys intersect tend to generalize?
If true, what is a good intuition to have around loss valleys? Is it reasonable to think of loss valleys kind of as their own heuristic functions?
For example, if you have a dataset with height and weight and are trying to predict life expectancy, one heuristic might be that if weight/height > X then predict lower life expectancy. My intuition reading is that all sets of weights that implement this heuristic would correspond to one loss valley.
If we think about some other loss valley, maybe one that captures underweight people where weight/height < Z, then the place where these loss valleys intersect would correspond to a neural network that predicts lower life expectancy for both overweight and underweight people. Intuitively it makes sense that this would correspond to better model generalization, is that on the right track?
But to me it seems like these valleys would be additive, i.e. the place where they intersect should be lower loss than the basin of either valley on its own. This is because our crossing point should create good predictions for both overweight and underweight people, whereas either valley on its own should only create good predictions for one of those two sets. However, in the post the crossing points are depicted as having the same loss as either valley has on its own, is this intentional or do you think there ought to be a dip where valleys meet?