There’s a summary here, but to have a somewhat more accessible version:
If you have a perceptron (aka linear neural net) with parameter vector θ of length N, predicting a single number as the output, then the possible parameterizations of the neural network are given by R^N.
(R represents the reals, apparently the Latex on this site doesn’t support the normal symbol.)
Each (x,y) data point enforces the constraint θ⋅x=y. This is a single linear equation: if you imagine partitioning R^N based on the possible values of θ⋅x, there are in some sense R^1 possible partitions, and each such partition in some sense contains R^(N−1) points. More formally, each such partition has dimension N−1, including the one where θ⋅x=y.
If you have k training data points with k<N, then you have k such linear equations. If these are “totally different constraints” (i.e. linearly independent), then the resulting “partition that does best on the training data” has dimension N−k. This means that for any minimum of the training loss, there will be N−k orthogonal directions in which you could move with no change in how well you do on the training data. There could be even more such directions if the constraints are not “totally different” (i.e. they are linearly dependent).
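To make the counting concrete, here is a minimal numpy sketch (the sizes, random data, and variable names are just illustrative assumptions, not anything from the original argument): the flat directions at a zero-loss θ are exactly the null space of the k×N data matrix, so with linearly independent rows there are N−k of them.

```python
import numpy as np

N, k = 10, 4                       # number of parameters vs. number of data points (k < N)
rng = np.random.default_rng(0)

X = rng.normal(size=(k, N))        # each row is a data point x_i
y = rng.normal(size=k)             # targets y_i

# One particular zero-training-loss solution of X @ theta = y.
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Directions along which theta can move without changing any prediction
# theta . x_i are exactly the null space of X.
rank = np.linalg.matrix_rank(X)
print("flat directions at the minimum:", N - rank)   # N - k = 6 for generic data

# Sanity check: perturbing theta_star along a null-space direction leaves
# the training predictions (and hence the training loss) unchanged.
_, _, Vt = np.linalg.svd(X)
null_dir = Vt[-1]                  # a right singular vector with X @ null_dir ~ 0
assert np.allclose(X @ (theta_star + 3.0 * null_dir), X @ theta_star)
```

Duplicating a row of X (a “redundant” data point) would lower the rank and so add an extra flat direction, which is the situation discussed below.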
If you now think of a neural net instead of a perceptron, it turns out that basically all of this reasoning just continues to work, with only one minor change: it applies only in the neighborhood of a given θ, rather than globally in all of R^N. So, at any point, there are at least N−k orthogonal directions in which you can take an infinitesimally small step without changing how well you do on the training data. But it might be different at different places: maybe for one local minimum there are N−k such orthogonal directions, while for a different one there are N−k+5 such orthogonal directions. Intuitively, since there are more directions you can go in where the loss stays at a minimum in the second case, that point is a “flatter basin” in the loss landscape. Also intuitively, in the latter case 5 of the data points “didn’t matter” in that you’d have had the same constraints (at that point) without them, and so this is kinda sorta like “information loss”.
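Here is a corresponding sketch of the local version for a nonlinear net (again with made-up sizes and a toy two-layer tanh network of my own choosing): the infinitesimal flat directions at a given θ are the null space of the Jacobian of the training predictions with respect to θ, so the count is N minus the rank of that Jacobian, which is at least N−k.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hidden = 3, 4
N = d_in * d_hidden + d_hidden     # total parameter count (here 16)
k = 5                              # number of training points, k < N

X = rng.normal(size=(k, d_in))

def predict(theta, X):
    """Toy 2-layer tanh net with a scalar output; parameters flattened into theta."""
    W = theta[:d_in * d_hidden].reshape(d_in, d_hidden)
    v = theta[d_in * d_hidden:]
    return np.tanh(X @ W) @ v      # one prediction per training point, shape (k,)

theta = rng.normal(size=N)

# Finite-difference Jacobian of the k predictions with respect to the N parameters.
eps = 1e-6
base = predict(theta, X)
J = np.zeros((k, N))
for j in range(N):
    bumped = theta.copy()
    bumped[j] += eps
    J[:, j] = (predict(bumped, X) - base) / eps

# Locally flat directions = null space of J: at least N - k of them, and more
# when some data points impose (locally) redundant constraints at this theta.
rank = np.linalg.matrix_rank(J, tol=1e-4)
print("locally flat directions at this theta:", N - rank)
```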
Why does this matter? Some people think SGD is more likely to find points in flatter basins because the flatter basins are in some sense bigger. I think there’s some empirical evidence pointing in this direction but it hasn’t seemed conclusive to me, though I don’t pay that much attention to this area and could easily just not know about some very good evidence that would convince me. Anyway, if this were true, you might hope to understand what kinds of policies SGD tends to find (e.g. deceptive vs not) by understanding basin flatness better.
Also intuitively, in the latter case 5 of the data points “didn’t matter” in that you’d have had the same constraints (at that point) without them, and so this is kinda sorta like “information loss”.
I am confused: how can this be “information loss” when we are assuming that, due to linear dependence of the data points, we necessarily have 5 extra dimensions where the loss is the same? If 5 of the data points “didn’t matter”, that shouldn’t count as “information loss”; it seems more like “redundant data, ergo no information transmitted”.
I agree “information loss” seems kinda sketchy as a description of this phenomenon, it’s not what I would have chosen.