I ended up reading another paper on double descent:
More Data Can Hurt for Linear Regression: Sample-wise Double Descent (Preetum Nakkiran) (summarized by Rohin): This paper demonstrates the presence of double descent (in the size of the dataset) for unregularized linear regression. In particular, we assume that each data point x is a d-dimensional vector whose entries are independent samples from N(0, σ^2), and the output is y = β·x + ε. Given a dataset of (x, y) pairs, we would like to estimate the unknown β under the mean squared error loss, with no regularization.
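As a minimal sketch of this setup (the function names, the feature and noise scales, and the use of the pseudoinverse are illustrative choices, not taken from the paper):

```python
import numpy as np

def generate_data(n, d, beta, feature_std, noise_std, rng):
    """Sample n points: feature entries ~ N(0, feature_std^2), labels y = <beta, x> + eps."""
    X = rng.normal(scale=feature_std, size=(n, d))   # each entry of x drawn independently
    eps = rng.normal(scale=noise_std, size=n)        # label noise
    return X, X @ beta + eps

def min_norm_fit(X, y):
    """Unregularized least-squares fit; the pseudoinverse returns the minimum-norm solution when d > n."""
    return np.linalg.pinv(X) @ y
```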
In this setting, when the dimensionality d of the space (and thus the number of parameters in β) is equal to the number of training points n, the training inputs are linearly independent with probability 1, and so there is exactly one β that solves the n linearly independent equations of the form β·x = y. However, such a β must also fit the noise variables ε, which means that it can be drastically overfitted, with very high norm. For example, imagine β = [1, 1], and the data points are x = (-1, 3) with y = 3 and x = (0, 1) with y = 0 (so the observations had errors of +1 and −1 respectively). The estimate will be β = [-3, 0], which is going to generalize very poorly.
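A quick numerical check of this two-dimensional example (just encoding the data points above):

```python
import numpy as np

X = np.array([[-1.0, 3.0],
              [ 0.0, 1.0]])   # the two training inputs
y = np.array([3.0, 0.0])      # noisy observations (true beta = [1, 1], errors +1 and -1)

beta_hat = np.linalg.solve(X, y)   # unique solution when n == d and X is invertible
print(beta_hat)                    # [-3.  0.], far from the true [1, 1]
```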
As we decrease the number of training points n, so that d > n, there are infinitely many settings of the d parameters of β that satisfy the n linearly independent equations, and gradient descent (initialized at zero) naturally converges to the one with minimum norm, even without regularization. This limits how bad the test error can be. Similarly, as we increase the number of training points, so that d < n, there are too many constraints for β to satisfy exactly, and so the estimate ends up primarily modeling the signal rather than the noise, and generalizes well.
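Putting the pieces together, a short sweep over the dataset size n with the minimum-norm estimator should show the test error peaking near the interpolation threshold n = d (a sketch; the dimension, noise level, and grid of n values below are arbitrary, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d, noise_std = 50, 0.5
beta = rng.normal(size=d)

# Large noiseless test set to measure how well the estimate recovers the signal.
X_test = rng.normal(size=(10_000, d))
y_test = X_test @ beta

for n in [10, 25, 45, 50, 55, 100, 500]:
    X = rng.normal(size=(n, d))
    y = X @ beta + rng.normal(scale=noise_std, size=n)
    beta_hat = np.linalg.pinv(X) @ y                      # minimum-norm least squares
    test_mse = np.mean((X_test @ beta_hat - y_test) ** 2)
    print(f"n={n:4d}  test MSE={test_mse:9.3f}  ||beta_hat||={np.linalg.norm(beta_hat):7.2f}")
```

The test error should be moderate for n well below d (the minimum-norm estimate has small norm, so it misses signal but doesn't blow up), spike around n = d where the unique interpolating solution can have very large norm, and fall again as n grows past d.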
Rohin’s opinion: Basically what’s happening here is that at the interpolation threshold, the model is forced to memorize noise, and it has only one way of doing so, which need not generalize well. However, past the interpolation threshold, when the model is overparameterized, there are many models that successfully memorize noise, and gradient descent “correctly” chooses one with minimum norm. This fits into the broader story being told in other papers: the data has noise and/or misspecification, and at the interpolation threshold the model fits the noise in a way that doesn’t generalize, while past the interpolation threshold it fits the noise in a way that does generalize. Here that’s happening because gradient descent chooses the minimum norm estimator that fits the noise; perhaps something similar is happening with neural nets.
This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don’t see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training “fixes” the weights to memorize noise in a different way that generalizes better. While I can’t rule it out, this seems rather implausible to me. (Note that regularization is not such an explanation, because regularization applies throughout training, and doesn’t “come into effect” after the interpolation threshold.)