This OpenAI blog post provides empirical evidence for the existence of the _double descent_ phenomenon, proposed in an earlier paper summarized below. Define the _effective model complexity_ (EMC) of a training procedure and a dataset to be the maximum training set size for which the training procedure achieves a _train_ error of at most ε (they use ε = 0.1). Suppose you start with a small, underparameterized model with low EMC. Initially, as you increase the EMC, the model achieves a better fit to the data, leading to lower test error. However, once the EMC is approximately equal to the actual training set size, the model can “just barely” fit the training set, and the test error can increase or decrease. Finally, as you increase the EMC even further, so that the training procedure can easily fit the training set, the test error will once again _decrease_, causing a second descent in test error. This unifies the perspectives of statistics, where larger models are predicted to overfit, leading to increasing test error with higher EMC, and modern machine learning, where the common empirical wisdom is to make models as big as possible and expect test error to keep decreasing.
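To make the definition concrete, here is a minimal sketch of how one might estimate the EMC empirically; `train_and_get_train_error` is a hypothetical stand-in for whatever training procedure is being measured, not anything from the paper.

```python
def effective_model_complexity(train_and_get_train_error, dataset, eps=0.1):
    """Estimate the EMC: the largest training set size n for which the
    training procedure still reaches train error at most eps.

    `train_and_get_train_error(samples)` is assumed to train the model on
    `samples` and return the resulting train error.
    """
    emc = 0
    for n in range(1, len(dataset) + 1):
        if train_and_get_train_error(dataset[:n]) <= eps:
            emc = n  # the procedure can still (nearly) fit n points
    return emc
```

(A real measurement would presumably average over random subsets and training seeds; this is just to pin down the quantity being varied.)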
They show that this pattern arises in a variety of simple settings. As you increase the width of a ResNet up to 64, you can observe double descent in the final test error of the trained model. In addition, if you fix a large overparameterized model and change the number of epochs for which it is trained, you see another double descent curve, which means that simply training longer can actually _correct overfitting_. Finally, if you fix a training procedure and change the size of the dataset, you can see a double descent curve as the size of the dataset decreases. This implies that there are regimes in which _more data is worse_, because the training procedure is in the critical interpolation region where test error can increase. Note that most of these results only occur when there is _label noise_ present, that is, when some proportion of the training set (usually 10-20%) is given random incorrect labels. Some results still occur without label noise, but the resulting double descent peak is quite small. The authors hypothesize that label noise leads to the effect because double descent occurs when the model is misspecified, though it is not clear to me what it means for a model to be misspecified in this context.
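For concreteness, here is a minimal sketch (my own construction, not the paper's code) of the kind of label noise being described: a fixed fraction of the training labels is replaced with uniformly random labels.

```python
import numpy as np

def add_label_noise(labels, noise_fraction=0.15, num_classes=10, seed=0):
    """Return a copy of `labels` where a random `noise_fraction` of entries
    has been replaced by uniformly random class labels (which may
    occasionally coincide with the true label)."""
    rng = np.random.default_rng(seed)
    noisy = np.array(labels)
    idx = rng.choice(len(noisy), size=int(noise_fraction * len(noisy)), replace=False)
    noisy[idx] = rng.integers(0, num_classes, size=len(idx))
    return noisy
```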
Opinion:
While I previously didn’t think that double descent was a real phenomenon (see summaries later in this email for details), these experiments convinced me that I was wrong and in fact there is something real going on. Note that the settings studied in this work are still not fully representative of typical use of neural nets today; the label noise is the most obvious difference, but also e.g. ResNets are usually trained with higher widths than studied in this paper. So the phenomenon might not generalize to neural nets as used in practice, but nonetheless, there’s _some_ real phenomenon here, which flies in the face of all of my intuitions.
The authors don’t really suggest an explanation; the closest they come is speculating that at the interpolation threshold there’s only ~one model that can fit the data, which may be overfit, but then as you increase model capacity further the training procedure can “choose” from the various models that all fit the data, and that “choice” leads to better generalization. But this doesn’t make sense to me, because whatever is being used to “choose” the better model applies throughout training, and so even at the interpolation threshold the model should have been selected throughout training to be the type of model that generalized well. (For example, if you think that regularization is providing a simplicity bias that leads to better generalization, the regularization should also help models at the interpolation threshold, since you always regularize throughout training.)
Perhaps one explanation could be that in order for the regularization to work, there needs to be a “direction” in the space of model parameters that doesn’t lead to increased training error, so that the model can move along that direction towards a simpler model. Each training data point defines a particular direction in which training error will increase. So, when the number of training points is equal to the number of parameters, the training points just barely cover all of the directions, and then as you increase the number of parameters further, that starts creating new directions that are not constrained by the training points, allowing the regularization to work much better. (In fact, the original paper, summarized below, _defined_ the interpolation threshold as the point where the number of parameters equals the size of the training dataset.) However, while this could explain model-wise double descent and training-set-size double descent, it’s not a great explanation for epoch-wise double descent.
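To make this intuition slightly more concrete, at least in the linear case (my own illustration, not from either paper): each training point contributes one constraint, and any direction in the null space of the data matrix leaves the training predictions, and hence the training error, unchanged, so a regularizer is free to move along it. A quick numpy check of how many such free directions exist:

```python
import numpy as np

rng = np.random.default_rng(0)

def num_free_directions(num_params, num_points):
    """For a linear model with random inputs, count the parameter-space
    directions along which all training predictions stay fixed: the
    dimension of the null space of the (num_points, num_params) data matrix."""
    X = rng.standard_normal((num_points, num_params))  # one row per training point
    return num_params - np.linalg.matrix_rank(X)

for p in [5, 10, 20, 40]:
    print(p, num_free_directions(num_params=p, num_points=10))
# With 10 training points there are no free directions until the number of
# parameters exceeds 10 (the interpolation threshold); beyond that there are
# roughly p - 10 of them for the regularizer to exploit.
```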
This paper (Reconciling modern machine learning practice and the bias-variance trade-off) first proposed double descent as a general phenomenon, and demonstrated it in three machine learning models: linear predictors over Random Fourier Features, fully connected neural networks with one hidden layer, and forests of decision trees. Note that they define the interpolation threshold as the point where the number of parameters equals the number of training points, rather than using something like effective model complexity.
For linear predictors over Random Fourier Features, their procedure is as follows: they generate a set of random features, and then find the linear predictor that minimizes the squared loss incurred. If there are multiple predictors that achieve zero squared loss, then they choose the one with the minimum L2 norm. The double descent curve for a subset of MNIST is very pronounced and has a huge peak at the point where the number of features equals the number of training points.
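A minimal sketch of this procedure (my own, using the usual construction of random Fourier features); note that `np.linalg.lstsq` returns the minimum-L2-norm solution when the system is underdetermined, which matches the paper's choice among zero-loss predictors.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, num_features, bandwidth=1.0):
    """Map inputs X of shape (n, d) to (n, num_features) random Fourier features."""
    d = X.shape[1]
    W = rng.standard_normal((d, num_features)) / bandwidth
    b = rng.uniform(0, 2 * np.pi, num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def min_norm_linear_fit(features, y):
    """Least-squares fit; when many predictors achieve zero squared loss,
    lstsq picks the one with minimum L2 norm."""
    beta, *_ = np.linalg.lstsq(features, y, rcond=None)
    return beta

# Usage sketch: vary num_features past the number of training points and
# track test error to reproduce the model-wise double descent curve.
```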
For the fully connected neural networks on MNIST, they make a significant change to normal training: prior to the interpolation threshold, rather than training the networks from scratch, they train them from the final solution found for the previous (smaller) network, but after the interpolation threshold they train from scratch as normal. With this change, you see a very pronounced and clear double descent curve. However, if you always train from scratch, then it’s less clear—there’s a small peak, which the authors describe as “clearly discernible”, but to me it looks like it could be noise.
For decision trees, if the dataset has n training points, they learn decision trees with up to n leaves, and then at that point (the interpolation threshold) they switch to ensembles of decision trees (called forests) to get more expressive function classes. Once again, you can see a clear, pronounced double descent curve.
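Here is a rough scikit-learn sketch of that setup (my construction, not the authors' exact protocol): below the interpolation threshold, grow a single tree with a cap on the number of leaves; past it, get extra capacity by averaging more fully-grown trees.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

def tree_model(complexity, n_train):
    """Sketch of the tree setup: a single tree capped at `complexity` leaves
    until it can interpolate the n_train training points, then ensembles of
    unrestricted trees beyond that point."""
    if complexity <= n_train:
        return DecisionTreeRegressor(max_leaf_nodes=max(complexity, 2))
    # Past the interpolation threshold, extra capacity comes from averaging
    # more fully-grown trees rather than growing one bigger tree.
    return RandomForestRegressor(n_estimators=complexity // n_train)
```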
Opinion:
I read this paper back when summarizing <@Are Deep Neural Networks Dramatically Overfitted?@> and found it uncompelling, and I’m really curious how the ML community correctly seized upon this idea as deserving of further investigation while I incorrectly dismissed it. None of the experimental results in this paper are particularly surprising to me, whereas double descent itself is quite surprising.
In the random Fourier features and decision trees experiments, there is a qualitative difference in the _learning algorithm_ before and after the interpolation threshold, which suffices to explain the curve. With the random Fourier features, we only start regularizing the model after the interpolation threshold; it is not surprising that adding regularization helps reduce test loss. With the decision trees, after the interpolation threshold, we start using ensembles; it is again not at all surprising that ensembles help reduce test error. (See also this comment.) So yeah, if you start regularizing (via L2 norm or ensembles) after the interpolation threshold, that will help your test error, but in practice we regularize throughout the training process, so this should not occur with neural nets.
The neural net experiments also have a similar flavor—the nets before the interpolation threshold are required to reuse weights from the previous run, while the ones after the interpolation threshold do not have any such requirement. When this is removed, the results are much more muted. The authors claim that this is necessary to have clear graphs (where training risk monotonically decreases), but it’s almost certainly biasing the results—at the interpolation threshold, with weight reuse, the test squared loss is ~0.55 and test accuracy is ~80%, while without weight reuse, test squared loss is ~0.35 and test accuracy is ~85%, a _massive_ difference and probably not within the error bars.
Some speculation on what’s happening here: neural net losses are nonconvex and training can get stuck in local optima. A pretty good way to get stuck in a local optimum is to initialize half your parameters to do something that does quite well while the other half are initialized randomly. So with weight reuse we might expect to get stuck in worse local optima. However, it looks like the training losses are comparable between the methods. Maybe what’s happening is that with weight reuse, the half of the parameters that are initialized randomly memorize the training points that the good half of the parameters can’t predict, which doesn’t generalize well but does get low training error. Meanwhile, without weight reuse, all of the parameters end up finding a good model that does generalize well, for whatever reason it is that neural nets do work well.
But again, note that the authors were right about double descent being a real phenomenon, while I was wrong, so take all this speculation with many grains of salt.
Summary of this post:
This post explains deep double descent (in more detail than my summaries), and speculates on its relevance to AI safety. In particular, Evan believes that deep double descent shows that neural nets are providing strong inductive biases that are crucial to their performance—even _after_ getting to ~zero training loss, the inductive biases _continue_ to do work for us, and find better models that lead to lower test loss. As a result, it seems quite important to understand the inductive biases that neural nets use, which seems particularly relevant for e.g. <@mesa optimization and pseudo alignment@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@).
Opinion:
I certainly agree that neural nets have strong inductive biases that help with their generalization; a clear example of this is that neural nets can learn randomly labeled data (which can never generalize to the test set), but nonetheless when trained on correctly labeled data such nets do generalize to test data. Perhaps more surprising here is that the inductive biases help even _after_ fully capturing the data (achieving zero training loss) -- you might have thought that the data would swamp the inductive biases. This might suggest that powerful AI systems will become simpler over time (assuming an inductive bias towards simplicity). However, this is happening in the regime where the neural nets are overparameterized, so it makes sense that inductive biases would still play a large role. I expect that in contrast, powerful AI systems will be severely underparameterized, simply because of _how much data_ there is (for example, the largest GPT-2 model still underfits the data).
> But this doesn’t make sense to me, because whatever is being used to “choose” the better model applies throughout training, and so even at the interpolation threshold the model should have been selected throughout training to be the type of model that generalized well. (For example, if you think that regularization is providing a simplicity bias that leads to better generalization, the regularization should also help models at the interpolation threshold, since you always regularize throughout training.)
The idea—at least as I see it—is that the set of possible models that you can choose between increases with training. That is, there are many more models reachable within n+1 steps of training than there are models reachable within n steps of training. The interpolation threshold is the point at which there are the fewest reachable models with zero training error, so your inductive biases have the fewest choices—past that point, there are many more reachable models with zero training error, which lets the inductive biases be much more pronounced. One way in which I’ve been thinking about this is that ML models overweight the likelihood and underweight the prior, since we train exclusively on loss and effectively only use our inductive biases as a tiebreaker. Thus, when there aren’t many ties to break—that is, at the interpolation threshold—you get worse performance.
> since we train exclusively on loss and effectively only use our inductive biases as a tiebreaker
If that were true, I’d buy the double descent story. But we don’t do that; we regularize throughout training! The loss usually includes an explicit term that penalizes the L2 norm of the weights, and that loss is evaluated and trained against throughout training, and across models, and regardless of dataset size.
It might be that the inductive biases are coming from some source other than regularization (especially since some of the experiments are done without regularization, iirc). But even then, to be convinced of this story, I’d want to see some explanation, in terms of the training dynamics, of how the inductive biases act as a tiebreaker, and why that explanation doesn’t do anything before the interpolation threshold.
Reading your comment again, the first three sentences seem different from the last two sentences. My response above is responding to the last two sentences; I’m not sure if you mean something different by the first three sentences.
I ended up reading another paper on double descent:
More Data Can Hurt for Linear Regression: Sample-wise Double Descent (Preetum Nakkiran) (summarized by Rohin): This paper demonstrates the presence of double descent (in the size of the dataset) for unregularized linear regression. In particular, we assume that each data point x is a vector with independent standard Gaussian entries, and the output is y = β·x + ε, with noise ε drawn from N(0, σ^2). Given a dataset of (x, y) pairs, we would like to estimate the unknown β, under the mean squared error loss, with no regularization.
In this setting, when the dimensionality d of the space (and thus the number of parameters in β) is equal to the number of training points n, the training data points are linearly independent almost always / with probability 1, and so there will be exactly one β that solves the n linearly independent equalities of the form β·x = y. However, such a β must also be fitting the noise variables ε, which means that it could be drastically overfitted, with very high norm. For example, imagine β = [1, 1], and the data points are x = (-1, 3) with y = 3 and x = (0, 1) with y = 0 (the data points had errors of +1 and −1 respectively). The estimate will be β = [-3, 0], which is going to generalize very poorly.
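Checking that toy example numerically (my own verification):

```python
import numpy as np

# True beta = [1, 1]; the two noisy observations from the example above:
# x1 = (-1, 3) with y1 = 3 (error +1), x2 = (0, 1) with y2 = 0 (error -1).
X = np.array([[-1.0, 3.0],
              [0.0, 1.0]])
y = np.array([3.0, 0.0])

beta_hat = np.linalg.solve(X, y)
print(beta_hat)  # [-3.  0.], far from the true beta = [1, 1]
```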
As we decrease the number of training points n, so that d > n, there are infinitely many settings of the d parameters of β that satisfy the n linearly independent equalities, and gradient descent naturally chooses the one with minimum norm (even without regularization). This limits how bad the test error can be. Similarly, as we increase the number of training points, so that d < n, there are too many constraints for β to satisfy exactly, and so it ends up primarily modeling the signal rather than the noise, which generalizes well.
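A minimal simulation of this setting (my sketch, following the setup described above): fix d, sweep the number of training points n, fit with the pseudo-inverse (which is the least-squares solution, and the minimum-norm one when d > n), and watch the test error spike near n = d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, noise_std, n_test, n_trials = 20, 0.5, 1000, 50
beta = rng.standard_normal(d)  # the unknown true parameters

def avg_test_mse(n_train):
    errs = []
    for _ in range(n_trials):
        X = rng.standard_normal((n_train, d))
        y = X @ beta + noise_std * rng.standard_normal(n_train)
        # Pseudo-inverse: the least-squares fit, and the minimum-norm
        # solution when the system is underdetermined (d > n_train).
        beta_hat = np.linalg.pinv(X) @ y
        X_test = rng.standard_normal((n_test, d))
        errs.append(np.mean((X_test @ (beta_hat - beta)) ** 2))
    return np.mean(errs)

for n in [5, 10, 15, 20, 25, 40, 80]:
    print(n, round(avg_test_mse(n), 2))
# The average test error typically peaks sharply around n = d = 20
# (the interpolation threshold) and falls again on either side.
```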
Rohin’s opinion: Basically what’s happening here is that at the interpolation threshold, the model is forced to memorize noise, and it has only one way of doing so, which need not generalize well. However, past the interpolation threshold, when the model is overparameterized, there are many models that successfully memorize noise, and gradient descent “correctly” chooses one with minimum norm. This fits into the broader story being told in other papers: the data has noise and/or misspecification, and the model fits that noise in a way that doesn’t generalize at the interpolation threshold, but in a way that does generalize beyond it. Here that happens because gradient descent chooses the minimum norm estimator that fits the noise; perhaps something similar is happening with neural nets.
This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don’t see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training “fixes” the weights to memorize noise in a different way that generalizes better. While I can’t rule it out, this seems rather implausible to me. (Note that regularization is not such an explanation, because regularization applies throughout training, and doesn’t “come into effect” after the interpolation threshold.)