Rohin Shah comments on rohinmshah’s Shortform

Rohin Shah Jan 23, 2020, 11:51 PM
In my double descent newsletter, I said:
This fits into the broader story being told in other papers that what’s happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn’t generalize, and after the interpolation threshold it fits the noise in a way that does generalize. [...]

This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don’t see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training “fixes” the weights to memorize noise in a different way that generalizes better. While I can’t rule it out, this seems rather implausible to me. (Note that regularization is not such an explanation, because regularization applies throughout training, and doesn’t “come into effect” after the interpolation threshold.)
One possible response is that this explanation could apply to training time as well: typical loss functions like cross-entropy loss and squared error loss penalize confident mistakes very strongly, so initially the optimization is concerned with getting everything right, and only later can it be concerned with regularization.
I don’t buy this argument either. I definitely agree that cross-entropy loss penalizes confident mistakes very highly, and has a very large gradient on them, and so initially in training most of the gradient will go towards reducing confident mistakes. However, you can get out of this regime simply by predicting the frequencies of each class (e.g. uniform for MNIST). If there are $N$ classes, the worst case for this predictor is when the classes are all equally likely, in which case the average loss per data point is $-\ln(1/N) = \ln N \approx 2.3$ when $N = 10$ (as for CIFAR-10, which is what their experiments were done on). That is not a good loss value, but it does seem like regularization should already start having an effect at that point. This is a really stupid and simple classifier to learn, and we’d expect the neural net to do at least this well very early in training, well before it reaches the interpolation threshold / critical regime, which is where it gets ~perfect training accuracy.
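As a quick check on the arithmetic above, here is a minimal sketch that computes the cross-entropy of the uniform predictor for 10 classes and compares it to a confident mistake. The helper name `cross_entropy`, the label `3`, and the probability `1e-6` assigned to the true class are illustrative choices, not anything from the paper.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy loss in nats for one example whose true class is `label`."""
    return -math.log(probs[label])

num_classes = 10  # as for MNIST or CIFAR-10
label = 3         # arbitrary true class, chosen for illustration

# Predicting the class frequencies when all classes are equally likely.
uniform = [1.0 / num_classes] * num_classes
uniform_loss = cross_entropy(uniform, label)

# A confident mistake: almost no probability mass on the true class
# (1e-6 is an illustrative value), the rest spread over the wrong classes.
eps = 1e-6
confident_wrong = [eps if k == label else (1 - eps) / (num_classes - 1)
                   for k in range(num_classes)]
confident_loss = cross_entropy(confident_wrong, label)

print(f"uniform predictor loss: {uniform_loss:.4f}")  # ln(10), about 2.3026
print(f"confident mistake loss: {confident_loss:.4f}")
```

The uniform loss does not depend on which class is correct, so it is also the average loss over any dataset with 10 classes.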
There is a much stronger argument in the case of L2 regularization on MLPs and CNNs with relu activations. Presumably, if the problem is that the cross-entropy “overwhelms” the regularization initially, then we should also see double descent if we first train only on cross-entropy, and then train with L2 regularization. However, this can’t be true. When training on just L2 regularization, the gradient descent update is:
$w \leftarrow w - \lambda w = (1 - \lambda) w = c w$ for some constant $c$.
For MLPs with relu activations and no biases, if you multiply all the weights by a positive constant $c$ (which holds here whenever $\lambda < 1$), the logits get multiplied by $c^{d}$ (where $d$ is the depth of the network), no matter what the input is. Scaling all the logits by the same positive factor does not change which logit is largest, so the predicted class stays the same. This means that the train/test error cannot be affected by L2 regularization alone, and so you can’t see a double descent on test error in this setting. (This doesn’t eliminate the possibility of double descent on test loss, since a change in the magnitude of the logits does affect the cross-entropy, but the OpenAI paper shows double descent on test error as well, and that provably can’t happen in the “first train to zero error with cross-entropy and then regularize” setting.)
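The scaling argument can be checked numerically. The following is a rough sketch in plain Python, assuming a randomly initialized bias-free relu MLP; the layer widths, the seed, the value `lam = 0.1`, and the label `0` are all hypothetical choices made for illustration. It applies one L2-only update to every weight and compares the logits before and after.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def logits(weights, x):
    """Bias-free MLP with relu after every layer except the last."""
    h = x
    for W in weights[:-1]:
        h = relu(matvec(W, h))
    return matvec(weights[-1], h)

def cross_entropy(z, label):
    """Softmax cross-entropy in nats, computed stably from logits."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[label]

def argmax(v):
    return max(range(len(v)), key=lambda i: v[i])

rng = random.Random(0)
sizes = [8, 16, 16, 10]  # hypothetical widths: input 8, two hidden layers, 10 classes
weights = [[[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
depth = len(weights)     # d = 3 weight matrices

lam = 0.1                # illustrative regularization step size
c = 1 - lam              # one L2-only step: w <- w - lam * w = c * w
shrunk = [[[c * w for w in row] for row in W] for W in weights]

x = [rng.gauss(0, 1) for _ in range(sizes[0])]
z = logits(weights, x)
z_shrunk = logits(shrunk, x)

print("logit ratio equals c**d:",
      all(math.isclose(b, c ** depth * a, rel_tol=1e-9, abs_tol=1e-9)
          for a, b in zip(z, z_shrunk)))
print("predicted class unchanged:", argmax(z) == argmax(z_shrunk))
print("loss before / after:", cross_entropy(z, 0), cross_entropy(z_shrunk, 0))
```

The predicted class, and hence the error, is unchanged by the update, while the cross-entropy loss generally does change, which matches the distinction between test error and test loss drawn above.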

It is possible that double descent doesn’t happen for MLPs with relu activations and no biases, but given how many other settings it seems to happen in, I would be surprised.