One caveat worth noting about double descent: it only appears if you train for far longer than necessary, i.e. if you “train forever”.
If you regularize with early stopping (stop when the performance on some validation set stops improving), the effect is not present. Since we use early stopping in all realistic settings, performance always improves monotonically with more data / bigger models.
To rephrase, analyzing the weird point where models reach zero training loss will produce confusing results. The early stopping point exhibits no such weird non-monotonic behavior.
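To make the early-stopping criterion above concrete, here is a minimal sketch. The loop, the `patience` parameter, and the loss values are all hypothetical stand-ins for illustration, not any specific experiment: we track validation loss each epoch and stop once it fails to improve for a few epochs, long before training loss reaches zero.

```python
# A minimal sketch of early stopping as a regularizer: stop once
# validation loss has not improved for `patience` epochs.
# All names and numbers here are illustrative, not from the post.

def train_with_early_stopping(epoch_val_losses, patience=3):
    """Return the epoch at which training would stop, given a
    sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, val_loss in enumerate(epoch_val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # Validation loss hasn't improved for `patience` epochs:
            # stop at the best epoch, well before zero training loss.
            return best_epoch
    return best_epoch

# Validation loss that bottoms out at epoch 3 and then rises -- the
# regime where the non-monotonic behavior would show up if we kept
# training toward zero training loss:
losses = [1.0, 0.6, 0.4, 0.35, 0.4, 0.5, 0.7, 0.9]
print(train_with_early_stopping(losses))  # stops at epoch 3
```

The point of the sketch is just that the stopping criterion never lets training reach the interpolation threshold where the confusing behavior lives.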
Evan’s response (copied from a direct message, before I was approved to post here):
It definitely makes sense to me that early stopping would remove the non-monotonicity. I think a broader point which is interesting re double descent, though, is what it says about why bigger models are better. That is, not only can bigger models fit larger datasets, according to the double descent story there’s also a meaningful sense in which bigger models have better inductive biases.
The idea I’m objecting to is that there’s a sharp change from one regime (larger family of models) to the other (better inductive bias). I’d say that both factors smoothly improve performance over the full range of model sizes. I don’t fully understand this yet, and I think it would be interesting to understand how bigger models and better inductive bias (from SGD + early stopping) come together to produce this smooth improvement in performance.