Evan’s response (copied from a direct message, before I was approved to post here):
It definitely makes sense to me that early stopping would remove the non-monotonicity. I think a broader point that's interesting re double descent, though, is what it says about why bigger models are better. That is, not only can bigger models fit larger datasets; according to the double descent story, there's also a meaningful sense in which bigger models have better inductive biases.
My reply:
The idea I’m objecting to is that there’s a sharp change from one regime (where performance improves because a larger family of models is available) to the other (where it improves because of better inductive bias). I’d say that both factors smoothly improve performance over the full range of model sizes. I don’t fully understand this yet, and I think it would be interesting to understand how bigger models and better inductive bias (from SGD + early stopping) come together to produce this smooth improvement in performance.
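To make this concrete, here's a minimal sketch of the kind of toy experiment I have in mind (entirely my own setup, not something from Evan's message): random-feature regression on synthetic 1D data. The unregularized min-norm fit should spike in test error near the interpolation threshold (number of features ≈ number of training points), while a small ridge penalty, used here as a rough stand-in for the implicit regularization of SGD + early stopping, should smooth that non-monotonicity out. All the function names and constants are arbitrary choices.

```python
# Toy double descent: random ReLU-feature regression on synthetic data.
# Min-norm least squares vs. ridge (a crude proxy for SGD + early stopping).
import numpy as np

rng = np.random.default_rng(0)
n_train, noise = 40, 0.2

def target(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + noise * rng.standard_normal(n_train)
x_test = rng.uniform(-1, 1, 500)
y_test = target(x_test)

def features(x, w, b):
    # Random ReLU features: phi_j(x) = max(0, w_j * x + b_j)
    return np.maximum(0.0, np.outer(x, w) + b)

for n_feat in [5, 10, 20, 40, 80, 160, 640]:
    w = rng.standard_normal(n_feat)
    b = rng.uniform(-1, 1, n_feat)
    Phi_tr, Phi_te = features(x_train, w, b), features(x_test, w, b)

    # Min-norm least-squares solution: pinv gives the minimum-norm fit,
    # which typically spikes in test error near n_feat == n_train.
    coef_mn = np.linalg.pinv(Phi_tr) @ y_train

    # Ridge regression: small explicit regularization standing in for the
    # implicit regularization of SGD + early stopping (an assumption here).
    lam = 1e-2
    coef_rr = np.linalg.solve(
        Phi_tr.T @ Phi_tr + lam * np.eye(n_feat), Phi_tr.T @ y_train
    )

    err_mn = np.mean((Phi_te @ coef_mn - y_test) ** 2)
    err_rr = np.mean((Phi_te @ coef_rr - y_test) ** 2)
    print(f"{n_feat:4d} features | min-norm test MSE {err_mn:9.3f} "
          f"| ridge test MSE {err_rr:9.3f}")
```

On most seeds the min-norm curve should peak around 40 features and then fall again as the model grows, while the ridge curve stays roughly monotone, which is the sense in which regularization removes the non-monotonicity rather than the improvement coming from two sharply separated regimes.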