The authors don’t really suggest an explanation; the closest they come is speculating that at the interpolation threshold there’s only ~one model that can fit the data, which may be overfit, but then as you increase further the training procedure can “choose” from the various models that all fit the data, and that “choice” leads to better generalization. But this doesn’t make sense to me, because whatever is being used to “choose” the better model applies throughout training, and so even at the interpolation threshold the model should have been selected throughout training to be the type of model that generalized well.
I don’t understand your objection here. If there is only ~one model that fits the data, and the training procedure is such that it will find that model, then aren’t you just stuck w/ whatever level of generalizability that model has? And isn’t it irrelevant that your procedure has some bias towards better generalizability?
Or are you saying that even if there’s only one model at the interpolation threshold that fits the data, you’d expect the training procedure to pick a different model (one that doesn’t completely fit the data) instead, because of the bias towards generalizability?
Or are you saying that even if there’s only one model at the interpolation threshold that fits the data, you’d expect the training procedure to pick a different model (one that doesn’t completely fit the data) instead, because of the bias towards generalizability?
I don’t understand your objection here. If there is only ~one model that fits the data, and the training procedure is such that it will find that model, then aren’t you just stuck w/ whatever level of generalizability that model has? And isn’t it irrelevant that your procedure has some bias towards better generalizability?
Or are you saying that even if there’s only one model at the interpolation threshold that fits the data, you’d expect the training procedure to pick a different model (one that doesn’t completely fit the data) instead, because of the bias towards generalizability?
Yup, that.