If this model is supposed to explain double descent, why isn’t the model at the first local minimum more intelligent than later models with lower loss? Shouldn’t it have learned the simple model of the data, without the deviations?