I don’t see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and that further training then “fixes” the weights so they memorize the noise in a different way that generalizes better.
For example, the (random, meaningless) weights used to memorize noise can get spread across more degrees of freedom, so that on the test set their sum will be closer to 0.
That does not intuitively make sense to me. I’d need to see an example or a more fleshed-out argument to be convinced.
(Also, it sounds like an argument for model-wise double descent, but not epoch-wise double descent.)
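For what it’s worth, here is a toy sketch (my own construction, not from the parent comment) of the averaging effect I take the claim to be describing: a single training point’s noise label is fit exactly either by one weight or by many small weights on redundant features, and the spread-out solution contributes less on fresh test inputs.

```python
# Toy sketch, purely illustrative: a noise label eps on one training point is
# memorized by k "spare" features that all equal 1 on that training point but
# take independent +/-1 values on test points. Fitting eps exactly forces each
# weight to eps/k, so the spurious test-time contribution is a sum of k random
# signs times eps/k, whose typical size shrinks like eps/sqrt(k).
import numpy as np

rng = np.random.default_rng(0)
eps = 1.0  # noise label to be memorized on one training point

for k in [1, 10, 100, 1000]:
    w = np.full(k, eps / k)  # weights that exactly fit eps on the training point
    test_feats = rng.choice([-1, 1], size=(10_000, k))  # random signs on test points
    test_contrib = test_feats @ w  # spurious contribution on each test point
    print(f"k={k:5d}  train fit={w.sum():.2f}  "
          f"typical |test contribution|={np.abs(test_contrib).mean():.3f}")
```

Even if this averaging effect is real, it doesn’t by itself tell me why further gradient descent would move from the concentrated solution to the spread-out one, which is what the epoch-wise story seems to require.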