I don’t see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and that further training then “fixes” the weights so they memorize the noise in a different way that generalizes better.
For example, the (random, meaningless) weights used to memorize noise can get spread across more degrees of freedom, so that on the test set their sum will be closer to 0.
That does not intuitively make sense to me. I’d need to see an example or a more fleshed-out argument to be convinced.
(Also, it sounds like an argument for model-wise double descent, but not epoch-wise double descent.)
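For what it’s worth, here is a toy sketch (my own construction, not from the parent comment) of the averaging effect I take the claim to be describing: a single training point’s noise label is fit exactly either by one weight or by many small weights on redundant features, and the spread-out solution contributes less on fresh test inputs.

```python
# Toy sketch, purely illustrative: a noise label eps on one training point is
# memorized by k "spare" features that all equal 1 on that training point but
# take independent +/-1 values on test points. Fitting eps exactly forces each
# weight to eps/k, so the spurious test-time contribution is a sum of k random
# signs times eps/k, whose typical size shrinks like eps/sqrt(k).
import numpy as np

rng = np.random.default_rng(0)
eps = 1.0  # noise label to be memorized on one training point

for k in [1, 10, 100, 1000]:
    w = np.full(k, eps / k)  # weights that exactly fit eps on the training point
    test_feats = rng.choice([-1, 1], size=(10_000, k))  # random signs on test points
    test_contrib = test_feats @ w  # spurious contribution on each test point
    print(f"k={k:5d}  train fit={w.sum():.2f}  "
          f"typical |test contribution|={np.abs(test_contrib).mean():.3f}")
```

Even if this averaging effect is real, it doesn’t by itself tell me why further gradient descent would move from the concentrated solution to the spread-out one, which is what the epoch-wise story seems to require.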