Fwiw, I really liked Rethinking Bias-Variance Trade-off for Generalization of Neural Networks (summarized in AN #129), and I think I’m now at “double descent is real and occurs when (empirical) bias is high but later overshadowed by (empirical) variance”. (Part of it is that it explains a lot of existing evidence, but another part is that my prior on an explanation like that being true is much higher than almost anything else that’s been proposed.)
I was pretty uncertain about the arguments in this post and the followup when they first came out. (More precisely, for any underlying claim, my opinion about its truth value seemed nearly independent of my beliefs about double descent.) I’d be interested in seeing a rewrite based on the bias-variance trade-off explanation; my current guess is that the arguments won’t hold up, but I haven’t thought about it much.
Finally got around to that one, and am also pretty into that explanation for the cases of double descent we observe. It also tentatively makes me want to say that the decrease in variance with model size is the ‘real story’/primary thing we should think about.
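For reference, here is the standard squared-loss decomposition that this explanation leans on. The notation is my own shorthand, not necessarily the paper's: $f_T$ is the model trained on a random training set $T$, $y(x)$ is the target at input $x$, and label noise is ignored for simplicity.

$$
\mathbb{E}_T\big[(f_T(x) - y(x))^2\big]
= \underbrace{\big(\mathbb{E}_T[f_T(x)] - y(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_T\big[(f_T(x) - \mathbb{E}_T[f_T(x)])^2\big]}_{\text{variance}}
$$

As I understand the paper's account, the bias term falls roughly monotonically with model size while the variance term rises and then falls, so whether the sum shows a second descent depends on how large the variance bump is relative to the bias at that point.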