I think this paper does a good job of collecting papers about double descent into one place where they can be contrasted and discussed.
I am not convinced that deep double descent is a pervasive phenomenon in practically-used neural networks, for reasons described in Rohin’s opinion about Preetum et al. This wouldn’t be so bad, except that the limitations of the evidence (smaller ResNets than usual, an effect that basically goes away without label noise in image classification, some sketchy choices made in the Belkin et al. experiments) are not really addressed or highlighted, which I think has a real prospect of misleading the reader. For instance, someone reading the post could be forgiven for not realizing that the colour plots of double descent in ResNet-18s only hold for 15% label noise.
Related to the above, the comments on this post seem pretty valuable to me, in terms of explaining non-obvious aspects and implications of the discussed paper. The speculation about the lottery ticket hypothesis is interesting but not obviously true. Papers that I have found useful for understanding this phenomenon include Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask and Linear Mode Connectivity and the Lottery Ticket Hypothesis.
Overall, my guess is that it is a mistake to think of double descent as a fact about modern machine learning, rather than a plausible hypothesis.
Fwiw, I really liked Rethinking Bias-Variance Trade-off for Generalization of Neural Networks (summarized in AN #129), and I think I’m now at “double descent is real and occurs when (empirical) bias is high but later overshadowed by (empirical) variance”. (Part of it is that it explains a lot of existing evidence, but another part is that my prior on an explanation like that being true is much higher than almost anything else that’s been proposed.)
I was pretty uncertain about the arguments in this post and the followup when they first came out. (More precisely, for any underlying claim, my opinions about the truth value of the claim seemed nearly independent of my beliefs about double descent.) I’d be interested in seeing a rewrite based on the bias-variance trade-off explanation; my current guess is that they won’t hold up, but I haven’t thought about it much.
Finally got around to that one, and am also pretty into that explanation for the cases of double descent we observe. It also tentatively makes me want to say that the decrease in variance with model size is the ‘real story’/primary thing we should think about.