I was under the impression that basically all SOTA capabilities rely on double descent. Is that impression wrong?
… Where is that impression coming from? If this is a widespread view, I could just be wrong about it; I have a cached belief that large language models and probably other models aren’t trained to the interpolation threshold and so aren’t leveraging double descent.
I haven’t kept track of dataset size vs model size, but things I’ve read on the double descent phenomenon have generally described it as a model that unifies the “classic statistics” paradigm, where you need to deal with the bias-variance tradeoff, with the “modern ML” paradigm, where bigger=better.
I guess it may depend on the domain? Generative tasks like language modelling or image encoding implicitly end up having a lot more bits/sample than discriminative tasks? So maybe generative tasks are usually not in the second-descent regime while discriminative tasks usually are?