even just from the simple Bayesian perspective, I suspect you can still get double descent.
Note that, in your example, if we do see double descent, it’s because the best hypothesis was previously not in the class of hypotheses we were considering. Bayesian methods tend to do badly when the hypothesis class is misspecified.
As a counterpoint though, you could see double descent even if your hypothesis class always contains the truth, because the “best” hypothesis need not be the truth. It could be that posterior(truth) < posterior(memorization hypothesis) < posterior(almost-right hypothesis that predicts the noise “by luck”).
Overall, my guess is that while you could engineer this if you tried, it wouldn’t happen “naturally” in synthetic examples (though it might happen for datasets like MNIST, because maybe there’s some property of those datasets that causes double descent to happen).
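To make the posterior-ordering point concrete, here’s a toy sketch (my own construction, with made-up priors and noise levels, not anything from the original discussion) showing that posterior(truth) < posterior(memorization) < posterior(almost-right) can come out even when the truth is in the hypothesis class:

```python
import numpy as np

n = 50
true_labels = np.tile([0, 1], n // 2)      # what the truth predicts
flips = np.zeros(n, dtype=bool)
flips[::5] = True                          # 10 of the 50 labels are noise-flipped
observed = np.where(flips, 1 - true_labels, true_labels)

def log_lik(predictions, eps=0.05):
    # Each hypothesis predicts its label with probability 1 - eps.
    agree = predictions == observed
    return np.sum(np.where(agree, np.log(1 - eps), np.log(eps)))

# Truth predicts the noiseless labels, so it misses the 10 flipped points.
ll_truth = log_lik(true_labels)
# Memorization and the "lucky" almost-right hypothesis both reproduce the
# observed labels exactly; they differ only in how complex they are.
ll_memo = log_lik(observed)
ll_lucky = log_lik(observed)

# Made-up log-priors: memorization is heavily penalized for complexity,
# the lucky hypothesis only slightly more than the truth.
lp_truth, lp_memo, lp_lucky = -3.0, -20.0, -4.0

for name, lp, ll in [("truth", lp_truth, ll_truth),
                     ("memorization", lp_memo, ll_memo),
                     ("almost-right", lp_lucky, ll_lucky)]:
    print(f"{name:>13}: unnormalized log-posterior = {lp + ll:.1f}")
```

The point is just that the truth pays a likelihood penalty on every noise-flipped point, so a slightly more complex hypothesis that happens to match the noise “by luck” can end up with the highest posterior.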
first you get a “likelihood descent” as you get hypotheses with greater and greater likelihood, but then you start overfitting to noise in your data as you get close to the interpolation threshold. Past the interpolation threshold, however, you get a second “prior descent” where you’re selecting hypotheses with greater and greater prior probability rather than greater and greater likelihood.
That first stage is not just a “likelihood descent”, it is a “likelihood + prior descent”, since you are choosing hypotheses based on the posterior, not based on the likelihood. And with Bayesian methods it’s quite possible that you never get to perfect likelihood, because the prior contains enough information to weed out the memorization hypotheses.
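As a tiny illustration of that last point (all numbers made up, just to show the arithmetic): if selection is by posterior rather than likelihood, a strong enough simplicity prior means the chosen hypothesis never reaches perfect training likelihood.

```python
import numpy as np

# Log-likelihood of the training data under each hypothesis. The
# memorization hypothesis fits the noisy training set perfectly.
log_likelihood = {"near-truth": -12.0, "memorization": 0.0}

# Simplicity prior: memorization-style hypotheses are vastly more complex,
# so they get much less prior mass.
log_prior = {"near-truth": -5.0, "memorization": -40.0}

# log-posterior = log-prior + log-likelihood (up to normalization).
log_post = {h: log_prior[h] + log_likelihood[h] for h in log_prior}
zs = np.array(list(log_post.values()))
post = np.exp(zs - zs.max())
post /= post.sum()
for h, p in zip(log_post, post):
    print(f"{h:>13}: posterior ≈ {p:.6f}")
# near-truth wins despite its worse training likelihood: the prior keeps
# the posterior from ever concentrating on a perfect-likelihood memorizer.
```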
Note that, in your example, if we do see double descent, it’s because the best hypothesis was previously not in the class of hypotheses we were considering. Bayesian methods tend to do badly when the hypothesis class is misspecified.
Yep, that’s exactly my model.
As a counterpoint though, you could see double descent even if your hypothesis class always contains the truth, because the “best” hypothesis need not be the truth.
If “best” here means test error, then presumably the truth should generalize at least as well as any other hypothesis.
That first stage is not just a “likelihood descent”, it is a “likelihood + prior descent”, since you are choosing hypotheses based on the posterior, not based on the likelihood.
True for the Bayesian case, though unclear in the ML case—I think it’s quite plausible that current ML underweights the implicit prior of SGD relative to maximizing the likelihood of the data (EDIT: which is another reason that better future ML might care more about inductive biases).
If “best” here means test error, then presumably the truth should generalize at least as well as any other hypothesis.
Sorry, “best” meant “the one that was chosen”, i.e. highest posterior, which need not be the truth. I agree that the truth generalizes at least as well as any other hypothesis.
True for the Bayesian case, though unclear in the ML case
I agree it’s unclear for the ML case, if only because double descent does happen and I have no idea why; “the prior doesn’t start affecting things until after interpolation” would explain it, even though that explanation itself needs explaining.