Note that, in your example, if we do see double descent, it’s because the best hypothesis was previously not in the class of hypotheses we were considering. Bayesian methods tend to do badly when the hypothesis class is misspecified.
Yep, that’s exactly my model.
As a counterpoint though, you could see double descent even if your hypothesis class always contains the truth, because the “best” hypothesis need not be the truth.
If “best” here means test error, then presumably the truth should generalize at least as well as any other hypothesis.
That first stage is not just a “likelihood descent”; it is a “likelihood + prior descent”, since you are choosing hypotheses based on the posterior, not based on the likelihood.
True for the Bayesian case, though unclear in the ML case—I think it’s quite plausible that current ML underweights the implicit prior of SGD relative to maximizing the likelihood of the data (EDIT: which is another reason that better future ML might care more about inductive biases).
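A minimal sketch of that distinction (hypothetical hypotheses, prior, and data, not from the exchange above): the selection score below is log-likelihood plus a weighted log-prior, where weight 1 gives the Bayesian (MAP) choice and a small weight approximates choosing purely by likelihood, i.e. an underweighted prior.

```python
import numpy as np

# Hypothetical setup: three coin-bias hypotheses and a small dataset of flips.
hypotheses = np.array([0.3, 0.5, 0.7])          # candidate biases
log_prior = np.log(np.array([0.1, 0.8, 0.1]))   # prior strongly favors the 0.5 coin
flips = np.array([1, 1, 1, 0, 1, 1])            # observed data (1 = heads)

def log_likelihood(theta, data):
    """Bernoulli log-likelihood of the data under bias theta."""
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

log_lik = np.array([log_likelihood(t, flips) for t in hypotheses])

def select(prior_weight):
    """Pick the hypothesis maximizing log-likelihood + prior_weight * log-prior."""
    return hypotheses[np.argmax(log_lik + prior_weight * log_prior)]

print(select(1.0))   # "likelihood + prior descent": the prior pulls the choice to 0.5
print(select(0.01))  # nearly pure likelihood: the data pull the choice to 0.7
```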
> If “best” here means test error, then presumably the truth should generalize at least as well as any other hypothesis.
Sorry, “best” meant “the one that was chosen”, i.e. highest posterior, which need not be the truth. I agree that the truth generalizes at least as well as any other hypothesis.
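A toy numerical version of that point, with made-up numbers and the truth included in the hypothesis class: given a strong enough prior on the wrong hypothesis and little data, the highest-posterior hypothesis is not the truth, even though the truth has the better expected test loss.

```python
import numpy as np

rng = np.random.default_rng(0)

true_theta = 0.7                              # the truth IS in the hypothesis class
hypotheses = np.array([0.5, 0.7])
log_prior = np.log(np.array([0.95, 0.05]))    # prior strongly favors the wrong hypothesis
train = rng.binomial(1, true_theta, size=8)   # small training sample

def log_likelihood(theta, data):
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

log_post = np.array([log_likelihood(t, train) for t in hypotheses]) + log_prior
chosen = hypotheses[np.argmax(log_post)]      # highest-posterior hypothesis: 0.5, not the truth

def expected_test_loss(theta):
    """Expected per-flip log loss under the true data distribution."""
    return -(true_theta * np.log(theta) + (1 - true_theta) * np.log(1 - theta))

print("chosen:", chosen, "loss:", expected_test_loss(chosen))
print("truth: ", true_theta, "loss:", expected_test_loss(true_theta))  # never worse than any other hypothesis
```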
> True for the Bayesian case, though unclear in the ML case
I agree it’s unclear for the ML case, just because double descent happens, I have no idea why it does, and “the prior doesn’t start affecting things until after interpolation” does explain it, even though that explanation itself needs explaining.