even just from the simple Bayesian perspective, I suspect you can still get double descent.
Note that, in your example, if we do see double descent, it’s because the best hypothesis was previously not in the class of hypotheses we were considering. Bayesian methods tend to do badly when the hypothesis class is misspecified.
As a counterpoint though, you could see double descent even if your hypothesis class always contains the truth, because the “best” hypothesis need not be the truth. It could be that posterior(truth) < posterior(memorization hypothesis) < posterior(almost-right hypothesis that predicts the noise “by luck”).
Overall, my guess is that while you could engineer this if you tried, it wouldn’t happen “naturally” in synthetic examples (though it might happen for datasets like MNIST, because maybe there’s some property of those datasets that causes double descent to happen).
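To make the posterior-ordering point concrete, here’s a toy sketch (my own construction, with made-up priors and noise levels, not anything from the original discussion) showing that posterior(truth) < posterior(memorization) < posterior(almost-right) can come out even when the truth is in the hypothesis class:

```python
import numpy as np

n = 50
true_labels = np.tile([0, 1], n // 2)      # what the truth predicts
flips = np.zeros(n, dtype=bool)
flips[::5] = True                          # 10 of the 50 labels are noise-flipped
observed = np.where(flips, 1 - true_labels, true_labels)

def log_lik(predictions, eps=0.05):
    # Each hypothesis predicts its label with probability 1 - eps.
    agree = predictions == observed
    return np.sum(np.where(agree, np.log(1 - eps), np.log(eps)))

# Truth predicts the noiseless labels, so it misses the 10 flipped points.
ll_truth = log_lik(true_labels)
# Memorization and the "lucky" almost-right hypothesis both reproduce the
# observed labels exactly; they differ only in how complex they are.
ll_memo = log_lik(observed)
ll_lucky = log_lik(observed)

# Made-up log-priors: memorization is heavily penalized for complexity,
# the lucky hypothesis only slightly more than the truth.
lp_truth, lp_memo, lp_lucky = -3.0, -20.0, -4.0

for name, lp, ll in [("truth", lp_truth, ll_truth),
                     ("memorization", lp_memo, ll_memo),
                     ("almost-right", lp_lucky, ll_lucky)]:
    print(f"{name:>13}: unnormalized log-posterior = {lp + ll:.1f}")
```

The point is just that the truth pays a likelihood penalty on every noise-flipped point, so a slightly more complex hypothesis that happens to match the noise “by luck” can end up with the highest posterior.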
first you get a “likelihood descent” as you get hypotheses with greater and greater likelihood, but then you start overfitting to noise in your data as you get close to the interpolation threshold. Past the interpolation threshold, however, you get a second “prior descent” where you’re selecting hypotheses with greater and greater prior probability rather than greater and greater likelihood.
That first stage is not just a “likelihood descent”, it is a “likelihood + prior descent”, since you are choosing hypotheses based on the posterior, not based on the likelihood. And with Bayesian methods it’s quite possible that you never get to perfect likelihood, because the prior contains enough information to weed out the memorization hypotheses.
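As a tiny illustration of that last point (all numbers made up, just to show the arithmetic): if selection is by posterior rather than likelihood, a strong enough simplicity prior means the chosen hypothesis never reaches perfect training likelihood.

```python
import numpy as np

# Log-likelihood of the training data under each hypothesis. The
# memorization hypothesis fits the noisy training set perfectly.
log_likelihood = {"near-truth": -12.0, "memorization": 0.0}

# Simplicity prior: memorization-style hypotheses are vastly more complex,
# so they get much less prior mass.
log_prior = {"near-truth": -5.0, "memorization": -40.0}

# log-posterior = log-prior + log-likelihood (up to normalization).
log_post = {h: log_prior[h] + log_likelihood[h] for h in log_prior}
zs = np.array(list(log_post.values()))
post = np.exp(zs - zs.max())
post /= post.sum()
for h, p in zip(log_post, post):
    print(f"{h:>13}: posterior ≈ {p:.6f}")
# near-truth wins despite its worse training likelihood: the prior keeps
# the posterior from ever concentrating on a perfect-likelihood memorizer.
```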
Note that, in your example, if we do see double descent, it’s because the best hypothesis was previously not in the class of hypotheses we were considering. Bayesian methods tend to do badly when the hypothesis class is misspecified.
Yep, that’s exactly my model.
As a counterpoint though, you could see double descent even if your hypothesis class always contains the truth, because the “best” hypothesis need not be the truth.
If “best” here means test error, then presumably the truth should generalize at least as well as any other hypothesis.
That first stage is not just a “likelihood descent”, it is a “likelihood + prior descent”, since you are choosing hypotheses based on the posterior, not based on the likelihood.
True for the Bayesian case, though unclear in the ML case—I think it’s quite plausible that current ML underweights the implicit prior of SGD relative to maximizing the likelihood of the data (EDIT: which is another reason that better future ML might care more about inductive biases).
If “best” here means test error, then presumably the truth should generalize at least as well as any other hypothesis.
Sorry, “best” meant “the one that was chosen”, i.e. highest posterior, which need not be the truth. I agree that the truth generalizes at least as well as any other hypothesis.
True for the Bayesian case, though unclear in the ML case
I agree it’s unclear for the ML case, if only because double descent does happen and I have no idea why; “the prior doesn’t start affecting things until after interpolation” would explain it, even though that explanation itself needs explaining.