But this doesn’t make sense to me, because whatever is being used to “choose” the better model applies throughout training, so even at the interpolation threshold the model should already have been selected, throughout training, to be the kind of model that generalizes well. (For example, if you think regularization provides a simplicity bias that leads to better generalization, that regularization should also help models at the interpolation threshold, since you regularize throughout training.)
The idea—at least as I see it—is that the set of possible models that you can choose between increases with training. That is, there are many more models reachable within n+1 steps of training than there are models reachable within n steps of training. The interpolation threshold is the point at which there are the fewest reachable models with zero training error, so your inductive biases have the fewest choices—past that point, there are many more reachable models with zero training error, which lets the inductive biases be much more pronounced. One way in which I’ve been thinking about this is that ML models overweight the likelihood and underweight the prior, since we train exclusively on loss and effectively only use our inductive biases as a tiebreaker. Thus, when there aren’t many ties to break—that is, at the interpolation threshold—you get worse performance.
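To make the “inductive biases as a tiebreaker” picture concrete, here’s a toy sketch (my own illustration, not from the discussion above): in an underdetermined linear regression there are infinitely many weight vectors with zero training error, and an inductive bias like minimum L2 norm is exactly what picks one of them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined linear regression: 10 samples, 50 features, so there are
# infinitely many weight vectors with zero training error ("interpolating" models).
n, d = 10, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# The pseudoinverse picks the minimum-L2-norm interpolating solution --
# the "tiebreaker" role an inductive bias plays among zero-error models.
w_min_norm = np.linalg.pinv(X) @ y

# Any other interpolating solution differs by a vector in the null space of X
# and necessarily has a larger norm.
null_basis = np.linalg.svd(X)[2][n:]          # rows spanning null(X)
w_other = w_min_norm + 5.0 * null_basis[0]

print(np.linalg.norm(X @ w_min_norm - y) < 1e-8)             # interpolates
print(np.linalg.norm(X @ w_other - y) < 1e-8)                # also interpolates
print(np.linalg.norm(w_min_norm) < np.linalg.norm(w_other))  # bias broke the tie
```

On this picture, the interpolation threshold is the regime where the “null space” of interpolating solutions is smallest, so the tiebreaker has the least room to work.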
since we train exclusively on loss and effectively only use our inductive biases as a tiebreaker
If that were true, I’d buy the story presented in double descent. But we don’t do that; we regularize throughout training! The loss usually includes an explicit term penalizing the L2 norm of the weights, and that term is evaluated and trained against throughout training, across models, and regardless of dataset size.
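The point that the L2 penalty acts at every step, not just at the end, can be sketched with a toy gradient-descent loop (a hypothetical minimal setup, not any specific experiment from the papers): the weight-decay term contributes to every gradient update, so its shrinkage is baked into the whole trajectory.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear regression trained with an explicit L2 penalty in the loss.
# The regularizer is part of every gradient step, not a tiebreaker applied
# only once training error hits zero.
X = rng.normal(size=(20, 5))
w_true = rng.normal(size=5)
y = X @ w_true            # noiseless targets, so w_true has zero data loss

lam = 0.1                 # L2 (weight decay) coefficient
lr = 0.01
w = np.zeros(5)
for _ in range(2000):
    # gradient of mean squared error + gradient of (lam/2) * ||w||^2
    grad = X.T @ (X @ w - y) / len(y) + lam * w
    w -= lr * grad

# The regularized optimum is shrunk relative to the unregularized fit w_true.
print(np.linalg.norm(w) < np.linalg.norm(w_true))
```

The `lam * w` term is applied at every iteration, which is the sense in which the inductive bias is always active rather than only breaking ties at the interpolation threshold.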
It might be that the inductive biases come from some mechanism other than regularization (especially since some of the experiments are done without regularization, iirc). But even then, to be convinced of this story, I’d want to see some explanation of how, in terms of the training dynamics, the inductive biases act as a tiebreaker, and why that explanation doesn’t apply before the interpolation threshold.
Reading your comment again, the first three sentences seem different from the last two sentences. My response above is responding to the last two sentences; I’m not sure if you mean something different by the first three sentences.