The decision tree result seemed counterintuitive to me, so I took a look at that section of the paper. I wasn’t impressed. In order to create a double descent curve for decision trees, they change their notion of “complexity” midway through the graph (check Figure 5) from the number of leaves in a tree to the number of trees in a forest. It turns out that right after this switch from leaves to trees, generalization starts to improve :)
I don’t see this as evidence for double descent per se. Just that ensembling improves generalization. Which is something we’ve known for a long time. (And this fact doesn’t seem like a big mystery to me. From a Bayesian perspective, ensembling is like using the posterior predictive distribution instead of the MAP estimate. BTW I think there’s also a Bayesian story for why flat minima generalize better—the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occam’s Razor works...)
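To make the “posterior predictive vs. MAP” point concrete, here’s a toy sketch (my own made-up coin-flip example, nothing from the paper): averaging predictions over posterior samples, the way an ensemble averages its members, recovers the posterior predictive rather than the MAP plug-in estimate.

```python
# Toy illustration of "ensembling ~= posterior predictive instead of MAP":
# a coin with a Beta posterior (the numbers here are made up for the sketch).
import numpy as np

rng = np.random.default_rng(0)

# Suppose we saw 7 heads and 1 tail under a uniform Beta(1, 1) prior.
a, b = 1 + 7, 1 + 1  # Beta posterior parameters

# MAP: pick the single most probable bias and predict with it alone.
map_theta = (a - 1) / (a + b - 2)

# "Ensemble": sample many biases from the posterior and average their predictions.
# The Monte Carlo average converges to the posterior predictive, a / (a + b).
thetas = rng.beta(a, b, size=100_000)
ensemble_prediction = thetas.mean()

print(f"MAP prediction of P(heads):       {map_theta:.3f}")            # 0.875
print(f"Ensembled / posterior predictive: {ensemble_prediction:.3f}")  # ~0.800
```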
Anyway, the authors’ rationale seems to be that once your decision tree has memorized the training set, the only way to increase the complexity of your hypothesis class is by adding trees to the forest. I’d rather they had kept the number of trees constant and only modulated the number of leaves.
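For concreteness, the experiment I have in mind would look something like this sketch (sklearn, with synthetic data and placeholder parameter ranges, not the paper’s setup): hold the number of trees fixed and sweep only the leaf count, so “complexity” means the same thing across the whole curve.

```python
# Sketch of the experiment I'd rather see: hold the number of trees fixed and
# sweep only the number of leaves, so "complexity" means one thing for the
# whole curve. (Synthetic data and placeholder ranges, not the paper's setup.)
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for max_leaves in [2, 8, 32, 128, 512, 2048]:  # complexity axis: leaves per tree
    forest = RandomForestRegressor(
        n_estimators=10,            # held constant throughout
        max_leaf_nodes=max_leaves,
        random_state=0,
    )
    forest.fit(X_tr, y_tr)
    test_mse = mean_squared_error(y_te, forest.predict(X_te))
    print(f"max_leaf_nodes={max_leaves:5d}  test MSE={test_mse:8.1f}")
```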
However, the decision tree discussion does point to a possible explanation of the double descent phenomenon for neural networks. Maybe once you’ve got enough complexity to memorize the training set, adding more complexity allows for a kind of “implicit ensembling” which leads to memorizing the training set in many different ways and averaging the results together like an ensemble does.
It’s suspicious to me that every neural network case study in the paper modulates layer width. There’s no discussion of modulating depth. My guess is they tried modulating depth but didn’t get the double descent phenomenon and decided to leave those experiments out.
I think increased layer width fits pretty nicely with my implicit ensembling story. Taking a Bayesian perspective on the output neuron: After there are enough neurons to memorize the training set, adding more leads to more pieces of evidence regarding the final output, making estimates more robust. Which is more or less why ensembles work IMO.
Note that double descent also happens with polynomial regression—see here for an example.
I took a look at the Colab notebook linked from that blog post, and there are some subtleties the blog post doesn’t discuss. First, the blog post says they’re using gradient descent, but if you look at the notebook, they’re actually computing a pseudoinverse. [EDIT: They state a justification in their notebook which I missed at first. See discussion below.] Second, the blog post talks about polynomial regression, but doesn’t mention that the notebook uses Legendre polynomials. I’m pretty sure high degree Legendre polynomials look like a squiggle which closely follows the x-axis on [-1, 1] (the interval used in their demo). If you fork the notebook and switch from Legendre polynomials to classic polynomial regression (1, x, x^2, x^3, etc.), the degree 100,000 fit appears to be worse than the degree 20 fit. I searched on Google and Google Scholar, and use of Legendre polynomials doesn’t seem to be common practice in ML.
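A standalone sketch of the comparison I mean (my own code on synthetic data, with degree 1,000 instead of 100,000 so it runs quickly; not their notebook):

```python
# Minimum-norm fits of the same few points in two bases: Legendre polynomials
# (what the notebook uses) vs. plain monomials 1, x, x^2, ...
# (My own sketch with synthetic data, not the notebook's code.)
import numpy as np

rng = np.random.default_rng(0)
n_train, degree = 20, 1000

x_train = np.sort(rng.uniform(-1, 1, n_train))
y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(n_train)
x_grid = np.linspace(-1, 1, 500)

def min_norm_fit(design_train, design_grid, y):
    # Minimum-norm least-squares coefficients via the pseudoinverse.
    coef = np.linalg.pinv(design_train) @ y
    return design_grid @ coef

legendre_pred = min_norm_fit(
    np.polynomial.legendre.legvander(x_train, degree),
    np.polynomial.legendre.legvander(x_grid, degree),
    y_train,
)
monomial_pred = min_norm_fit(
    np.polynomial.polynomial.polyvander(x_train, degree),
    np.polynomial.polynomial.polyvander(x_grid, degree),
    y_train,
)

target = np.sin(3 * x_grid)  # the noiseless function we'd like to recover
print("Legendre basis, mean abs error on a dense grid:", np.mean(np.abs(legendre_pred - target)))
print("Monomial basis, mean abs error on a dense grid:", np.mean(np.abs(monomial_pred - target)))
```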
Hi, I’m one of the authors:
Regarding gradient descent vs. pseudoinverse:
As we mention in the note at the top of the Colab, for the linear regression objective, running gradient descent to convergence (from 0 initialization) is *equivalent* to applying the pseudoinverse—they both find the minimum-norm interpolating solution.
So, we use the pseudoinverse just for computational efficiency; you would get equivalent results by actually running GD.
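For concreteness, here is a quick numerical check of that equivalence on a tiny random underdetermined least-squares problem (a standalone sketch, not the Colab’s setup):

```python
# Gradient descent on 0.5 * ||Xw - y||^2, started from w = 0, converges to the
# same minimum-norm interpolating solution that the pseudoinverse gives directly.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10, 50          # more parameters than data points
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)

w_pinv = np.linalg.pinv(X) @ y          # closed-form minimum-norm solution

w = np.zeros(n_features)                # gradient descent from zero initialization
lr = 1.0 / np.linalg.norm(X, 2) ** 2    # step size below 1/L, with L = (largest singular value)^2
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y)

print("max |w_gd - w_pinv| =", np.max(np.abs(w - w_pinv)))  # ~0 up to numerical error
```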
You’re also right that the choice of basis matters. The choice of basis changes the “implicit bias” of GD in linear regression, analogous to how the choice of architecture changes the implicit bias of GD on neural nets.
-- Preetum
Thanks Preetum. You’re right, I missed that note the first time—I edited my comment a bit.
It might be illuminating to say “the polynomial found by iterating weights starting at 0” instead of “the polynomial found with gradient descent”, since in this case, the inductive bias comes from the initialization point, not necessarily gradient descent per se. Neural nets can’t learn if all the weights are initialized to 0 at the start, of course :)
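A toy check of that point, reusing the same kind of tiny random least-squares problem (my own sketch): run the identical iteration from zero and from a random starting point. Both end up interpolating the data, but they land on different solutions, because GD converges to the interpolant closest to wherever it started.

```python
# Same least-squares setup as before: the "implicit bias" depends on the init.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 50))       # underdetermined: many interpolating solutions
y = rng.standard_normal(10)

def run_gd(w0, steps=5000):
    w = w0.copy()
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

w_from_zero = run_gd(np.zeros(50))
w_from_rand = run_gd(rng.standard_normal(50))

print("both interpolate the data:",
      np.allclose(X @ w_from_zero, y), np.allclose(X @ w_from_rand, y))
print("distance between the two solutions:", np.linalg.norm(w_from_zero - w_from_rand))
```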
BTW, I tried switching from pseudoinverse to regularized linear regression, and the super high degree polynomials seemed more overfit to me.
> the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occam’s Razor works...

That’s exactly correct. You can prove it via the Laplace approximation: the “width” (posterior variance) of the peak in each principal direction is the inverse of the corresponding eigenvalue of the Hessian, and each eigenvalue λ_i contributes −(1/2) log(λ_i) to the marginal log likelihood log P[data|model]. So, if a peak is twice as wide in one direction, its marginal log likelihood is higher by (1/2) log(2), or half a bit. For models in which the number of free parameters is large relative to the number of data points (i.e. the interesting part of the double-descent curve), this is the main term of interest in the marginal log likelihood.
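Spelling out that step for anyone who wants it (standard Laplace-approximation algebra, written in my notation):

```latex
% Laplace approximation around the MAP point \theta^*, with Hessian
% H = -\nabla^2_\theta \log P[\mathrm{data}, \theta \mid \mathrm{model}] evaluated at \theta^*:
\log P[\mathrm{data} \mid \mathrm{model}]
  = \log \int P[\mathrm{data}, \theta \mid \mathrm{model}] \, d\theta
  \approx \log P[\mathrm{data}, \theta^* \mid \mathrm{model}]
    + \frac{d}{2}\log(2\pi)
    - \frac{1}{2}\log\det H .
% \det H is the product of the Hessian's eigenvalues \lambda_1, \dots, \lambda_d, so
-\frac{1}{2}\log\det H = \sum_{i=1}^{d} -\frac{1}{2}\log(\lambda_i) ,
% i.e. each principal direction contributes -\tfrac{1}{2}\log(\lambda_i). Doubling the
% posterior variance 1/\lambda_i in one direction (halving \lambda_i) raises the marginal
% log likelihood by \tfrac{1}{2}\log 2, i.e. half a bit.
```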
In Jaynes’ Logic of Science book, in the chapter on model comparison, I believe he directly claims that this is how/why Occam’s Razor works.
That said, it’s not clear that this particular proof would generalize properly to systems which perfectly fit the data. In that case, there’s an entire surface on which P[data|θ] is constant, so Laplace needs some tweaking.