I took a look at the Colab notebook linked from that blog post, and there are some subtleties the blog post doesn’t discuss. First, the blog post says they’re using gradient descent, but if you look at the notebook, they’re actually computing a pseudoinverse. [EDIT: They state a justification in their notebook which I missed at first. See discussion below.] Second, the blog post talks about polynomial regression, but doesn’t mention that the notebook uses Legendre polynomials. I’m pretty sure high degree Legendre polynomials look like a squiggle which closely follows the x-axis on [-1, 1] (the interval used in their demo). If you fork the notebook and switch from Legendre polynomials to classic polynomial regression (1, x, x^2, x^3, etc.), the degree 100,000 fit appears to be worse than the degree 20 fit. I searched on Google and Google Scholar, and use of Legendre polynomials doesn’t seem to be common practice in ML.
As we mention in the note at the top of the Colab,for the linear regression objective, running gradient descent to convergence (from 0 initialization) is *equivalent* to applying the pseudoinverse—they both find the minimum-norm interpolating solution.
So, we use the pseudoinverse just for computational efficiency; you would get equivalent results by actually running GD.
You’re also right that the choice of basis matters. The choice of basis changes the “implicit bias” of GD in linear regression, analogous to how the choice of architecture changes the implicit bias of GD on neural nets.
Thanks Preetum. You’re right, I missed that note the first time—I edited my comment a bit.
It might be illuminating to say “the polynomial found by iterating weights starting at 0” instead of “the polynomial found with gradient descent”, since in this case, the inductive bias comes from the initialization point, not necessarily gradient descent per se. Neural nets can’t learn if all the weights are initialized to 0 at the start, of course :)
BTW, I tried switching from pseudoinverse to regularized linear regression, and the super high degree polynomials seemed more overfit to me.
Note that double descent also happens with polynomial regression—see here for an example.
I took a look at the Colab notebook linked from that blog post, and there are some subtleties the blog post doesn’t discuss. First, the blog post says they’re using gradient descent, but if you look at the notebook, they’re actually computing a pseudoinverse. [EDIT: They state a justification in their notebook which I missed at first. See discussion below.] Second, the blog post talks about polynomial regression, but doesn’t mention that the notebook uses Legendre polynomials. I’m pretty sure high degree Legendre polynomials look like a squiggle which closely follows the x-axis on [-1, 1] (the interval used in their demo). If you fork the notebook and switch from Legendre polynomials to classic polynomial regression (1, x, x^2, x^3, etc.), the degree 100,000 fit appears to be worse than the degree 20 fit. I searched on Google and Google Scholar, and use of Legendre polynomials doesn’t seem to be common practice in ML.
Hi, I’m one of the authors:
Regarding gradient descent vs. pseudoinverse:
As we mention in the note at the top of the Colab,for the linear regression objective, running gradient descent to convergence (from 0 initialization) is *equivalent* to applying the pseudoinverse—they both find the minimum-norm interpolating solution.
So, we use the pseudoinverse just for computational efficiency; you would get equivalent results by actually running GD.
You’re also right that the choice of basis matters. The choice of basis changes the “implicit bias”
of GD in linear regression, analogous to how the choice of architecture changes the implicit bias of GD on neural nets.
-- Preetum
Thanks Preetum. You’re right, I missed that note the first time—I edited my comment a bit.
It might be illuminating to say “the polynomial found by iterating weights starting at 0” instead of “the polynomial found with gradient descent”, since in this case, the inductive bias comes from the initialization point, not necessarily gradient descent per se. Neural nets can’t learn if all the weights are initialized to 0 at the start, of course :)
BTW, I tried switching from pseudoinverse to regularized linear regression, and the super high degree polynomials seemed more overfit to me.