suppose we have a 500,000-degree polynomial, and that we fit this to 50,000 data points. In this case, we have 450,000 degrees of freedom, and we should by default expect to end up with a function which generalises very poorly. But when we train a neural network with 500,000 parameters on 50,000 MNIST images, we end up with a neural network that generalises well. Moreover, adding more parameters to the neural network will typically make generalisation better, whereas adding more parameters to the polynomial is likely to make generalisation worse.
Only tangentially related, but your intuition about polynomial regression is not quite correct. A large range of polynomial regression learning tasks will display double descent, where adding more and more higher-degree terms consistently improves test loss past the interpolation threshold.
Examples from here:
Relevant paper
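In case it's useful to poke at this directly, here is a minimal sketch of the kind of setup where the effect shows up (the target function, noise level, sample sizes, and the Legendre feature choice are my own illustrative assumptions, not taken from the linked examples): fit noisy samples with a growing number of polynomial features and always take the minimum-norm least-squares solution. Test error typically rises as the feature count approaches the number of training points and can fall again past it; how pronounced the second descent is depends on the basis, its scaling, and the noise level.

```python
# Minimum-norm polynomial regression past the interpolation threshold.
# Target function, noise level, sample sizes, and the Legendre basis are
# illustrative assumptions, not taken from the linked examples.
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
n_train, n_test, noise = 20, 1000, 0.3

def target(x):
    return np.sin(3 * np.pi * x)

x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + noise * rng.normal(size=n_train)
x_test = rng.uniform(-1, 1, n_test)
y_test = target(x_test)

def legendre_features(x, n_feats):
    # Columns are the Legendre polynomials P_0 ... P_{n_feats-1} evaluated at x.
    return legendre.legvander(x, n_feats - 1)

for n_feats in [2, 5, 10, 15, 20, 25, 50, 100, 200, 500]:
    Phi_train = legendre_features(x_train, n_feats)
    Phi_test = legendre_features(x_test, n_feats)
    # lstsq returns the minimum-norm solution when n_feats > n_train,
    # which is the implicit regularisation that makes a second descent possible.
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"features = {n_feats:3d}   test MSE = {test_mse:.3f}")
```

The key ingredient is the minimum-norm solution: past the interpolation threshold there are infinitely many interpolants, and the implicit bias towards the smallest-norm one is what lets extra features help rather than hurt.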
IIRC this is probably the case for a broad range of non-NN models. I think the original Double Descent paper showed it for random Fourier features.
My current guess is that NN architectures are just especially affected by this, due to having even more degenerate behavioral manifolds, with RLCTs ranging very widely from tiny to large.
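For concreteness, a random Fourier feature map in the Rahimi–Recht style is just another fixed, non-polynomial feature family; a minimal sketch (the length scale and the fixed seed are my own illustrative choices) that could be dropped in place of the polynomial features in the sketch above:

```python
# A random Fourier feature map (Rahimi–Recht style); the length scale and the
# fixed seed (so that train and test share the same random features) are
# illustrative assumptions. It can be dropped in place of legendre_features
# in the sketch above to get a non-polynomial model family.
import numpy as np

def random_fourier_features(x, n_feats, length_scale=0.3, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 1.0 / length_scale, n_feats)   # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, n_feats)         # random phases
    return np.sqrt(2.0 / n_feats) * np.cos(np.outer(x, w) + b)
```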
That’s interesting, thank you for this!
I’m trying to get a quick intuition for this. I’ve not read the papers.
My attempt:
On a compact domain, any continuous function can be uniformly approximated by polynomials (Weierstrass)
Powers blow up quickly, so you need many terms to build a well-behaved function out of a power series: the high-degree terms have to cancel each other out near the edges of the domain
As the domain gets larger, this approximation becomes harder
So the relevant question is: how does the degree at which the training phase transition happens change with domain size, domain dimensionality, and the decay rate of the Fourier coefficients? (See the sketch below for the domain-size part.)
Does this make sense?
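To make the domain-size step concrete, here is a minimal numerical sketch (the target function, tolerance, grid density, and degree cap are my own illustrative choices): it searches for the smallest Chebyshev-fit degree that reaches a fixed uniform accuracy for sin(x) on [-L, L] as L grows. For a band-limited target like this the required degree grows roughly linearly in L; a target with slower Fourier decay would need correspondingly higher degrees.

```python
# How the degree needed for a fixed uniform accuracy grows with domain size.
# The target function, tolerance, grid density, and degree cap are
# illustrative assumptions; a Chebyshev least-squares fit on a dense grid
# stands in for the best uniform polynomial approximation.
import numpy as np
from numpy.polynomial import chebyshev

target = np.sin     # a fixed smooth target with rapidly decaying Fourier spectrum
tol = 1e-3          # required sup-norm accuracy on the grid

for L in [1, 2, 4, 8, 16, 32]:
    x = np.linspace(-L, L, 4001)
    y = target(x)
    for deg in range(1, 201):
        p = chebyshev.Chebyshev.fit(x, y, deg)      # degree-`deg` Chebyshev fit
        if np.max(np.abs(p(x) - y)) < tol:          # grid sup-error as a proxy
            print(f"domain [-{L:2d}, {L:2d}]: degree {deg} suffices")
            break
    else:
        print(f"domain [-{L:2d}, {L:2d}]: degree > 200 needed")
```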