By taking squares we are more forgiving when the model gets the answer almost right but much less forgiving when the model is way off
It’s really unclear why this would be better. I was going to ask for a clarification but I found something better instead.
The first justification for taking the least-squares estimator that came to mind is that it's the maximum likelihood solution if you assume your errors E are independent and Gaussian. Under those conditions, the probability density of the data is $p(\{y_i\}_{i=1,\dots,N} \mid \{x_i\}_{i=1,\dots,N}, \sigma^2) = \prod_i \text{Gaussian}(y_i \mid g(x_i), \sigma^2)$. We take the product because the errors are independent for each point. If you take the logarithm (a transformation that preserves the location of the maximum), this comes out to $-\frac{1}{2\sigma^2} \sum_i (y_i - g(x_i))^2$ plus a constant that doesn't depend on g.
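Here is a minimal numpy sketch of that equivalence, using toy data of my own choosing (a hypothetical cubic g, noise level, and seed): because the Gaussian log-likelihood is just $-\frac{1}{2\sigma^2}$ times the sum of squared residuals plus a constant, the coefficients that minimize the squared error also maximize the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = g(x) + Gaussian noise, with g a cubic polynomial (illustrative values).
true_coeffs = np.array([0.5, -1.0, 0.0, 2.0])   # highest degree first, as np.polyval expects
x = np.linspace(-1, 1, 50)
sigma = 0.3
y = np.polyval(true_coeffs, x) + rng.normal(0.0, sigma, size=x.shape)

def sum_squared_error(coeffs):
    return np.sum((y - np.polyval(coeffs, x)) ** 2)

def log_likelihood(coeffs):
    # Sum of Gaussian log-densities over independent points:
    # log p = -N/2 * log(2*pi*sigma^2) - 1/(2*sigma^2) * sum_i (y_i - g(x_i))^2
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - sum_squared_error(coeffs) / (2 * sigma**2))

# np.polyfit minimizes the squared error over degree-3 polynomials...
ls_coeffs = np.polyfit(x, y, deg=3)

# ...and therefore also maximizes the Gaussian log-likelihood:
# any other coefficient vector scores lower.
perturbed = ls_coeffs + rng.normal(0.0, 0.05, size=ls_coeffs.shape)
assert log_likelihood(ls_coeffs) > log_likelihood(perturbed)
print("log-likelihood at least-squares fit:", log_likelihood(ls_coeffs))
print("log-likelihood at perturbed coeffs :", log_likelihood(perturbed))
```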
It turns out there's a stronger justification, which doesn't require assuming the errors are Gaussian. The Gauss-Markov theorem shows that the least-squares estimator is the best unbiased estimator of the coefficients of g among estimators that are linear in the observations, with the regressors being the powers of the data, i.e. $x_i, x_i^2, x_i^3, \dots$. That it's unbiased means that, across many independent samples of data, the mean of the least-squares estimates equals the true coefficients of the polynomial you used to generate the data. It's the best in the sense that it has the lowest variance among such estimators under this sampling scheme. If you have a lot of data, by the central limit theorem, this is roughly the same as having the lowest squared error in coefficient space.
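A rough simulation sketch of the unbiasedness claim, again with toy values of my own choosing, and with deliberately non-Gaussian noise, since Gauss-Markov only needs zero-mean, independent, equal-variance errors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed cubic polynomial and design points; only the noise is resampled each time.
true_coeffs = np.array([0.5, -1.0, 0.0, 2.0])   # highest degree first
x = np.linspace(-1, 1, 50)
clean = np.polyval(true_coeffs, x)

n_datasets = 5_000
estimates = np.empty((n_datasets, len(true_coeffs)))
for k in range(n_datasets):
    # Zero-mean uniform noise: not Gaussian, but it satisfies the Gauss-Markov conditions.
    noise = rng.uniform(-0.5, 0.5, size=x.shape)
    estimates[k] = np.polyfit(x, clean + noise, deg=3)

# Averaged over many independent datasets, the least-squares estimates
# recover the true coefficients (unbiasedness).
print("true coefficients :", true_coeffs)
print("mean of estimates :", estimates.mean(axis=0).round(3))
```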
However, if you're willing to drop the requirement that the estimator is unbiased, you can get even better error on average. The James-Stein estimator has lower variance than least squares, at the cost of biasing your estimates of the coefficients towards zero (or some other point), and it ends up with lower expected squared error overall. So, even within a fixed maximum degree of the polynomial, it can help to penalise the magnitude of the coefficients themselves, not just the degree.
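A small numpy sketch in the classic James-Stein setting, with illustrative numbers of my own choosing: a coefficient vector in $d \ge 3$ dimensions observed with independent Gaussian noise of known variance, shrunk towards zero with the positive-part James-Stein rule. Averaged over many trials, the biased shrinkage estimate has lower squared error than the unbiased one.

```python
import numpy as np

rng = np.random.default_rng(2)

# Treat the "coefficients" as a d >= 3 vector estimated with independent Gaussian noise.
theta = np.array([0.5, -1.0, 0.0, 2.0])   # true coefficient vector (toy values)
sigma = 1.0
d = len(theta)

n_trials = 100_000
z = theta + rng.normal(0.0, sigma, size=(n_trials, d))   # unbiased estimates, one per row

# Positive-part James-Stein: shrink each estimate towards zero by a data-dependent factor.
norms2 = np.sum(z * z, axis=1, keepdims=True)
shrink = np.clip(1.0 - (d - 2) * sigma**2 / norms2, 0.0, None)
js = shrink * z

mse_raw = np.mean(np.sum((z - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print("mean squared error, unbiased estimate :", round(mse_raw, 3))
print("mean squared error, James-Stein       :", round(mse_js, 3))
```

Shrinking towards zero is essentially a penalty on coefficient magnitude, which is why it connects to the point about penalising coefficients rather than only the polynomial's degree.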
On a related note, for reasons that aren't yet fully understood, neural networks with far more parameters in a single layer than necessary seem to generalize better than networks with few parameters. See, for example, Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks. It seems to be related to the fact that gradient descent doesn't actually find the optimal configuration of parameters, only an approximation of it, so the standard statistical analysis you outlined here doesn't apply in that case.
Great post, thank you for writing it!