Once we understand that relationship, it should become pretty clear why the overfitting argument doesn’t work: the overfit model is essentially the 2n model, which takes more bits to specify its core logic and then tries to “win” on simplicity by leaving m bits of extra information unspecified. But that doesn’t really matter: what matters is the size of the core logic, and if there’s a simple pattern that fits the data in n bits rather than 2n bits, you’ll learn that one.
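To make that counting explicit, here is a minimal sketch. The uniform-prior-over-bitstring-programs framing and the particular numbers are my own illustration, not something specified above; the point is just that free bits cancel out of the prior mass, so only the core logic size matters.

```python
# Toy counting argument: under a uniform prior over programs encoded as
# bitstrings, a model's effective prior mass is set by its core logic,
# not by how many unspecified "free" bits it carries along.

def prior_mass(core_bits: int, free_bits: int = 0) -> float:
    """Total prior probability of a model whose core logic takes `core_bits`
    to specify and which has `free_bits` bits that can be set arbitrarily
    (each setting is a distinct program implementing the same core logic)."""
    total_length = core_bits + free_bits
    programs_implementing_model = 2 ** free_bits
    return programs_implementing_model * 2 ** -total_length

n, m = 10, 50  # arbitrary example sizes

simple_model = prior_mass(core_bits=n)                     # 2^-n
overfit_model = prior_mass(core_bits=2 * n, free_bits=m)   # 2^m * 2^-(2n+m) = 2^-2n

print(simple_model, overfit_model)
# The m free bits cancel out, so the overfit model is still penalized by its
# larger core logic, and the n-bit pattern wins.
```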
Under this picture, or any other simplicity bias, why do NNs with more parameters generalize better?
Paradoxically, I think larger neural networks are more simplicity-biased.
The idea is that when you make your network larger, you increase the size of the search space: the set of algorithms you’re considering expands to include algorithms that take more computation. That reduces the relative importance of the speed prior and increases the relative importance of the simplicity prior. Your inductive biases are still selecting, from among those algorithms, the simplest pattern that fits the data, so you get good generalization; in fact you get even better generalization, because the space of algorithms in which you’re searching for the simplest one is now larger.
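As a rough illustration of that tradeoff, here is a toy sketch. The candidate algorithms, their description lengths, and their compute costs are all made up for illustration, and real NN training is not literally this selection procedure; the sketch only shows how a larger compute budget lets the “simplest thing that fits” land on slower but simpler algorithms.

```python
# Toy model: treat training as "pick the simplest algorithm that fits the data,
# among those the network can represent", where representability is limited by
# a compute budget that grows with network size.

from dataclasses import dataclass

@dataclass
class Algorithm:
    name: str
    description_length: int  # what the simplicity prior penalizes
    compute: int             # what the speed prior / network size limits
    fits_data: bool

CANDIDATES = [
    Algorithm("memorize the table", description_length=100, compute=1,  fits_data=True),
    Algorithm("shallow heuristic",  description_length=40,  compute=5,  fits_data=True),
    Algorithm("general pattern",    description_length=10,  compute=50, fits_data=True),
]

def train(compute_budget: int) -> Algorithm:
    """Return the simplest data-fitting algorithm within the compute budget."""
    representable = [a for a in CANDIDATES
                     if a.fits_data and a.compute <= compute_budget]
    return min(representable, key=lambda a: a.description_length)

print(train(compute_budget=1).name)   # memorize the table
print(train(compute_budget=5).name)   # shallow heuristic
print(train(compute_budget=50).name)  # general pattern: the larger "network"
                                      # can afford the slower but simpler algorithm
```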
Another way to think about this: if you really believe Occam’s razor, then any learning algorithm generalizes exactly to the extent that it approximates a simplicity prior—thus, since we know neural networks generalize better as they get larger, they must be approximating a simplicity prior better as they do so.