A technical difficulty with saying that overfitting happens when there are “too many parameters” is that the parameters may do arbitrarily complicated things. For example they may encode C functions, in which case a model with a single (infinite-precision) real parameter can fit anything very well! Functions that are linear in their parameters and inputs do not suffer from this problem; the number of parameters summarizes their overfitting capacity well. The same is not true of some nonlinear functions.
To avoid confusion it may be helpful to define overfitting more precisely. The gist of any reasonable definition of overfitting is: If I randomly perturb the desired outputs of my function, how well can I find new parameters to fit the new outputs? I can’t do a good job of giving more detail than that in a short comment, but if you feel confused about overfitting, here’s a good (and famous) article about frequentist learning theory by Vladimir Vapnik that may be useful:
This is about “reasonable encoding” not “linearity,” though. That is, linear functions of parameters encode reasonably, but not all reasonable encodings are linear. We can define a parameter to be precisely one bit of information, and then ask for the minimum of bits needed.
I don’t understand why people are so hung up on linearity.
A technical difficulty with saying that overfitting happens when there are “too many parameters” is that the parameters may do arbitrarily complicated things. For example they may encode C functions, in which case a model with a single (infinite-precision) real parameter can fit anything very well! Functions that are linear in their parameters and inputs do not suffer from this problem; the number of parameters summarizes their overfitting capacity well. The same is not true of some nonlinear functions.
To avoid confusion it may be helpful to define overfitting more precisely. The gist of any reasonable definition of overfitting is: If I randomly perturb the desired outputs of my function, how well can I find new parameters to fit the new outputs? I can’t do a good job of giving more detail than that in a short comment, but if you feel confused about overfitting, here’s a good (and famous) article about frequentist learning theory by Vladimir Vapnik that may be useful:
http://web.mit.edu/6.962/www/www_spring_2001/emin/slt.pdf
This is about “reasonable encoding” not “linearity,” though. That is, linear functions of parameters encode reasonably, but not all reasonable encodings are linear. We can define a parameter to be precisely one bit of information, and then ask for the minimum of bits needed.
I don’t understand why people are so hung up on linearity.