Because human flaws creep into the modelling process as well. Taking non-linear relationships into account (unless there is a causal reason to do so) is asking for statistical trouble unless you very carefully account for how many models you have tested and tried (which almost nobody does).
How do I account for how many models I’ve tested? No, really, I don’t know what that’d even be called in the statistics literature, and it seems like if a general technique for doing this were known the big data people would be all over it.
What we’re doing at the FHI is treating it like a machine learning problem: splitting the data into a training and a testing set, checking as much as we want on the training set, formulating the hypotheses, then testing them on the testing set.
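A minimal sketch of that train/test discipline, assuming synthetic data and ordinary least squares as a stand-in model. The helper names (`holdout_split`, `fit_ols`, `mse`) and the 30% test fraction are illustrative assumptions, not a description of the actual FHI workflow.

```python
import numpy as np

def holdout_split(n_rows, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx): a random, non-overlapping partition of row indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(test_fraction * n_rows))
    return order[n_test:], order[:n_test]

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared prediction error of a fitted coefficient vector."""
    design = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((y - design @ coef) ** 2))

# Synthetic stand-in data (assumption): only the first predictor matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 1.5 * X[:, 0] + rng.normal(size=200)

train_idx, test_idx = holdout_split(len(y))

# Explore as much as you like here, but only ever on the training rows.
coef = fit_ols(X[train_idx], y[train_idx])

# Touch the test rows once, for the hypothesis you settled on.
test_mse = mse(coef, X[test_idx], y[test_idx])
print(test_mse)
```

The point of the design is that the test rows play no part in choosing the model, so the number of things tried on the training set does not contaminate the final check.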
The Bayesian approach with multiple models seems to be exactly what we need, e.g. http://www.stat.washington.edu/raftery/Research/PDF/socmeth1995.pdf
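One common way to put the multiple-models idea into practice is to score every candidate model with BIC and turn the scores into approximate posterior model probabilities. The sketch below assumes Gaussian linear regressions, equal prior weight on every model, and synthetic data; `bic_linear` is a hypothetical helper, and this is a rough illustration rather than a faithful reproduction of the linked paper.

```python
import itertools
import numpy as np

def bic_linear(X, y):
    """BIC of a Gaussian linear regression with intercept (up to a constant shared by all models)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    k = design.shape[1]  # number of fitted coefficients
    return n * np.log(rss / n) + k * np.log(n)

# Synthetic stand-in data (assumption): only predictor 0 has a real effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(size=200)

# Every subset of the three predictors is a candidate model (8 in total).
subsets = [s for r in range(4) for s in itertools.combinations(range(3), r)]
bics = np.array([bic_linear(X[:, np.array(s, dtype=int)], y) for s in subsets])

# exp(-BIC/2) approximates the marginal likelihood; normalise to get model weights.
weights = np.exp(-0.5 * (bics - bics.min()))
weights /= weights.sum()

# Posterior inclusion probability of each predictor: total weight of models containing it.
inclusion = np.array([sum(w for s, w in zip(subsets, weights) if j in s) for j in range(3)])
best_subset = subsets[int(np.argmax(weights))]
print(best_subset, inclusion)
```

Because all candidate models are scored together, the count of models considered is built into the answer instead of being hidden.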
Another approach seems to be stepwise regression: http://en.wikipedia.org/wiki/Stepwise_regression
I see a lot of stepwise regression being used by non-statisticians, but I think statisticians themselves think it’s something of a joke. If you have more predictors than you can fit coefficients for, and want an understandable linear model, you are better off with something like LASSO.
Edit: Don’t just take my word for it; Google found this blog post for me: http://andrewgelman.com/2014/06/02/hate-stepwise-regression/
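For readers who have not met LASSO, here is a minimal sketch using plain cyclic coordinate descent in NumPy. The penalty `alpha=0.3`, the synthetic data, and the helper names are assumptions chosen for illustration; in practice you would more likely use a library implementation and pick the penalty by cross-validation.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X are standardised and y is centred, so no intercept is fitted.
    """
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # equals y - X @ b, since b starts at zero
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that ignores column j's own contribution.
            rho = X[:, j] @ resid / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, alpha) / col_sq[j]
            resid += X[:, j] * (b[j] - new_bj)  # keep resid in sync with the updated b
            b[j] = new_bj
    return b

# Synthetic stand-in data (assumption): 20 predictors, only the first 3 matter.
rng = np.random.default_rng(0)
n, p = 100, 20
X_raw = rng.normal(size=(n, p))
true_b = np.zeros(p)
true_b[:3] = [3.0, -2.0, 1.5]
y_raw = X_raw @ true_b + rng.normal(size=n)

X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y = y_raw - y_raw.mean()

b_hat = lasso_cd(X, y, alpha=0.3)
selected = np.flatnonzero(b_hat)
print(selected)
```

Unlike stepwise selection, the penalty shrinks and selects in a single optimisation, so there is no sequence of accept/reject tests whose multiplicity needs tracking.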
I concur. Stepwise regression is a very crude technique.
I find it useful as an initial filter if I have to dig through a LOT of potential predictors, but you can’t rely on it to produce a decent model.
So it wasn’t as clear with the previous link, but it seems to me that the nth step of this method doesn’t condition on the fact that the last n-1 steps failed.
If you arrayed the full might of statistics/machine learning/knowledge representation in AI/math/signal processing, and took the very best, I am very sure they could beat a linear model for a non-linear ground truth very easily. If so, maybe the right thing to do here is to emulate those people when doing data analysis, and not use the model we know to be wrong.
Proper Bayesianism will triumph! But not in the hands of everyone.
First, the structure of your model should be driven by the structure you’re observing in your data. If you are observing nonlinearities, you’d better model nonlinearities.
Second, I don’t buy that going beyond linear models is asking for statistical trouble. It just ain’t so. People who overfit can (and actually do, all the time) stuff a ton of variables into a linear model and successfully overfit this way.
And the number of terms explodes when you add non-linearities.
5 independent variables with quadratic terms give you 21 values to play with (1 constant + 5 linear + 15 quadratic); it’s much easier to justify conceptually “let’s look at quadratic terms” than “let’s add in 15 extra variables” even though the effect on degrees of freedom is the same.
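The arithmetic behind the 21 can be checked directly. This small sketch (the function name is hypothetical) splits the 15 second-order terms into pure squares and pairwise interaction terms.

```python
from math import comb

def quadratic_term_counts(n_vars):
    """Coefficient counts for a full second-order polynomial in n_vars variables."""
    constant = 1
    linear = n_vars
    squares = n_vars                # x_i ** 2
    interactions = comb(n_vars, 2)  # x_i * x_j for i < j
    return {
        "constant": constant,
        "linear": linear,
        "squares": squares,
        "interactions": interactions,
        "total": constant + linear + squares + interactions,
    }

counts = quadratic_term_counts(5)
print(counts)
```

For 5 variables that is 5 squares plus 10 interactions, which gives the 15 second-order terms and the total of 21.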
No, they don’t. You control the number of degrees of freedom in your models. If you don’t, linear models won’t help you much, and if you do linearity does not matter.
I think you’re confusing quadratic terms and interaction terms. It also seems that you’re thinking of linear models solely as linear regressions. Do you consider, e.g. GLMs to be “linear” models? What about transformations of input variables, are they disallowed in your understanding of linear models?
I’m talking about practice, not theory. And most of the practical results that I’ve seen show that regression models are full of overfitting if they aren’t linear. Even beyond human error, it seems that in many social science areas the data quality is poor enough that adding non-linearities can be seen, a priori, to be a bad thing to do.
Except of course if there is a firm reason to add a particular non-linearity to the problem.
I’m not familiar with the whole spectrum of models (regression models, beta distributions, some conjugate prior distributions, and some machine learning techniques are about all I know), so I can’t confidently speak about the general case. But, extrapolating from what I’ve seen and from known biases and incentives, I’m quite confident in predicting that generic models are much more likely to be overfitted than to have too few degrees of freedom.
Oh, I agree completely with that. However, there are a bunch of forces which make it so, starting with publication bias. Restricting the allowed classes of models isn’t going to fix the problem.
It’s like observing that teenagers overuse makeup and deciding that a good way to deal with that would be to sell lipstick only in three colors: black, brown, and red. Not only is it not a solution, it’s not even wrong :-/
Why do you believe that a straight-line fit should be the a priori default instead of e.g. a log or a power-law line fit?
I disagree; restricting the allowed classes of models would help at the very least. I would require linear models only, unless a) there is a justification for non-linear terms or b) there is enough data that the result is still significant even if we inserted all the degrees of freedom that the degree of non-linearity would allow.
In most cases I’ve seen in the social sciences, the direction of the effect is of paramount importance, the exact functional form less so. It would probably be perfectly fine to restrict to only linear, only log, or only power-law; it’s the mixing of different approaches that explodes the degrees of freedom. And in practice letting people have one or the other just allows them to test all three before reporting the best fit. So I’d say pick one class and stick with it.
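The "test all three, report the best" worry can be illustrated with a small simulation. It assumes a null world in which x and y are independent lognormal draws, and it uses the absolute correlation as the test statistic (for a simple regression with fixed sample size, significance is monotone in it). Each specification's 5% cut-off is calibrated from the simulation itself, so all the numbers here are properties of the simulated setup and not of any real dataset.

```python
import numpy as np

def abs_corr(a, b):
    """Absolute Pearson correlation between matching rows of two (n_sims, n) arrays."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = np.abs((a * b).sum(axis=1))
    den = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return num / den

rng = np.random.default_rng(0)
n_sims, n = 4000, 40

# Null world (assumption): x and y are independent and positive, so logs are defined.
x = rng.lognormal(size=(n_sims, n))
y = rng.lognormal(size=(n_sims, n))

stats = np.column_stack([
    abs_corr(x, y),                  # linear:    y ~ x
    abs_corr(np.log(x), y),          # log:       y ~ log(x)
    abs_corr(np.log(x), np.log(y)),  # power law: log(y) ~ log(x)
])

# Calibrate each specification on its own so that, used alone, it fires 5% of the time.
thresholds = np.quantile(stats, 0.95, axis=0)
significant = stats > thresholds

single_rates = significant.mean(axis=0)      # false-positive rate of each fixed choice
any_rate = significant.any(axis=1).mean()    # rate when the best of the three is reported
print(single_rates, any_rate)
```

Committing to any one column keeps the false-positive rate at the nominal level; taking whichever of the three looks best pushes it above that, bounded by the sum of the three individual rates.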
I think your condition (b) translates to “Calculate the significance correctly”, which I’m all for, linear models included :-)
Otherwise, I still think you’re confused between the model class and the model complexity (= degrees of freedom), but we’ve set out our positions and it’s fine that we continue to disagree.
It’s easy to regularize estimation in a model class that’s too rich for your data. You can’t “unregularize” a model class that’s restrictive enough not to contain an adequate approximation to the truth of what you’re modeling.
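A rough sketch of that asymmetry, assuming a sine-shaped ground truth, a degree-9 polynomial as the "rich" class, a straight line as the "restrictive" class, and an arbitrary ridge penalty of `1e-2`. All of these choices, and the helper names, are illustrative assumptions.

```python
import numpy as np

def poly_design(x, degree):
    """Design matrix with columns 1, x, x**2, ..., x**degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(design, y, lam):
    """Minimise ||y - design @ b||^2 + lam * ||b||^2 (the intercept is penalised too, for brevity)."""
    p = design.shape[1]
    return np.linalg.solve(design.T @ design + lam * np.eye(p), design.T @ y)

def train_mse(design, b, y):
    """Mean squared error on the data the model was fitted to."""
    return float(np.mean((y - design @ b) ** 2))

# Synthetic stand-in data (assumption): a non-linear truth with noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + 0.2 * rng.normal(size=30)

rich = poly_design(x, 9)    # rich class: can approximate the sine
narrow = poly_design(x, 1)  # restrictive class: straight line only

b_rich_unreg, *_ = np.linalg.lstsq(rich, y, rcond=None)
b_rich_ridge = ridge_fit(rich, y, lam=1e-2)
b_narrow, *_ = np.linalg.lstsq(narrow, y, rcond=None)

# The penalty reins in the rich model's coefficients ...
print(np.linalg.norm(b_rich_unreg), np.linalg.norm(b_rich_ridge))
# ... while no setting of the straight line can follow the curvature.
print(train_mse(narrow, b_narrow, y), train_mse(rich, b_rich_ridge, y))
```

The penalty is a dial on the rich class that can be turned up or down with the amount of data; the straight line has no corresponding dial that adds curvature.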