Stuart_Armstrong comments on Why the tails come apart

Stuart_Armstrong 31 Jul 2014 8:39 UTC
2 points
I’m talking about practice, not theory. And what most of the practical results I’ve seen show is that regression models are full of overfitting if they aren’t linear. Even beyond human error, it seems that in many social science areas the data quality is poor enough that adding non-linearities can be seen, a priori, to be a bad thing to do.

Except of course if there is a firm reason to add a particular non-linearity to the problem.

I’m not familiar with the whole spectrum of models (regression models, beta distributions, some conjugate prior distributions, and some machine learning techniques are about all I know), so I can’t confidently speak about the general case. But, extrapolating from what I’ve seen and from known biases and incentives, I’m quite confident in predicting that generic models are much more likely to be overfitted than to have too few degrees of freedom.
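
The overfitting claim can be illustrated with a small simulation. This is only a sketch under assumptions that are not part of the discussion: the true relationship is a straight line, the noise is large relative to the signal, and each sample has just 15 points. The function name `average_errors` and every parameter value are hypothetical choices made for illustration.

```python
import numpy as np

def average_errors(degree, n_train=15, n_test=200, noise_sd=1.0, trials=200, seed=0):
    """Average train/test MSE of a polynomial fit when the truth is linear.

    All settings here are illustrative assumptions, not values from the thread.
    """
    rng = np.random.default_rng(seed)
    train_mse, test_mse = [], []
    for _ in range(trials):
        # Noisy data from an assumed linear truth: y = 0.5 * x + noise.
        x_tr = rng.uniform(-1, 1, n_train)
        x_te = rng.uniform(-1, 1, n_test)
        y_tr = 0.5 * x_tr + rng.normal(0, noise_sd, n_train)
        y_te = 0.5 * x_te + rng.normal(0, noise_sd, n_test)

        # Least-squares polynomial fit of the requested degree.
        coeffs = np.polyfit(x_tr, y_tr, degree)
        train_mse.append(np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2))
        test_mse.append(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))
    return float(np.mean(train_mse)), float(np.mean(test_mse))

# Same seed, so both fits see exactly the same simulated datasets.
linear_train, linear_test = average_errors(degree=1)
flexible_train, flexible_test = average_errors(degree=6)

print(f"degree 1: train MSE {linear_train:.3f}, test MSE {linear_test:.3f}")
print(f"degree 6: train MSE {flexible_train:.3f}, test MSE {flexible_test:.3f}")
```

Because the degree-6 polynomial contains the straight line as a special case, its training error can only be lower. Under these assumptions its error on fresh data is typically higher, which is the pattern of overfitting described above.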
- Lumifer 31 Jul 2014 14:46 UTC
  3 points
  Parent
  
  I’m quite confident in predicting that generic models are much more likely to be overfitted than to have too few degrees of freedom.
  
  Oh, I agree completely with that. However, there are a bunch of forces which make it so, starting with publication bias. Restricting the allowed classes of models isn’t going to fix the problem.
  
  It’s like observing that teenagers overuse makeup and deciding that a good way to deal with that would be to sell lipstick only in three colors: black, brown, and red. Not only is it not a solution, it’s not even wrong :-/
  
  the data quality is poor enough that adding non-linearities can be seen, a priori, to be a bad thing to do.
  
  Why do you believe that a straight-line fit should be the a priori default instead of e.g. a log or a power-law line fit?
  - Stuart_Armstrong 31 Jul 2014 15:17 UTC
    1 point
    Parent
    
    Restricting the allowed classes of models isn’t going to fix the problem.
    
    I disagree; it would help at the very least. I would require linear models only, unless a) there is a justification for non-linear terms or b) there is enough data that the result is still significant even if we inserted all the degrees of freedom that the degree of non-linearities would allow.
    
    Why do you believe that a straight-line fit should be the a priori default instead of e.g. a log or a power-law line fit?
    
    In most cases I’ve seen in the social sciences, the direction of the effect is of paramount importance, the other factor less so. It would probably be perfectly fine to restrict to only linear, only log, or only power-law; it’s the mixing of different approaches that explodes the degrees of freedom. And in practice, letting people have one or the other just allows them to test all three before reporting the best fit. So I’d say pick one class and stick with it.
    - Lumifer 31 Jul 2014 15:40 UTC
      1 point
      Parent
      
      there is enough data that the result is still significant even if we inserted all the degrees of freedom that the degree of non-linearities would allow.
      
      I think this translates to “Calculate the significance correctly”, which I’m all for, linear models included :-)
      
      Otherwise, I still think you’re confusing the model class with the model complexity (= degrees of freedom), but we’ve set out our positions and it’s fine that we continue to disagree.
- othercriteria 31 Jul 2014 14:40 UTC
  0 points
  Parent
  
  I’m quite confident in predicting that generic models are much more likely to be overfitted than to have too few degrees of freedom.
  
  It’s easy to regularize estimation in a model class that’s too rich for your data. You can’t “unregularize” a model class that’s restrictive enough not to contain an adequate approximation to the truth of what you’re modeling.
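
The asymmetry described in the last comment can be sketched with a simulation. Everything specific here is an assumption made for illustration and not something stated in the thread: the truth is taken to be `sin(3x)`, the rich class is degree-9 polynomials, regularization is a ridge penalty of `1e-3` (applied to the intercept as well, for brevity), and the function name `avg_test_mse` is hypothetical.

```python
import numpy as np

def avg_test_mse(degree, lam, n_train, trials=200, noise_sd=0.3, seed=0):
    """Average test MSE of a ridge-penalised polynomial fit to a nonlinear truth.

    The truth sin(3x), the noise level, and the penalty are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        x_tr = rng.uniform(-1, 1, n_train)
        x_te = rng.uniform(-1, 1, 500)
        y_tr = np.sin(3 * x_tr) + rng.normal(0, noise_sd, n_train)
        y_te = np.sin(3 * x_te) + rng.normal(0, noise_sd, 500)

        # Polynomial features 1, x, x^2, ..., x^degree.
        X_tr = np.vander(x_tr, degree + 1, increasing=True)
        X_te = np.vander(x_te, degree + 1, increasing=True)

        # Closed-form ridge solution: (X'X + lam * I) w = X'y.
        # lam = 0 gives ordinary least squares.
        A = X_tr.T @ X_tr + lam * np.eye(degree + 1)
        w = np.linalg.solve(A, X_tr.T @ y_tr)
        errors.append(np.mean((X_te @ w - y_te) ** 2))
    return float(np.mean(errors))

# Rich class, regularised, with a small sample.
rich_ridge_small = avg_test_mse(degree=9, lam=1e-3, n_train=40)
# Restrictive (linear) class with the same small sample.
linear_small = avg_test_mse(degree=1, lam=0.0, n_train=40)
# Restrictive class given fifty times more data.
linear_large = avg_test_mse(degree=1, lam=0.0, n_train=2000)

print(f"degree 9 + ridge, n=40:   {rich_ridge_small:.3f}")
print(f"linear,           n=40:   {linear_small:.3f}")
print(f"linear,           n=2000: {linear_large:.3f}")
```

Under these assumptions the penalised degree-9 model predicts better than the straight line, and giving the straight line far more data does not close the gap, since no straight line approximates the assumed truth well. When the truth really is close to linear, as in many of the cases discussed earlier in the thread, this comparison would not favor the rich class in the same way.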