Because in many fields, linear models (even poor ones) are the best we’re going to get, with more complex models losing to overfitting.
http://psycnet.apa.org/index.cfm?fa=buy.optionToBuy&id=1979-30170-001
I don’t follow you. Overfitting happens when your model has too many parameters, relative to the amount of data you have. It is true that linear models may have few parameters compared to some non-linear models (for example linear regression models vs regression models with extra interaction parameters). But surely, we can have sparsely parameterized non-linear models as well.
All I am saying is that if things are surprising it is either due to “noise” (variance) or “getting the truth wrong” (bias). Or both.
I agree that “models we can quickly and easily use while under publish-or-perish pressure” is an important class of models in practice :). Moreover, linear models are often in this class, while a ton of very interesting non-linear models in stats are not, and thus are rarely used. It is a pity.
A technical difficulty with saying that overfitting happens when there are “too many parameters” is that the parameters may do arbitrarily complicated things. For example, they may encode arbitrary programs (say, C functions), in which case a model with a single (infinite-precision) real parameter can fit anything very well! Functions that are linear in their parameters and inputs do not suffer from this problem; the number of parameters summarizes their overfitting capacity well. The same is not true of some nonlinear functions.
To avoid confusion it may be helpful to define overfitting more precisely. The gist of any reasonable definition of overfitting is: If I randomly perturb the desired outputs of my function, how well can I find new parameters to fit the new outputs? I can’t do a good job of giving more detail than that in a short comment, but if you feel confused about overfitting, here’s a good (and famous) article about frequentist learning theory by Vladimir Vapnik that may be useful:
http://web.mit.edu/6.962/www/www_spring_2001/emin/slt.pdf
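(As a concrete illustration of the “perturb the desired outputs and see how well you can refit them” notion above, here is a minimal Python sketch. It is a crude Rademacher-style probe, not Vapnik’s formal machinery, and the models, sample sizes, and the name fit_to_noise are arbitrary placeholders.)

```python
# Crude probe of overfitting capacity: how well can each model chase pure-noise targets?
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))

def fit_to_noise(model, X, trials=20):
    """Average training R^2 on randomly drawn targets: higher = more capacity to overfit."""
    scores = []
    for _ in range(trials):
        y_noise = rng.normal(size=len(X))          # randomly perturbed "desired outputs"
        scores.append(model.fit(X, y_noise).score(X, y_noise))
    return np.mean(scores)

print("linear model fit to noise:", fit_to_noise(LinearRegression(), X))
print("deep tree    fit to noise:", fit_to_noise(DecisionTreeRegressor(), X))
# The 6-parameter linear model explains little of the noise; the unconstrained tree
# can fit it almost perfectly despite seeing exactly the same inputs.
```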
This is about “reasonable encoding” not “linearity,” though. That is, linear functions of parameters encode reasonably, but not all reasonable encodings are linear. We can define a parameter to be precisely one bit of information, and then ask for the minimum number of bits needed.
I don’t understand why people are so hung up on linearity.
Sure, technically if Alice fits a small noisy data set as y(x) = a*x+b and Bob fits it as y(x) = c*Ai(d*x) (where Ai is the Airy function) they’ve used the same number of parameters, but that won’t stop me from rolling my eyes at the latter unless he has a good first-principles reason to privilege the hypothesis.
The problem is more practical than theoretical (I don’t have the links to hand, but you can find some in my silos of expertise post). Statisticians do not adjust properly for extra degrees of freedom, so among some category of published models, the linear ones will be best. Also, it seems that linear models are very good for modelling human expertise: we might think we’re complex, but we behave pretty linearly.
“Statisticians” is a pretty large set.
I still don’t understand your original “because.” I am talking about modeling the truth, not modeling what humans do. If the truth is not linear and humans use a linear modeling algorithm, well then they aren’t a very good role model, are they?
[ edit: did not downvote. ]
Because human flaws creep into the process of modelling as well. Taking non-linear relationships into account (unless there is a causal reason to do so) is asking for statistical trouble unless you very carefully account for how many models you have tested and tried (which almost nobody does).
How do I account for how many models I’ve tested? No, really, I don’t know what that’d even be called in the statistics literature, and it seems like if a general technique for doing this were known the big data people would be all over it.
What we’re doing at the FHI is treating it like a machine learning problem: splitting the data into a training and a testing set, checking as much as we want on the training set, formulating the hypotheses, then testing them on the testing set.
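(A minimal sketch of the split-then-confirm workflow described above. The data, the correlation screen, and the single pre-registered holdout test are illustrative stand-ins, not the FHI’s actual pipeline.)

```python
# Explore freely on the training set, then spend the test set exactly once.
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 0.8 * X[:, 0] + rng.normal(size=500)              # toy ground truth

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Exploration phase: look at the training data as much as you like.
train_corrs = [stats.pearsonr(X_tr[:, j], y_tr)[0] for j in range(X_tr.shape[1])]
best_j = int(np.argmax(np.abs(train_corrs)))          # hypothesis: "variable best_j matters"

# Confirmation phase: one pre-registered check on the untouched holdout.
r, p = stats.pearsonr(X_te[:, best_j], y_te)
print(f"holdout check of variable {best_j}: r={r:.2f}, p={p:.3g}")
```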
The Bayesian approach with multiple models seems to be exactly what we need, e.g. http://www.stat.washington.edu/raftery/Research/PDF/socmeth1995.pdf
Another approach seems to be stepwise regression: http://en.wikipedia.org/wiki/Stepwise_regression
I see a lot of stepwise regression being used by non-statisticians, but I think statisticians themselves think it’s something of a joke. If you have more predictors than you can fit coefficients for, and want an understandable linear model, you are better off with something like LASSO.
Edit: Don’t just take my word for it, google found this blog post for me: http://andrewgelman.com/2014/06/02/hate-stepwise-regression/
I concur. Stepwise regression is a very crude technique.
I find it useful as an initial filter if I have to dig through a LOT of potential predictors, but you can’t rely on it to produce a decent model.
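(A minimal sketch of the LASSO alternative mentioned above: a single penalized fit that shrinks most coefficients to exactly zero, instead of greedy stepwise inclusion and removal. The data here are synthetic, with only two predictors that actually matter.)

```python
# LASSO as a one-shot sparse alternative to stepwise selection.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, d = 200, 50                                           # many candidate predictors
X = rng.normal(size=(n, d))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)   # only two truly matter

lasso = LassoCV(cv=5).fit(X, y)                          # penalty strength chosen by cross-validation
selected = np.flatnonzero(lasso.coef_)
print("predictors kept:", selected)                      # typically a small set containing 0 and 3
```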
So it wasn’t as clear with the previous link, but it seems to me that the nth step of this method doesn’t condition on the fact that the last n-1 steps failed.
If you arrayed the full might of statistics/machine learning/knowledge representation in AI/math/signal processing, and took the very best, I am very sure they could beat a linear model for a non-linear ground truth very easily. If so, maybe the right thing to do here is to emulate those people when doing data analysis, and not use the model we know to be wrong.
Proper Bayesianism will triumph! But not in the hands of everyone.
First, the structure of your model should be driven by the structure you’re observing in your data. If you are observing nonlinearities, you’d better model nonlinearities.
Second, I don’t buy that going beyond linear models is asking for statistical trouble. It just ain’t so. People who overfit can (and actually do, all the time) stuff a ton of variables into a linear model and successfully overfit this way.
And the number of terms explodes when you add non-linearities.
5 independent variables with quadratic terms give you 21 values to play with (1 constant + 5 linear + 15 quadratic); it’s much easier to justify conceptually “let’s look at quadratic terms” than “let’s add in 15 extra variables”, even though the effect on degrees of freedom is the same.
No, they don’t. You control the number of degrees of freedom in your models. If you don’t, linear models won’t help you much, and if you do, linearity does not matter.
I think you’re confusing quadratic terms and interaction terms. It also seems that you’re thinking of linear models solely as linear regressions. Do you consider, e.g., GLMs to be “linear” models? What about transformations of input variables: are they disallowed in your understanding of linear models?
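(For concreteness, a quick check of the 21-term count from a few comments up, with the second-order block broken into the squared terms and the pairwise interaction terms being distinguished in the reply above. sklearn’s PolynomialFeatures is just one convenient way to enumerate them.)

```python
# Count the terms in a full second-order expansion of 5 variables.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=True)
names = poly.fit(np.zeros((1, 5))).get_feature_names_out()
print(len(names))  # 21 = 1 constant + 5 linear + 5 squares + 10 pairwise interactions
```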
I’m talking about practice, not theory. And most of the practical results that I’ve seen are that regression models are full of overfitting if they aren’t linear. Even beyond human error, it seems that in many social science areas the data quality is poor enough that adding non-linearities can be seen, a priori, to be a bad thing to do.
Except of course if there is a firm reason to add a particular non-linearity to the problem.
I’m not familiar with the whole spectrum of models (regression models, beta distributions, some conjugate prior distributions, and some machine learning techniques are about all I know), so I can’t confidently speak about the general case. But, extrapolating from what I’ve seen and from known biases and incentives, I’m quite confident in predicting that generic models are much more likely to be overfitted than to have too few degrees of freedom.
Oh, I agree completely with that. However, there are a bunch of forces which make it so, starting with publication bias. Restricting the allowed classes of models isn’t going to fix the problem.
It’s like observing that teenagers overuse makeup and deciding that a good way to deal with that would be to sell lipstick in only three colors: black, brown, and red. Not only is it not a solution, it’s not even wrong :-/
Why do you believe that a straight-line fit should be the a priori default instead of e.g. a log or a power-law line fit?
Restricting the allowed classes of models isn’t going to fix the problem.
I disagree; it would help at the very least. I would require linear models only, unless a) there is a justification for non-linear terms, or b) there is enough data that the result is still significant even if we inserted all the degrees of freedom that the permitted non-linearities would allow.
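(A sketch of what clause (b) could look like in practice: charge the analysis for every degree of freedom a full quadratic expansion would add, via a nested-model F-test. The data are synthetic placeholders; statsmodels or R would report the same overall F-test directly.)

```python
# Nested-model F-test that "pays" for all 20 degrees of freedom a full quadratic
# expansion of 5 inputs would add, before declaring the effect significant.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
n = 120
X = rng.normal(size=(n, 5))
y = 0.8 * X[:, 0] + rng.normal(size=n)                   # toy truth: one linear effect

X_full = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)  # 20 columns
fit = LinearRegression().fit(X_full, y)
rss_full = float(np.sum((y - fit.predict(X_full)) ** 2))
rss_null = float(np.sum((y - y.mean()) ** 2))            # intercept-only baseline

df_model = X_full.shape[1]                               # 20 regression degrees of freedom
df_resid = n - df_model - 1                              # 99 residual degrees of freedom
F = ((rss_null - rss_full) / df_model) / (rss_full / df_resid)
p = stats.f.sf(F, df_model, df_resid)
print(f"F({df_model},{df_resid}) = {F:.2f}, p = {p:.3g}")
# If the effect survives this deliberately expensive test, the extra non-linear
# degrees of freedom cannot be blamed for manufacturing the significance.
```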
In most cases I’ve seen in the social sciences, the direction of the effect is of paramount importance, and the exact functional form much less so. It would probably be perfectly fine to restrict to only linear, only log, or only power-law; it’s the mixing of different approaches that explodes the degrees of freedom. And in practice, letting people have one or the other just allows them to test all three before reporting the best fit. So I’d say pick one class and stick with it.
I think this translates to “Calculate the significance correctly”, which I’m all for, linear models included :-)
Otherwise, I still think you’re conflating the model class with the model complexity (= degrees of freedom), but we’ve set out our positions and it’s fine that we continue to disagree.
It’s easy to regularize estimation in a model class that’s too rich for your data. You can’t “unregularize” a model class that’s restrictive enough not to contain an adequate approximation to the truth of what you’re modeling.
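(A minimal illustration of the asymmetry described above: the over-rich class has a penalty dial we can turn, while the straight-line class has nothing to loosen when the truth is non-linear. Synthetic data; ridge is just one example of regularization.)

```python
# A rich model class plus a penalty can reach a non-linear truth; the restricted
# straight-line class cannot be "unregularized" toward it.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=80)     # non-linear ground truth

line = LinearRegression()
rich = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                     StandardScaler(), Ridge(alpha=1.0))

print("straight line, mean CV R^2:      ", cross_val_score(line, x, y, cv=5).mean())
print("degree-10 basis + ridge, CV R^2: ", cross_val_score(rich, x, y, cv=5).mean())
# The ridge penalty (alpha) is the dial: it can be tuned by cross-validation if the
# class is too rich, but no setting of a straight line recovers sin(x).
```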
Because in many fields, linear models (even poor ones) are the best we’re going to get, with more complex models losing to overfitting.
That’s privileging a particular class of models just because they historically were easy to calculate.
If you’re concerned about overfitting you need to be careful with how many parameters you are using, but that does not translate into an automatic advantage of a linear model over, say, a log one.
The article you linked to goes back to pre-(personal)computer times, when dealing with non-linear models was often just impractical.
Because in many fields, linear models (even poor ones) are the best we’re going to get, with more complex models losing to overfitting.
I don’t think that’s true. What fields show optimal performance from linear models, where better predictions can’t be gotten from other techniques like decision trees or neural nets or ensembles of techniques?
Showing that crude linear models, with no form of regularization or priors, beat human clinical judgement doesn’t show your previous claim.
Modelling human clinical judgement is best done with linear models, for instance.
Best done? Better than, say, decision trees or expert systems or Bayesian belief networks? Citation needed.
Goldberg, Lewis R. “Simple models or simple processes? Some research on clinical judgments.” American Psychologist 23.7 (1968): 483.
1968? Seriously?
Well there’s Goldberg, Lewis R. “Five models of clinical judgment: An empirical comparison between linear and nonlinear representations of the human inference process.” Organizational Behavior and Human Performance 6.4 (1971): 458-479.
The main thing is that these old papers seem to still be considered valid, see e.g. Shanteau, James. “How much information does an expert use? Is it relevant?” Acta Psychologica 81.1 (1992): 75-86.
(It would be nice if you would link fulltext instead of providing citations; if you don’t have access to the fulltext, it’s a bad idea to cite it, and if you do, you should provide it for other people who are trying to evaluate your claims and whether the paper is relevant or wrong.)
I’ve put up the first paper at https://dl.dropboxusercontent.com/u/85192141/1971-goldberg.pdf (also https://pdf.yt/d/Ux7RZXbo0n374dUU). I don’t think this is particularly relevant: it only shows that 2 very specific equations (pg4, #3 & #4) did not outperform the linear model on a particular dataset. Too bad for Einhorn 1971.
Your second paper doesn’t support the claims:
A third possibility is that incorrect methods were used to measure the amount of information in experts’ judgments; use of the “correct” measurement method might support the Information-Use Hypothesis. In the studies reported here, four techniques were used to measure information use: protocol analysis, multiple regression analysis, analysis of variance, and self-ratings by judges. Despite differences in measurement methods, comparable results were reported. Other methodological issues might be raised, but the studies seem varied enough to rule out any artifactual explanation.
These aren’t very good methods for extracting the full measure of information.
So to summarize: reality isn’t entirely linear, so nonlinear methods frequently excel when modern developments are used to regularize and avoid overfitting (we can see this in the low prevalence of linear methods in demanding AI tasks like image recognition, or more generally in competitions like Kaggle across all sorts of domains). To the extent that humans are good predictors and classifiers too of reality, their predictions/classifications will be better mimicked by nonlinear methods. Research showing the contrary typically does not compare very good methods, and much more recent research may do much better (for example, parole/recidivism predictions by parole boards may be bad and easily improved on by linear models, but does that mean algorithms can’t do even better?). And to the extent linear methods succeed, it may reflect the lack of relevant data or the inherent randomness of results for a particular cherrypicked task.
To show your original claim (“in many fields, linear models (even poor ones) are the best we’re going to get, with more complex models losing to overfitting”), I would want to see linear models steadily beat all comers, from random forests to deep neural networks to ensembles of all of the above, on a wide variety of large datasets. I don’t think you can show that.
I tend to agree with you about models, once overfitting is sorted.
to the extent that humans are good predictors and classifiers too of reality, their predictions/classifications will be better mimicked by nonlinear methods
This I’ve still seen no evidence for.
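(The sort of head-to-head check the “beat all comers” comment above asks for might look like the sketch below: one real dataset, sklearn’s California housing, chosen only because it ships with the library, a regularized linear baseline, and a single non-linear competitor, all scored out of sample. One dataset proves nothing about “many fields”; it only shows the shape of the comparison.)

```python
# One instance of the linear-vs-nonlinear comparison, scored out of sample.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

X, y = fetch_california_housing(return_X_y=True)   # downloaded on first use

for name, model in [
    ("ridge regression (linear)", RidgeCV()),
    ("random forest (non-linear)", RandomForestRegressor(n_estimators=200, random_state=0)),
]:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:27s} mean CV R^2 = {score:.2f}")
```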