Shmi comments on 2013 Survey Results

Shmi 19 Jan 2014 4:15 UTC
20 points

Yvain is not hugely on board with the idea of running correlations between everything and seeing what sticks, but will grudgingly publish the results because of the very high bar for significance (p < .001 on ~800 correlations suggests < 1 spurious result) and because he doesn’t want to have to do it himself.

The standard way to fix this is to run the correlations on only half the data and then test their predictive power on the other half. This eliminates almost all spurious correlations, since a chance pattern in one half is unlikely to recur in the other.
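
For concreteness, here is a minimal sketch of that split-half idea in plain Python. It is not what was run on the survey data: the function names (`pearson_r`, `approx_p`, `split_half_screen`), the 0.05 thresholds, the Fisher z approximation for the p-value, and the synthetic columns are all assumptions made up for illustration.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p(r, n):
    """Approximate two-sided p-value for r via the Fisher z-transform."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def split_half_screen(columns, alpha_explore=0.05, alpha_confirm=0.05, seed=0):
    """columns: dict mapping a name to a list of floats (all the same length).

    Step 1: fish freely for correlated pairs on a random half of the rows.
    Step 2: retest only those pairs on the untouched other half.
    """
    n = len(next(iter(columns.values())))
    rows = list(range(n))
    random.Random(seed).shuffle(rows)
    explore, confirm = rows[: n // 2], rows[n // 2:]

    def take(name, subset):
        return [columns[name][i] for i in subset]

    candidates = []
    for a, b in combinations(columns, 2):
        r = pearson_r(take(a, explore), take(b, explore))
        if approx_p(r, len(explore)) < alpha_explore:
            candidates.append((a, b, r))

    confirmed = []
    for a, b, r_explore in candidates:
        r = pearson_r(take(a, confirm), take(b, confirm))
        p = approx_p(r, len(confirm))
        # Only the surviving candidates are corrected for, and the sign must match.
        if p < alpha_confirm / len(candidates) and r * r_explore > 0:
            confirmed.append((a, b))
    return candidates, confirmed


# Synthetic demo: ten pure-noise columns plus one genuinely related pair.
rng = random.Random(1)
n_rows = 400
columns = {f"noise{i}": [rng.gauss(0, 1) for _ in range(n_rows)] for i in range(10)}
columns["x"] = [rng.gauss(0, 1) for _ in range(n_rows)]
columns["y"] = [v + rng.gauss(0, 0.5) for v in columns["x"]]

candidates, confirmed = split_half_screen(columns)
print(len(candidates), "candidates;", len(confirmed), "confirmed:", confirmed)
```

The point of the design is that the second half only has to be corrected for the handful of pairs that survived step 1, not for every pair that was ever looked at.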
- Nominull 19 Jan 2014 4:59 UTC
  14 points
  Parent
  Does that actually work better than just setting a higher bar for significance? My gut says that data is data and chopping it up cleverly can’t work magic.
  - Dan_Weinand 19 Jan 2014 5:53 UTC
    13 points
    Parent
    Cross-validation is actually hugely useful for predictive models. For a simple correlation like this, it’s less of a big deal. But if you are fitting a locally weighted linear regression line, for instance, chopping the data up is absolutely standard operating procedure.
  - ChristianKl 19 Jan 2014 16:04 UTC
    0 points
    Parent
    
    Does that actually work better than just setting a higher bar for significance? My gut says that data is data and chopping it up cleverly can’t work magic.
    
    How do you decide how high to set your bar for significance? It is very hard to estimate how high it needs to be, because that depends on how you go fishing in your data. The advantage of the two-step procedure is that you are completely free to fish however you want in the first step. There are even cases where you might want a three-step procedure.
- Kawoomba 19 Jan 2014 8:48 UTC
  9 points
  Parent
  Alternatively, Bonferroni correction.
  - Pablo 19 Jan 2014 9:51 UTC
    13 points
    Parent
    That’s roughly what Yvain did, by taking into consideration the number of correlations tested when setting the significance level.
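
    To spell out the arithmetic, assuming m = 800 tests at a per-test level of .001, and (purely as an illustrative assumption, not a figure from the survey post) a family-wise level of .05 for the Bonferroni comparison:

    ```latex
    \mathbb{E}[\text{false positives}] \le m\,\alpha = 800 \times 0.001 = 0.8,
    \qquad
    \alpha_{\text{Bonferroni}} = \frac{\alpha_{\text{FW}}}{m} = \frac{0.05}{800} = 6.25 \times 10^{-5}.
    ```

    So the .001 bar keeps the expected number of spurious hits below one, while a strict Bonferroni correction at a .05 family-wise level would demand a bar roughly sixteen times lower.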