Yvain is not hugely on board with the idea of running correlations between everything and seeing what sticks, but will grudgingly publish the results because of the very high bar for significance (p < .001 on ~800 correlations suggests < 1 spurious result) and because he doesn’t want to have to do it himself.
The standard way to fix this is to run them on half the data only and then test their predictive power on the other half. This eliminates almost all spurious correlations.
Does that actually work better than just setting a higher bar for significance? My gut says that data is data and chopping it up cleverly can’t work magic.
Cross-validation is actually hugely useful for predictive models. For a simple correlation like this, it’s less of a big deal. But if you are fitting a locally weighted linear regression, for instance, chopping the data up is absolutely standard operating procedure.
How do you decide how high to hang your bar for significance? It is very hard to estimate how high you have to hang it, because that depends on how you go fishing in your data.
The advantage of the two-step procedure is that you are completely free to fish however you want in the first step. There are even cases where you might want a three-step procedure.
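To make the two-step idea concrete, here is a minimal sketch on made-up data, not the survey itself. It assumes 40 unrelated noise variables (780 pairs, close to the ~800 correlations mentioned above) and 1000 hypothetical respondents, and it gets p-values from the Fisher z approximation rather than an exact test. The function names and both alpha levels are illustrative choices, not anything from the original analysis.

```python
import math
import random
from itertools import combinations

def pearson_r(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def p_value(r, n):
    """Approximate two-sided p-value for r using the Fisher z transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def split_half_survivors(columns, explore_alpha=0.05, confirm_alpha=0.05):
    """Fish freely in the first half, then retest only the hits in the second."""
    n = len(columns[0])
    half = n // 2
    candidates, survivors = [], []
    for i, j in combinations(range(len(columns)), 2):
        r1 = pearson_r(columns[i][:half], columns[j][:half])
        if p_value(r1, half) < explore_alpha:
            candidates.append((i, j))
            r2 = pearson_r(columns[i][half:], columns[j][half:])
            # Require the same sign as well as significance in the holdout half.
            if r1 * r2 > 0 and p_value(r2, n - half) < confirm_alpha:
                survivors.append((i, j))
    return candidates, survivors

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical data: 40 unrelated noise variables, 1000 respondents each.
noise = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(40)]
candidates, survivors = split_half_survivors(noise)
print(len(candidates), "pairs looked significant in the first half")
print(len(survivors), "of those held up in the second half")
```

Since every variable here is pure noise, anything that survives is spurious by construction. The first half should flag on the order of 5% of the pairs, and only a small fraction of those should pass the second test. The price is that each test only sees half the sample, which is the trade-off against simply raising the bar on the full data.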
Alternatively, Bonferroni correction.
That’s roughly what Yvain did, by taking into consideration the number of correlations tested when setting the significance level.
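For anyone who wants the arithmetic behind "roughly", here is a small sketch. The only numbers taken from the thread are the ~800 correlations and the p < .001 cutoff. The 0.05 family-wise level is an assumption (the conventional default), and the function name is made up for illustration.

```python
def bonferroni_threshold(family_alpha, n_tests):
    """Per-test cutoff that keeps the family-wise error rate at or below family_alpha."""
    return family_alpha / n_tests

n_tests = 800            # approximate number of correlations run
per_test_alpha = 0.001   # the cutoff actually used

# Expected number of false positives if every null hypothesis were true.
expected_spurious = n_tests * per_test_alpha

# Strict Bonferroni cutoff at an assumed family-wise level of 0.05.
strict_cutoff = bonferroni_threshold(0.05, n_tests)

print(f"expected spurious hits at p < {per_test_alpha}: {expected_spurious:.1f}")
print(f"strict Bonferroni cutoff for 0.05 family-wise: {strict_cutoff:.2e}")
```

So p < .001 keeps the expected number of spurious hits below one, but it is looser than a strict Bonferroni correction at the 0.05 level, which would ask for p below about 0.00006 on each test.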