The lying p value

Quick check: do you agree or disagree with the following statement:
If a study finds a result significant at the p=0.05 level, that means the authors have followed a methodology which produces this conclusion correctly 95 % of the time.
Yes or no? Keep that in mind, and we’ll get back to it.
I’m reading the Fisher book where he popularised the p-value[1], and I noticed he’s actually quite sensible about it:
The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available.
He is talking here about the normal distribution, and saying that if you have two dice that somehow generate normally distributed numbers, and you get an unexpectedly large value from the d6, you should double-check that you are not accidentally throwing the d20. Makes complete sense.
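As a quick sanity check of the numbers in that quote (my own sketch in Python with scipy, not something from the book): both the 1.96 and the “once in 22 trials” fall straight out of the normal distribution.

```python
from scipy.stats import norm

# The two-sided 5 % point of the standard normal: 1.96, "or nearly 2".
print(norm.ppf(1 - 0.05 / 2))   # ~1.96

# Probability of a deviation exceeding two standard deviations in either direction.
p_two_sigma = 2 * norm.sf(2)
print(p_two_sigma)              # ~0.0455
print(1 / p_two_sigma)          # ~22 -- Fisher's "once in 22 trials"
```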
It turns out that many situations in hypothesis testing are equivalent to “Wait, am I still holding a d6?”, so this is a useful rule of thumb.
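To make the dice picture concrete, here is a small simulation (my own illustration, with the “d6” and “d20” standing in for a narrow and a wide normal distribution; the parameters are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# The "d6": the distribution we believe we are sampling from.
d6 = rng.normal(loc=0, scale=1, size=n)
# The "d20": the wider distribution we might be holding by mistake.
d20 = rng.normal(loc=0, scale=3, size=n)

# Flag anything beyond two standard deviations of the assumed ("d6") distribution.
threshold = 2.0

print((np.abs(d6) > threshold).mean())   # ~0.046: false alarms, about once in 22 throws
print((np.abs(d20) > threshold).mean())  # ~0.50: the wrong die trips the alarm half the time
```

A large deviation on its own proves little, but it is a cheap trigger for checking which die is actually in your hand.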
But! In science there are so many other things that can go wrong. For example, Bakker and Wicherts[2] found that 15 % of the studies they looked at drew the wrong conclusion due to making dumb mistakes in computing the significance threshold.
Think about that! The significance test pales in comparison.
Regardless of what level of significance is used in the hypothesis test, and regardless of the accuracy effects of selection pressure, the base rate of getting the most fundamental maths of the last step right is only 85 %. Other problems are piled on top of that[3], so no, a significant result at p=0.05 means nothing. It’s just a sign that you might be holding another die and that it is time to double-check, e.g. through replication or further investigation.
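As a rough back-of-the-envelope (my own arithmetic, and it assumes the 15 % miscalculation rate is independent of the test’s nominal 5 % false-positive rate):

```python
# Nominal false-positive rate of the significance test itself.
alpha = 0.05
# Fraction of studies whose conclusion flips because the significance maths was done
# wrong -- the Bakker & Wicherts figure quoted above.
p_miscalc = 0.15

# Chance that at least one of the two steps went wrong, under the independence assumption.
p_something_wrong = 1 - (1 - alpha) * (1 - p_miscalc)
print(p_something_wrong)  # ~0.19: the 5 % is not the number doing the damage
```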
[1] Statistical Methods for Research Workers; Fisher; Oliver & Boyd; 1925.
[2] The (mis)reporting of statistical results in psychology journals; Bakker, Wicherts; Behavior Research Methods; 2011.
[3] Consider, for example, the Forbes report that 88 % of spreadsheets contain errors.