Another entry from the ‘no one understands p-values’ files; “Policy: Twenty tips for interpreting scientific claims”, Sutherland et al 2013, Nature—there’s a lot to like in this article, and it’s definitely worth remembering most of the 20 tips, except for the one on p-values:
Significance is significant. Expressed as P, statistical significance is a measure of how likely a result is to occur by chance. Thus P = 0.01 means there is a 1-in-100 probability that what looks like an effect of the treatment could have occurred randomly, and in truth there was no effect at all. Typically, scientists report results as significant when the P-value of the test is less than 0.05 (1 in 20).
Whups. p=0.01 does not mean our subjective probability that the effect is zero is now just 1%, and there’s a 99% chance the effect is non-zero.
(The Bayesian probability could be very small or very large depending on how you set it up; if your prior is small, then data with p=0.01 will not shift your probability very much, for exactly the reason Sutherland et al 2013 explain in their section on base rates!)
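To see why, here is a minimal base-rate sketch (the numbers are made up for illustration, and "p < 0.01" is treated as the observed event): given a prior fraction of true nulls and a study's power, Bayes' rule gives the posterior probability that there is no effect after a 'significant' result.

```python
# Minimal base-rate sketch (illustrative numbers only, not from the article).
# Treat "p < 0.01" as the observed event and ask: how probable is "no effect"
# afterwards, given a prior fraction of true nulls and the study's power?

def prob_null_given_significant(prior_null, alpha=0.01, power=0.8):
    """P(no effect | p < alpha) by Bayes' rule."""
    false_positive = prior_null * alpha        # null true, yet p < alpha
    true_positive = (1 - prior_null) * power   # real effect, detected
    return false_positive / (false_positive + true_positive)

for prior_null in (0.5, 0.9, 0.99):
    post = prob_null_given_significant(prior_null)
    print(f"prior P(no effect) = {prior_null:.2f} -> "
          f"P(no effect | p < 0.01) = {post:.3f}")
# 0.50 -> 0.012, 0.90 -> 0.101, 0.99 -> 0.553: only with a generous prior does
# the posterior come anywhere near the "1-in-100" reading in the quote.
```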
“Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence”, McShane & Gal 2015:
Statistical training helps individuals analyze and interpret data. However, the emphasis placed on null hypothesis significance testing in academic training and reporting may lead researchers to interpret evidence dichotomously rather than continuously. Consequently, researchers may either disregard evidence that fails to attain statistical significance or undervalue it relative to evidence that attains statistical significance. Surveys of researchers across a wide variety of fields (including medicine, epidemiology, cognitive science, psychology, business, and economics) show that a substantial majority does indeed do so. This phenomenon is manifest both in researchers’ interpretations of descriptions of evidence and in their likelihood judgments. Dichotomization of evidence is reduced though still present when researchers are asked to make decisions based on the evidence, particularly when the decision outcome is personally consequential. Recommendations are offered.
...Formally defined as the probability of observing data as extreme or more extreme than that actually observed assuming the null hypothesis is true, the p-value has often been misinterpreted as, inter alia, (i) the probability that the null hypothesis is true, (ii) one minus the probability that the alternative hypothesis is true, or (iii) one minus the probability of replication (Bakan 1966, Sawyer and Peter 1983, Cohen 1994, Schmidt 1996, Krantz 1999, Nickerson 2000, Gigerenzer 2004, Kramer and Gigerenzer 2005).
...As an example of how dichotomous thinking manifests itself, consider how Messori et al. (1993) compared their findings with those of Hommes et al. (1992):
The result of our calculation was an odds ratio of 0.61 (95% CI [confidence interval]: 0.298–1.251; p>0.05); this figure differs greatly from the value reported by Hommes and associates (odds ratio: 0.62; 95% CI: 0.39–0.98; p<0.05)... we concluded that subcutaneous heparin is not more effective than intravenous heparin, exactly the opposite to that of Hommes and colleagues. (p. 77)
In other words, Messori et al. (1993) conclude that their findings are “exactly the opposite” of Hommes et al. (1992) because their odds ratio estimate failed to attain statistical significance whereas that of Hommes et al. attained statistical significance. In fact, however, the odds ratio estimates and confidence intervals of Messori et al. and Hommes et al. are highly consistent (for additional discussion of this example and others, see Rothman et al. 1993 and Healy 2006).
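Just how consistent are the two results? A quick back-of-the-envelope check (my own sketch, using only the numbers quoted above and the usual normal approximation on the log-odds scale):

```python
import math

# Back-of-the-envelope check (my sketch, not from the paper): recover rough
# standard errors of the log odds ratios from the reported 95% CIs, then ask
# whether the two estimates actually differ from each other.

def log_or_and_se(or_point, ci_low, ci_high):
    """Point estimate and approximate SE on the log-odds-ratio scale."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    return math.log(or_point), se

messori_est, messori_se = log_or_and_se(0.61, 0.298, 1.251)  # "not significant"
hommes_est, hommes_se = log_or_and_se(0.62, 0.39, 0.98)      # "significant"

diff = messori_est - hommes_est
se_diff = math.hypot(messori_se, hommes_se)
print(f"difference in log odds ratios: {diff:+.3f}, z = {diff / se_diff:+.2f}")
# |z| is ~0.04: the two estimates are nearly identical; Messori et al. simply
# have a wider CI that crosses 1.0, which is all that flips "significant"
# to "non-significant".
```

In other words, the difference between "significant" and "not significant" is not itself statistically significant, as Gelman & Stern put it.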
via Gelman:
In a forthcoming paper, my colleague David Gal and I survey top academics across a wide variety of fields including the editorial board of Psychological Science and authors of papers published in the New England Journal of Medicine, the American Economic Review, and other top journals. We show:
Researchers interpret p-values dichotomously (i.e., focus only on whether p is below or above 0.05).
They fixate on them even when they are irrelevant (e.g., when asked about descriptive statistics).
These findings apply to likelihood judgments about what might happen to future subjects as well as to choices made based on the data.
We also show they ignore the magnitudes of effect sizes.
Graph (from Gelman's post) of how a p-value crossing the significance threshold dramatically increases the rate at which researchers choose that option, regardless of effect size: http://andrewgelman.com/wp-content/uploads/2016/04/Screen-Shot-2016-04-06-at-3.03.29-PM-1024x587.png