No one understands p-values: “Unfounded Fears: The Great Power-Line Cover-Up Exposed”, IEEE 1996, on the electricity/cancer panic (emphasis added to the parts clearly committing the misunderstanding of interpreting p-values as having anything at all to do with probability of a fact or with subjective beliefs):

Unless the number of cases is very large, an apparent cluster can rarely be distinguished from a pure chance occurrence. Thus epidemiologists check for statistical significance of data, usually at the 95% level. They use statistical tools to help distinguish chance occurrences (like the “runs” of numbers on the dice throws above) from non-random increases, i.e. those due to an external cause. If pure chance cannot be excluded with at least 95% certainty, as is very frequently the case in EMF studies, the result is usually called not significant. The observation may not mean a thing outside the specific population studied. Most often the statistical information available is expressed as an odds ratio (OR) and confidence interval (CI). The OR is the estimate of an exposed person’s risk of the disease in question relative to an unexposed person’s risk of the same disease. The CI is the range of ORs within which the true OR is 95% likely to lie, and when the CI includes 1.0 (no difference in risk), the OR is commonly defined as not statistically significant...

Mr. Brodeur notes, “the 50% increased risk of leukemia they observed in the highest exposure category—children in whose bedrooms magnetic fields of two and two-thirds milligauss or above were recorded—was not considered to be statistically significant”, as though this is an opinion. It is, however, a statement with a particular mathematical definition. The numbers of cases and controls in each category limit the certainty of the results, so that it cannot be said with 95% certainty that the association seen is not a pure chance occurrence. In fact, it is within a 95% probability that the association is really inverse and residence in such high fields (compared to the rest of the population) actually protects against cancer.
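To make the OR/CI machinery concrete, here is a minimal Python sketch using the standard Woolf (log-scale) approximation; the 2×2 counts are hypothetical, not the actual study data:

    # Illustrative only: hypothetical 2x2 counts, NOT the study's actual data.
    # Rows: exposed / unexposed; columns: cases / controls.
    from math import exp, log, sqrt

    a, b = 12, 150    # exposed:   cases, controls
    c, d = 40, 750    # unexposed: cases, controls

    odds_ratio = (a * d) / (b * c)

    # Woolf (log-normal) approximation to the 95% confidence interval:
    se_log_or = sqrt(1/a + 1/b + 1/c + 1/d)
    lo = exp(log(odds_ratio) - 1.96 * se_log_or)
    hi = exp(log(odds_ratio) + 1.96 * se_log_or)

    print(f"OR = {odds_ratio:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
    # A CI that straddles 1.0 is conventionally reported as "not statistically
    # significant": a statement about the data and the sample size, not about
    # whether the effect is real.
    print("CI includes 1.0" if lo < 1.0 < hi else "CI excludes 1.0")

With these made-up counts the OR is 1.5, but the interval runs from roughly 0.8 to 2.9, so the elevated point estimate gets labeled “not significant” purely because the counts are small.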
No one understands p-values, not even the ones who use Bayesian methods in their other work… From “When Is Evidence Sufficient?”, Claxton et al 2005:
Classical statistics addresses this problem by calculating the probability that any difference observed between the treatment and the comparator (in this case the placebo) reflects noise rather than a “real” difference. Only if this probability is sufficiently small—typically 5 percent—is the treatment under investigation declared superior. In the example of the pain medication, a conventional decisionmaker would therefore reject adoption of this new treatment if the chance that the study results represent noise exceeds 5 percent...

For example, suppose we know that the new pain medication has a low risk of side effects, low cost, and the possibility of offering relief for patients with severe symptoms. In that case, does it really make sense to hold the candidate medication to the stringent 5 percent adoption criterion? Similarly, let us suppose that there is a candidate medication for patients with a terminal illness. If the evidence suggesting that it works has a 20 percent chance of representing only noise (and hence an 80 percent chance that the observed efficacy is real), does it make sense to withhold it from patients who might benefit from its use?
Another fun one is a piece which quotes someone making the classic misinterpretation and then someone else immediately correcting them. From “Drug Trials: Often Long On Hype, Short on Gains; The delusion of ‘significance’ in drug trials”:

Part of the problem, said Alex Adjei, PhD, the senior vice president of clinical research and professor and chair of the Department of Medicine at Roswell Park Cancer Institute in Buffalo, N.Y., is that oncology has lost focus on what exactly a P value means. “A P value of less than 0.05 simply means that there is less than a 5% chance that the difference between two medications—whatever it is—is not real, that it’s just chance. If there’s a four-week overall survival difference between two drugs and my P value is less than 0.05, it’s statistically significant, but that just means that the number in the study is large enough to tell me that the difference I’m seeing is not by chance. It doesn’t tell me if those additional four weeks are clinically significant.”
“P values are even more complicated than that,” said Dr. Berry. “No one understands P values, because they are fundamentally non-understandable.” (He elaborates on this problem in “Multiplicities in Cancer Research: Unique and Necessary Evils,” a commentary in August in the Journal of the National Cancer Institute [2012;104:1125-1133].)
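Dr. Adjei’s point about sample size is easy to see numerically. A minimal sketch, assuming (arbitrarily) a fixed 4-week difference in mean survival and a 40-week standard deviation in each arm: the same clinically small difference flips from “not significant” to “highly significant” purely as the number of patients grows.

    # Hypothetical numbers only: a fixed 4-week mean survival difference,
    # tested at three different sample sizes with a simple two-sample z-test.
    from math import sqrt
    from statistics import NormalDist

    diff_weeks = 4.0    # assumed true difference in mean overall survival
    sd_weeks   = 40.0   # assumed standard deviation of survival in each arm

    for n_per_arm in (50, 500, 5000):
        se = sd_weeks * sqrt(2 / n_per_arm)     # SE of the difference in means
        z  = diff_weeks / se
        p  = 2 * (1 - NormalDist().cdf(z))      # two-sided p-value
        verdict = "significant" if p < 0.05 else "not significant"
        print(f"n = {n_per_arm:5d} per arm: p = {p:.4f} ({verdict})")

The effect is identical in every row (4 weeks); only the p-value changes, which is why statistical significance by itself says nothing about clinical significance.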
Also fun, “You do not understand what a p-value is (p < 0.001)”:

Here’s what the p-value is not: “The probability that the null-hypothesis was true.” I didn’t choose this definition out of thin air to beat up on, it was the correct answer on a test I took asking, “Which of these is the definition of a p-value?”
Another entry from the ‘no one understands p-values’ files; “Policy: Twenty tips for interpreting scientific claims”, Sutherland et al 2013, Nature—there’s a lot to like in this article, and it’s definitely worth remembering most of the 20 tips, except for the one on p-values:
Significance is significant. Expressed as P, statistical significance is a measure of how likely a result is to occur by chance. Thus P = 0.01 means there is a 1-in-100 probability that what looks like an effect of the treatment could have occurred randomly, and in truth there was no effect at all. Typically, scientists report results as significant when the P-value of the test is less than 0.05 (1 in 20).
Whups. p=0.01 does not mean our subjective probability that the effect is zero is now just 1%, and there’s a 99% chance the effect is non-zero.
(The Bayesian probability could be very small or very large depending on how you set it up; if your prior is small, then data with p=0.01 will not shift your probability very much, for exactly the reason Sutherland et al 2013 explains in their section on base rates!)
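A quick base-rate calculation makes this concrete; the prior and power below are assumed values chosen only for illustration:

    # How often is a p <= 0.01 result a real effect? It depends on the base rate.
    prior_real = 0.10   # assume only 10% of tested effects are real
    power      = 0.50   # assume 50% of real effects reach p <= 0.01
    alpha      = 0.01   # false-positive rate at the p <= 0.01 threshold

    posterior_real = (power * prior_real) / (
        power * prior_real + alpha * (1 - prior_real))
    print(f"P(effect is real | p <= 0.01) ~= {posterior_real:.2f}")
    # ~0.85 under these assumptions, not 0.99; with a more skeptical prior
    # (say 1% of effects real), it falls to roughly one in three.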
“Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence”, McShane & Gal 2015:

Statistical training helps individuals analyze and interpret data. However, the emphasis placed on null hypothesis significance testing in academic training and reporting may lead researchers to interpret evidence dichotomously rather than continuously. Consequently, researchers may either disregard evidence that fails to attain statistical significance or undervalue it relative to evidence that attains statistical significance. Surveys of researchers across a wide variety of fields (including medicine, epidemiology, cognitive science, psychology, business, and economics) show that a substantial majority does indeed do so. This phenomenon is manifest both in researchers’ interpretations of descriptions of evidence and in their likelihood judgments. Dichotomization of evidence is reduced though still present when researchers are asked to make decisions based on the evidence, particularly when the decision outcome is personally consequential. Recommendations are offered.
...Formally defined as the probability of observing data as extreme or more extreme than that actually observed assuming the null hypothesis is true, the p-value has often been misinterpreted as, inter alia, (i) the probability that the null hypothesis is true, (ii) one minus the probability that the alternative hypothesis is true, or (iii) one minus the probability of replication (Bakan 1966, Sawyer and Peter 1983, Cohen 1994, Schmidt 1996, Krantz 1999, Nickerson 2000, Gigerenzer 2004, Kramer and Gigerenzer 2005).
...As an example of how dichotomous thinking manifests itself, consider how Messori et al. (1993) compared their findings with those of Hommes et al. (1992):
The result of our calculation was an odds ratio of 0.61 (95% CI [confidence interval]: 0.298–1.251; p>0.05); this figure differs greatly from the value reported by Hommes and associates (odds ratio: 0.62; 95% CI: 0.39–0.98; p<0.05)...we concluded that subcutaneous heparin is not more effective than intravenous heparin, exactly the opposite to that of Hommes and colleagues. (p. 77)
In other words, Messori et al. (1993) conclude that their findings are “exactly the opposite” of Hommes et al. (1992) because their odds ratio estimate failed to attain statistical significance whereas that of Hommes et al. attained statistical significance. In fact, however, the odds ratio estimates and confidence intervals of Messori et al. and Hommes et al. are highly consistent (for additional discussion of this example and others, see Rothman et al. 1993 and Healy 2006).
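One way to see that consistency is to back out approximate standard errors from the two reported 95% CIs and ask whether the two estimates differ from each other at all; a rough sketch using the usual log-OR normal approximation:

    # Compare the two reported odds ratios using their published 95% CIs.
    from math import log, sqrt
    from statistics import NormalDist

    def log_or_and_se(or_, lo, hi):
        """Point estimate and approximate SE on the log scale, from a 95% CI."""
        return log(or_), (log(hi) - log(lo)) / (2 * 1.96)

    messori = log_or_and_se(0.61, 0.298, 1.251)   # reported as "not significant"
    hommes  = log_or_and_se(0.62, 0.39, 0.98)     # reported as "significant"

    diff   = messori[0] - hommes[0]
    se     = sqrt(messori[1] ** 2 + hommes[1] ** 2)
    z      = diff / se
    p_diff = 2 * (1 - NormalDist().cdf(abs(z)))
    print(f"difference in log OR = {diff:+.3f}, z = {z:.2f}, p ~= {p_diff:.2f}")
    # p is about 0.97: the two studies estimate essentially the same effect;
    # they differ only in precision, so "exactly the opposite" conclusions
    # do not follow.

The odds ratios and intervals are the ones quoted above; the only approximation is treating the published intervals as symmetric normal intervals on the log scale.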
via Gelman:

In a forthcoming paper, my colleague David Gal and I survey top academics across a wide variety of fields including the editorial board of Psychological Science and authors of papers published in the New England Journal of Medicine, the American Economic Review, and other top journals. We show:
Researchers interpret p-values dichotomously (i.e., focus only on whether p is below or above 0.05).
They fixate on them even when they are irrelevant (e.g., when asked about descriptive statistics).
These findings apply to likelihood judgments about what might happen to future subjects as well as to choices made based on the data.
We also show they ignore the magnitudes of effect sizes.
Graph of how a p-value crossing a threshold dramatically increases choosing that option, regardless of effect size: http://andrewgelman.com/wp-content/uploads/2016/04/Screen-Shot-2016-04-06-at-3.03.29-PM-1024x587.png