Empirical estimates suggest most published medical research is true
http://arxiv.org/abs/1301.3718

I don’t think it works in the sense of refuting the earlier results by Ioannidis and others.
Remember that much of that previous work is based on looking at replication rates and changes as sample sizes increase—so actually empirical in a meaningful way.
This simply aggregates all p-values, takes them at face value, and tries to infer what the false positive rate ‘should’ be. It doesn’t seem to account in any way for the many systematic errors, biases, and other problems in the process, and it only covers false positives, not false negatives (so it ignores issues of statistical power, which is a serious problem in psychology at least, although I think medical trials are better powered).
I’d take their estimate of a 17% false positive rate as a lower bound.
I also question some other aspects; for example, they dismiss the idea that the false positive rate is increasing because the test for a trend only reaches p = 0.18. But if you look at pg. 11, every journal sees a net increase in its false positive rate from the beginning of their sample to the end, although there’s enough variation that the beginning/end difference doesn’t reach p = 0.05. So there is a clear trend here, and I have to wonder: if they looked at more than 5 journals over a decade, would the extra data make it hit significance? (A 0.5% increase each year is very troubling, since it implies very bad things for the long term.)
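To give a sense of scale, here’s a crude linear extrapolation; the 17% baseline and the 0.5 percentage point/year slope are just the paper’s own numbers naively projected forward, which is my assumption and not anything they claim:

    # Back-of-the-envelope only: naively project the ~17% estimate forward,
    # assuming the ~0.5 percentage point/year increase were real and linear.
    years <- c(10, 20, 30)
    0.17 + 0.005 * years   # 0.22 0.27 0.32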
I liked their data collection strategy, though; scraping—not just for hackers!
We wrote a computer program in the R statistical programming language (http://www.R-project.org/) to collect the abstracts of all papers published in The Lancet, The Journal of the American Medical Association (JAMA), The New England Journal of Medicine (NEJM), The British Medical Journal (BMJ), and The American Journal of Epidemiology (AJE) between 2000 and 2010. Our program then parsed the text of these abstracts to identify all instances of the phrases “P =”, “P <”, “P ≤”, allowing for a space or no space between “P” and the comparison symbols. Our program then extracted both the comparison symbol and the numeric symbol following the comparison symbol. We scraped all reported P-values in abstracts, independent of study type. The P-values were scraped from http://www.ncbi.nlm.nih.gov/pubmed/ on January 24, 2012. A few manual changes were performed to correct errors in the reported P-values due to variations in the reporting of scientific notation as detailed in the R code. To validate our procedure, we selected a random sample (using the random number generator in R) of abstracts and compared our collected P-values to the observed P-values manually. The exact R code used for scraping and sampling and the validated abstracts are available in the Supplemental Material.
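Their actual scraping and validation code is in the Supplemental Material; purely as an illustration of the kind of regex extraction they describe, and not their program, something like this works in R:

    # Illustrative sketch only -- not the authors' code.
    # Pull reported p-values ("P =", "P <", "P ≤", space optional) out of abstract text.
    extract_pvalues <- function(abstract) {
      pattern <- "[Pp]\\s?(=|<|≤)\\s?([0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?)"
      m <- regmatches(abstract, gregexpr(pattern, abstract))[[1]]
      data.frame(comparison = gsub(pattern, "\\1", m),
                 p          = as.numeric(gsub(pattern, "\\2", m)))
    }

    extract_pvalues("Mortality was lower with treatment (P<0.001) but not vs. placebo (P = 0.38).")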
Yep, I agree. This is definitely an (optimistic) lower limit. Good that these studies are gaining attention, though a systemic change would be needed to get us out of this.
Gelman’s comments:

Now to the details of the paper. Based on the word “empirical” in the title, I thought the authors were going to look at a large number of papers with p-values and then follow up and see if the claims were replicated. But no, they don’t follow up on the studies at all! What they seem to be doing is collecting a set of published p-values and then fitting a mixture model to this distribution, a mixture of a uniform distribution (for null effects) and a beta distribution (for non-null effects). Since only statistically significant p-values are typically reported, they fit their model restricted to p-values less than 0.05.

But this all assumes that the p-values have this stated distribution. You don’t have to be Uri Simonsohn to know that there’s a lot of p-hacking going on. Also, as noted above, the problem isn’t really effects that are exactly zero, the problem is that a lot of effects are lost in the noise and are essentially undetectable given the way they are studied... So, no, I don’t at all believe Jager and Leek when they write, “we are able to empirically estimate the rate of false positives in the medical literature and trends in false positive rates over time.” They’re doing this by basically assuming the model that is being questioned, the textbook model in which effects are pure and in which there is no p-hacking.
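For concreteness, here is a minimal version of the uniform-plus-beta mixture fit Gelman is describing; this is my reconstruction of the idea, not Jager & Leek’s actual code (I believe their method is an EM algorithm that also models p-values reported only as inequalities or rounded values):

    # Sketch of the truncated mixture described above: p-values below 0.05 are
    # modeled as a mix of a uniform component (null effects) and a beta component
    # (non-null effects), both truncated to (0, 0.05). Reconstruction, not the
    # authors' code.
    negloglik <- function(theta, pvals, alpha = 0.05) {
      pi0 <- plogis(theta[1])   # weight of the null (uniform) component
      a   <- exp(theta[2])      # beta shape parameters, kept positive
      b   <- exp(theta[3])
      num   <- pi0 * dunif(pvals) + (1 - pi0) * dbeta(pvals, a, b)
      denom <- pi0 * alpha + (1 - pi0) * pbeta(alpha, a, b)  # truncation at alpha
      -sum(log(num / denom))
    }

    fit_fdr <- function(pvals, alpha = 0.05) {
      pvals <- pvals[pvals < alpha]
      fit <- optim(c(0, 0, 0), function(theta) negloglik(theta, pvals, alpha))
      pi0 <- plogis(fit$par[1]); a <- exp(fit$par[2]); b <- exp(fit$par[3])
      # estimated share of the significant results coming from true nulls:
      pi0 * alpha / (pi0 * alpha + (1 - pi0) * pbeta(alpha, a, b))
    }

    # toy data: 80% null effects (uniform p-values), 20% real effects piled up near 0
    set.seed(1)
    fit_fdr(c(runif(8000), rbeta(2000, 0.5, 25)))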
One of the authors replies in the comments:
That being said, our paper is a direct response to the original work, which defined “correct” and “incorrect” in the medical literature by the truth of the null hypothesis. We totally agree that that is a very debatable definition of correct. However, we felt it was important to point out that when using that definition you can actually estimate the rate of false discoveries with principled methods. These methods are well justified in the statistical literature and we took pains to point out our assumptions in both the paper and the supplemental material. Whether you agree with those assumptions is, of course, a totally reasonable thing to talk about.
OK, so now we need a meta-analysis of these meta-analyses...