I don’t think it works in the sense of refuting the earlier results by Ioannidis etc.
Remember that much of that previous work is based on looking at replication rates and changes as sample sizes increase—so actually empirical in a meaningful way.
This simply aggregates all p-values, takes them at face value, and tries to infer what the false positive rate ‘should’ be. It doesn’t seem to account in any way for the many systematic errors, biases, or other problems in the process, and it only covers false positives, not false negatives (so it ignores statistical power, which is a serious problem in psychology anyway, although I think medical trials are better powered).
I’d take their estimate of a 17% false positive rate as a lower bound.
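As a reminder of why the power point matters, here’s an Ioannidis-style back-of-the-envelope in R; the priors and power levels are made-up numbers of my own, with no bias term, not anything from the paper:

    # Ioannidis-style back-of-the-envelope (my numbers, no bias term): among
    # 'significant' results at alpha = 0.05, the false-positive share is
    # alpha*(1-prior) / (alpha*(1-prior) + power*prior).
    false_positive_share <- function(prior, power, alpha = 0.05) {
      alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)
    }
    false_positive_share(prior = 0.5,  power = 0.8)   # ~0.06: well-powered, plausible hypotheses
    false_positive_share(prior = 0.5,  power = 0.35)  # ~0.13: underpowered
    false_positive_share(prior = 0.25, power = 0.35)  # ~0.30: underpowered, longer-shot hypotheses

The point being that once power drops or the hypotheses being tested get more speculative, the true share of false positives climbs well past what face-value p-values suggest.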
I also question some other aspects; for example, they dismiss the idea that the false positive rate is increasing because the trend only reaches p=0.18; but if you look at pg. 11, every journal sees a net increase in false positive rates from the beginning of their sample to the end, although there’s enough variation that the beginning/end difference doesn’t reach p=0.05. So there is a clear trend here, and I have to wonder: if they had looked at more than 5 journals over a decade, would the extra data push it to significance? (A 0.5% increase each year is very troubling, since it implies very bad things for the long term.)
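To put a rough number on ‘very bad things’: a crude linear extrapolation in R, taking their 17% estimate and the ~0.5%/year slope at face value (the linearity is my assumption, not theirs):

    # Crude linear projection: start at 17% and add 0.5 percentage points per year.
    start_rate      <- 0.17
    annual_increase <- 0.005
    years           <- c(10, 20, 50)
    start_rate + annual_increase * years
    # 0.22 0.27 0.42  -- i.e. more than 1 in 5 after a decade, over 2 in 5 after 50 years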
I liked their data collection strategy, though; scraping—not just for hackers!
We wrote a computer program in the R statistical programming language (http://www.R-project.org/) to collect the abstracts of all papers published in The Lancet, The Journal of the American Medical Association (JAMA), The New England Journal of Medicine (NEJM), The British Medical Journal (BMJ), and The American Journal of Epidemiology (AJE) between 2000 and 2010. Our program then parsed the text of these abstracts to identify all instances of the phrases “P =”, “P <”, “P ≤”, allowing for a space or no space between “P” and the comparison symbols. Our program then extracted both the comparison symbol and the numeric symbol following the comparison symbol. We scraped all reported P-values in abstracts, independent of study type. The P-values were scraped from http://www.ncbi.nlm.nih.gov/pubmed/ on January 24, 2012. A few manual changes were performed to correct errors in the reported P-values due to variations in the reporting of scientific notation as detailed in the R code. To validate our procedure, we selected a random sample (using the random number generator in R) of abstracts and compared our collected P-values to the observed P-values manually. The exact R code used for scraping and sampling and the validated abstracts are available in the Supplemental Material.
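Their exact R code is in the Supplemental Material; just to illustrate the kind of extraction they describe, here is a minimal sketch of my own (the regex and the example sentence are mine, not theirs):

    # Minimal sketch of the extraction step: find "P =", "P <", "P ≤" (with or
    # without a space) followed by a number, and return the symbol and the value.
    extract_pvalues <- function(abstract) {
      pattern <- "P ?([=<\u2264]) ?([0-9]*\\.?[0-9]+([eE]-?[0-9]+)?)"
      hits <- regmatches(abstract, gregexpr(pattern, abstract))[[1]]
      if (length(hits) == 0) return(data.frame(symbol = character(0), p = numeric(0)))
      data.frame(symbol = sub(pattern, "\\1", hits),
                 p      = as.numeric(sub(pattern, "\\2", hits)))
    }
    # Made-up example sentence, not a real abstract:
    extract_pvalues("Mortality was lower in the treatment arm (P = 0.03); readmission was also reduced (P< 0.001).")
    # returns two rows: symbol "=" with p 0.03, and symbol "<" with p 0.001

Per the quoted paragraph, their real pipeline also had to patch up scientific-notation quirks by hand, which a bare regex like this wouldn’t catch.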
Yep, I agree. This is definitely an (optimistic) lower limit. Good that these studies are gaining attention, though a systemic change would be needed to get us out of this.