Using degrees of freedom to change the past for fun and profit

Follow-up to: Follow-up on ESP study: “We don’t publish replications”, Feed the Spinoff Heuristic!

Using the same method as in Study 1, we asked 20 University of Pennsylvania undergraduates to listen to either “When I’m Sixty-Four” by The Beatles or “Kalimba.” Then, in an ostensibly unrelated task, they indicated their birth date (mm/dd/yyyy) and their father’s age. We used father’s age to control for variation in baseline age across participants. An ANCOVA revealed the predicted effect: According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to “Kalimba” (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040

That’s from “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” which runs simulations of a version of Shalizi’s “neutral model of inquiry,” with random (null) experimental results, augmented with a handful of choices in the setup and analysis of an experiment. Even before accounting for publication bias, these few choices produced a desired result “significant at the 5% level” 60.7% of the time, and at the 1% level 21.5% at the time.

I found it because of another paper claiming time-defying effects, during a search through all of the papers on Google Scholar citing Daryl Bem’s precognition paper, which I discussed in a past post about the problems of publication bias and selection over the course of a study. For Bem, Richard Wiseman established a registry for the methods, and tests of the registered studies could be set prior to seeing the data (in addition to avoiding the file drawer).

Now a number of purported replications have been completed, with several available as preprints online, including a large “straight replication” carefully following the methods in Bem’s paper, with some interesting findings discussed below. The picture does not look good for psi, and is a good reminder of the sheer cumulative power of applying a biased filter to many small choices.

Background

When Bem’s article was published the skeptic David Alcock argued that Bem’s experiments involved midstream changes of methods, choices in the transformation of data (raw data was not available), and other signs of modifying the experiment and analysis in response to the data. Wagenmakers et al drew attention to writing by Bem advising young psychologists to take experiments that failed to show predicted effects and relentlessly explore the data in hopes of generating an attractive and significant effect. In my post, I emphasized the importance of “straight replications,” with methodology, analytical tests, and intent to publish established in advance, as in Richard Wiseman’s registry of studies.

An article by Gregory Francis uses a standard test for publication bias on Bem’s article: comparing the number of findings reaching significance to the number predicted by the power of the study to detect the claimed effect. 9 of the 10 experiments mentioned described in Bem’s article¹ find positive effects using Bem’s measures and tests, but those 9 were all statistically significant despite the small size of the effects. Francis calculates a 5.8% probability of so many reaching significance by chance (given the estimated effect power and effect size).

Other complaints included declining effect size with sample size (driven mostly by one larger experiment), the use of one-tailed tests (Bem justified this as following an early hypothesis, but claims of “psi-missing” due to boredom or repelling stimuli are found in the literature and could have been mustered), and the failure to directly replicate a single experiment or concentrate subjects.

Subsequent replications

At the time of my first post, I was able to find several replication attempts already online. Richard Wiseman and his coauthors had not found psi, and were refused consideration for publication at the journal which had hosted the original article. Galak and Nelson had tried and failed to replicate experiment 8. A a pro-psi researcher had pulled a different 2006 experiment from the file drawer and retitled as a purported “replication” of the 2011 paper. Samuel Moulton, who previously worked with Bem, writes that he tried to replicate Bem with 200 subjects and found no effect (not just not a significant effect, but a significantly lower effect), but that Bem would not mention this in the 2011 publication. Bem confirms this in a video of a Harvard debate.

Since then, there have been more replications. This New Scientist article claims to have found 7 replications of Bem, with six failures and one success. The success is said to be by a researcher who has previously studied the effect of “geomagnetic pulsations” on ESP, but I could not locate it online.

Snodgrass (2011) failed to replicate Bem using a version of the Galak and Nelson experiment. Wagenmaker et al posted their methods in advance, but have not yet posted their results, although news media have reported that they also got a negative result Bem. Wiseman and his coauthors posted their abstract online, and claim to have performed a close replication of one of Bem’s experiments with three times the subjects, finding no effect (despite 99%+ power to detect Bem’s claimed effect). Another paper, “Correcting the Past: Failures to Replicate Psi,” by Galak, LeBoeuf, Nelson, and Simmons, combines 6 experiments by the researchers (who are at four separate universities) with 820 subjects and finds no effect in a very straight replication. More on it in a moment.

I also found the abstracts of the 2011 Towards a Science of Consciousness conference. On page 166 Whitmarsh and Bierman claim to have conducted a replication of a Bem experiment involving meditators, but do not give their results, although it appears they may have looked for effects of meditation on the results. On page 176, there is an abstract from Franklin and Schooler, claiming success in a new and different precognition experiment, as well as predicting the outcome of a roulette wheel (n=204, hit rate 57%, p<.05). In the New Scientist article they claim to have replicated their experiment (with much reduced effect size and just barely above the 0.05 significance level), although past efforts to use psi in casino games have not been repeatable (nor have the experimenters become mysteriously wealthy, or easily able to fund their research, apparently). The move to a new and ill-described format prevents it from being used as a straight replication (in Shalizi’s neutral model of inquiry using only publication bias, it is the move to new effects lets a field sustain itself in the absence of a subject matter), it was not registered, and the actual study is not available, so I will leave it be until publication.

Correcting the Past: Failures to Replicate Psi

Throughout this paper the researchers try to specify their procedures unambigously and as closely aligned with Bem as they can, for instance in transforming the data² so as to avoid cherry-picking in the fashion they argue:

Results

To test for the presence of precognition, Bem (2011) computed a weighted differential recall score (DR) for each participant using the formula:

DR = (Recalled Practiced Words—Recalled Control Words) ×

(Recalled Practice Words + Recalled Control Words)

In the paper, for descriptive purposes, Bem frequently reports this number as DR%, which is the percentage that a participant’s score deviated from random chance towards the highest or lowest scores possible (-576 to 576). We conducted the identical analysis on our data and also report DR% (see Table 1). In addition to using the weighted differential recall score, we also report the results using a simple unweighted recall score, which is the difference between recalled practice words and recalled control words (see Appendix B). For both of these measures, random chance would lead to a score of 0, and analysis was conducted using a one-sample t-test.

This prevents them from choosing the more favorable (or less favorable) of several transformations, as they seem to suggest Bem did in the next quote, bumping a result to significance in the original paper. This is a recurrent problem across many fields, and a reason to seek out raw data whenever possible, or datasets collected by neutral parties (on your question of interest):

Still, even in Experiments 8 and 9, it is unclear how Bem could find significant support for a hypothesis that appears to be untrue. Elsewhere, critics of Bem have implicated his use of a one-tailed statistical test (Wagenmakers et al. 2011), testing multiple comparisons without correction (Wagenmakers et al. 2011), or perhaps simply a lurking file drawer with some less successful pilot experiments. All of these concerns fall under a larger category of researcher degrees of freedom, which raise the likelihood of falsely rejecting the null hypothesis (Simmons et al., 2011). Some of these can be easily justifiable and have small and seemingly inconsequential effects. For example, Bem analyzes participant recall using an algorithm which weights the total number of correctly recalled words (i.e., DR%). He could, presumably, have just as easily analyzed simple difference scores and found a similar, but not quite identical, result (indeed, re-analyzing the data from Bem (2011) Experiment 8 with a simple difference score yields no Psi effects (M = .49, t(99) = 1.48, p = .14), though it does for Experiment 9 (M =.96; t(49) = 2.46, p = .02)).

They mention others which they did not have data to test:

The scoring distinction is just a single example, but even for Bem’s simple procedure there are many others. For example, Bem’s words are evenly split between common and uncommon words, a difference that was not analyzed (or reported) in the original paper, but may reflect an alternative way to consider the data (perhaps psi only persists for uncommon words? Perhaps only for common words?). He reports the results of his two-item sensation seeking measure, but he does not analyze (or report collecting) additional measures of participant anxiety or experimenter-judged participant enthusiasm. Presumably these were collected because there was a possibility that they may be influential as well, but when analysis revealed that they were not, that analysis was dropped from the paper.

Other elements providing degrees of freedom were left out of the Bem paper. A published paper can only provide so much confidence that it actually describes the experiment as it happened (or didn’t!):

Despite our best efforts to conduct identical replications of Bem’s Experiments 8 and 9, it is possible that the detection of psi requires certain methodological idiosyncrasies that we failed to incorporate into our experiments. For instance, after reading the replication packet (personal communication with Bem, November, 1 2010) provided by Bem, we noticed that there were at least three differences between our experiments (which mirrored the exact procedure described in Bem’s published paper) and the procedure actually employed by Bem...the experimenter was required to have a conversation with each participant in order to relax him or her...participants were asked two questions in addition to the sensation seeking scale...the set of words used by Bem were divided into common and uncommon words, something that we did not do in our Experiments 1 and 2.

The experiments, with several times the collective sample size of the Bem experiments (8 and 9) they replicate, look like chance:

Main Results

Table 1 presents the results of our six experiments as well as the results from Bem’s (2011) Experiments 8 and 9, for comparison. Bem found DR% = 2.27% in Experiment 8 and 4.21% in Experiment 9, effects that were significant at p = .03 and p = .002, one-tailed.

In contrast, none of our six experiments showed a significant effect suggesting precognition.

In Experiment 1, DR% = −1.21%, t(111) = −1.201, p = .23 (all p-values in this paper are two-tailed). Bayesian t-tests (advocated by Wagenmakers et al., 2011) suggest that this is “substantial” support for the null hypothesis of no precognition.

In Experiment 2, DR% = 0.00%, t(157) = .00, p = .99. Bayesian t-tests suggest that this is “strong” support for the null hypothesis.

In Experiment 3, DR% = 1.17%, t(123) = 1.28, p = .20. Although DR% was indeed above zero, in the direction predicted by the ESP hypothesis, the test statistic did not reach conventional levels of significance, and Bayesian t-tests suggest that this is nevertheless “substantial” support for the null hypothesis.

In Experiment 4, DR% = 1.59%, t(108) = 1.77, p = .08. Again, although DR% was above zero, the test statistic did not reach conventional levels of significance, and Bayesian t-tests still suggest that this is “substantial” support for the null hypothesis.

In Experiment 5, which contained our largest sample of participants, DR% = -.49%, t(210) = -.71, p = .48. Bayesian t-tests suggest that this is “strong” support for the null hypothesis.

Finally, in Experiment 6’s Test-Before-Practice condition, DR% = -.29%, t(105) = -.33, p = .74. Bayesian t-tests suggest that this is “strong” support for the null hypothesis.

In sum, in four of our experiments, participants recalled more control words than practice words (Experiments 1, 2, 5, and 6) and in two of our experiments, participants recalled more practice words than control words (Experiments 3 and 4). None of these effects were statistically reliable using conventional t-tests (see Table 1). As noted, Bayesian t-tests suggest that even the two findings that were directionally consistent with precognition show substantial support for the null hypothesis of no precognition.

Perhaps the reported positive replication will hold up to scrutiny (with respect to sample size, power, closeness of replication, data mining, etc), or some other straight replication will come out convincingly positive (in light of the aggregate evidence). I doubt it.

Psi and science

Beating up on parapsychology may be cheap and easy in the scientific, skeptical, and Less Wrong communities, a low-status outgroup belief. But the abuse of many degrees of freedom, and shortage of close replication, is widespread in science and particularly in psychology. The heuristics and biases literature, studies of cognitive enhancement, social psychology and other areas often used in Less Wrong are not so different. This suggests a candidate hack to fight confirmation bias in assessing the evidentiary value of experiments that confirm one’s views: ask yourself how much evidentiary weight (in log odds) you would place on the same methods and results showing a novel psi effect?³

Notes

¹ In addition to the nine numbered experiments, there is a footnote referring to a small early tenth study which did not find an effect in Bem 2011.

²One of the bigger differences is that some of the experiments were online rather than in the lab, but this didn’t seem to matter much. They also switched from blind human coding of misspelled words to computerized coding.

³ This heuristic has not been tested, beyond the general (psychology!) results suggesting that arguing for a position opposite your own can help to see otherwise selectively missed considerations.

ETA: This blog post also discusses the signs of optional stopping, multiple hypothesis testing, use of one-tailed tests where a negative result could also have been reported as due to psi, etc.

ETA2: A post at the Bare Normality blog tracks down earlier presentation of some of the experiments going into Bem (2011), back in 2003, and notes that the data seem to bee selectively ported to the 2011 paper, described quite differently, and discusses other signs of unreported experiments. The post also expresses concern about reconciling these data with Bem’s explicit denial of optional stopping, selective reporting, and similar.

ETA3: Bem’s paper cites an experiment by Savva as evidence for precognition (by arachnophobes), but leaves out the fact that Savva’s follow-up experiments failed to replicate the effect. Links and references are provided in a post at the James Randi forums. Savva also says that Bem had “extracted” several supposedly significant precognition correlations from Savva’s data, and upon checking Savva found they were generated by calculation errors. Bem also is said to have claimed Savva’s first result had passed the 0.05 significance test, when it was actually just short of doing so (0.051, not a substantial difference, and perhaps defensible, but another sign of bias).