Follow-up on ESP study: “We don’t publish replications”
Related to: Parapsychology: the control group for science, Dealing with the high quantity of scientific error in medicine
Some of you may remember past Less Wrong discussion of the Daryl Bem study, which claimed to show precognition and was published, with much controversy, in a top psychology journal, JPSP. The editors and reviewers explained their decision by saying that the paper was clearly written and used standard experimental and statistical methods, so that their disbelief in it (driven by physics, the past failure to demonstrate psi, etc.) was not appropriate grounds for rejection.
Because of all the attention received by the paper (unlike similar claims published in parapsychology journals) it elicited a fair amount of both critical review and attempted replication. Critics pointed out that the hypotheses were selected and switched around ‘on the fly’ during Bem’s experiments, with the effect sizes declining with sample size (a strong signal of data mining). More importantly, Richard Wiseman established a registry for advance announcement of new Bem replication attempts.
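(A small illustrative simulation, not from the original post and assuming nothing beyond numpy and made-up parameters, shows why effect sizes that shrink as samples grow are a red flag: if only the analyses that happen to reach significance are reported, noisy small samples yield the largest ‘significant’ effects even when the true effect is zero.)

```python
# Hypothetical sketch: selective reporting of a null effect makes the
# reported effect size shrink as sample size grows.
import numpy as np

rng = np.random.default_rng(0)

def mean_reported_effect(n, trials=5000, z_crit=1.96):
    """Average |effect| among 'significant' studies when the true effect is 0."""
    samples = rng.normal(0.0, 1.0, size=(trials, n))    # each row is one study
    means = samples.mean(axis=1)
    ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
    significant = np.abs(means / ses) > z_crit           # the results that get reported
    return np.abs(means[significant]).mean()

for n in (10, 50, 200, 1000):
    print(f"n = {n:4d}: mean reported |effect| ~ {mean_reported_effect(n):.3f}")
# The reported effect falls roughly like 1/sqrt(n) -- the declining-effect-size
# pattern critics flagged in Bem's data.
```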
A replication registry guards against publication bias, and at least 5 attempts were registered. As far as I can tell, at the time of this post the subsequent replications have, unsurprisingly, failed to replicate Bem’s results.[1] However, JPSP and the other high-end psychology journals refused to publish the results, citing standing policies of not publishing straight replications.
From the journals’ point of view, this (common) policy makes sense: bold new claims tend to be cited more and raise journal status (which depends on citations per article), even though this means most of the ‘discoveries’ they publish will be false despite their p-values. However, it also means that, overall, the journals give scientists career incentives to massage and mine their data for bogus results, but none to challenge bogus results by others. Alas.
[1] A purported “successful replication” by a pro-psi researcher in Vienna turns out to be nothing of the kind. Rather, it is a study conducted in 2006 and retitled to take advantage of the attention on Bem’s article, selectively pulled from the file drawer.
ETA: The Wikipedia article on Daryl Bem makes an unsourced claim that one of the registered studies has replicated Bem.
ETA2: Samuel Moulton, who formerly worked with Bem, mentions an unpublished (no further details) failed replication of Bem’s results conducted before Bem submitted his article (the failed replication was not mentioned in the article).
ETA3: There is mention of a variety of attempted replications at this blog post, with 6 failed replications and 1 successful replication from a pro-psi researcher (not available online). It is based on this ($) New Scientist article.
ETA4: This large study performs an almost straight replication of Bem (same methods, same statistical tests, etc) and finds the effect vanishes.
ETA5: Apparently, the mentioned replication was again submitted to the British Journal of Psychology:
When we submitted it to the British Journal of Psychology, it was finally sent for peer review. One referee was very positive about it but the second had reservations and the editor rejected the paper. We were pretty sure that the second referee was, in fact, none other than Daryl Bem himself, a suspicion that the good professor kindly confirmed for us. It struck us that he might possibly have a conflict of interest with respect to our submission. Furthermore, we did not agree with the criticisms and suggested that a third referee be brought in to adjudicate. The editor rejected our appeal.
I’m at a loss for words at the inanity of this policy.
It is a policy that doesn’t just exist in psychology. Some journals in other fields have similar policies requiring that the work include something more than just a replication of the study in question, but my impression is that this is much more common in less rigorous areas like psychology. Journals probably do this because they want to be considered cutting edge, and they get less of that if they publish replication attempts. Given that, it makes some sense to reject both successful and unsuccessful replications, since accepting only one kind (say, only failed replications) would itself create a publication bias. So they more or less successfully fob the whole thing off on other journals. (There’s something like an n-player prisoner’s dilemma here, with journals as the players deciding whether to accept replications in general.) So this is bad, but it is understandable when one remembers that journals are driven by selfish, status-driven humans, just like everything else in the world.
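(A toy payoff model, purely my own illustration with made-up numbers, makes the prisoner’s-dilemma structure concrete: refusing replications is individually best for each journal whatever the others do, yet every journal refusing leaves all of them worse off than if all accepted.)

```python
# Toy n-player prisoner's dilemma for journals; all numbers are illustrative.
N_JOURNALS = 10
CITATION_COST = 2.0   # hit to a journal's own citations-per-article if it accepts replications
SHARED_BENEFIT = 0.5  # credibility gain to every journal per journal that accepts them

def payoff(accepts: bool, n_other_acceptors: int) -> float:
    """Status payoff to one journal given its choice and the others' choices."""
    total_acceptors = n_other_acceptors + (1 if accepts else 0)
    return SHARED_BENEFIT * total_acceptors - (CITATION_COST if accepts else 0.0)

for others in (0, N_JOURNALS - 1):
    print(f"{others} other journals accept replications: "
          f"refuse -> {payoff(False, others):4.1f}, accept -> {payoff(True, others):4.1f}")
# Refusing dominates for each journal (accepting costs 2.0 but adds only 0.5 to its
# own payoff), yet all-accept gives each journal 3.0 while all-refuse gives 0.0.
```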
Yes, this is a standard incentives problem. But one to keep in mind when parsing the literature.
What rules of thumb do you use to ‘keep this in mind’? I generally try to never put anything in my brain that just has one or two studies behind it. I’ve been thinking of that more as ‘it’s easy to make a mistake in a study’ and ‘maybe this author has some bias that I am unaware of’, but perhaps this cuts in the opposite direction.
Actually, even with many studies and a meta-analysis, you can still get blindsided by publication bias. There are plenty of psi meta-analyses showing positive effects (with studies that were not pre-registered, and are probably very selected), and many more in medicine and elsewhere.
If it’s something I trust an idiot to reach the right conclusion on given good data, I’ll look for meta-analyses, p << 0.05, or do a quick-and-dirty meta-analysis myself if the number of studies is small enough (a minimal sketch of that kind of pooling follows this comment). If it’s something I’m surprised has even been tested, I’ll give one study more weight; if it’s something I’d expect to be tested a lot, I’ll give it less. If the data I’m looking for is orthogonal to the data it’s being published for, it probably doesn’t suffer from selection bias, so I’ll take it at face value. If the study’s result is ‘convenient’ in some way for the source that showed it to me, I’ll be more skeptical of selection bias and misinterpretation.
If it’s a topic with easy-to-make methodological flaws or interpretation errors, I’ll try to actually dig in, look for them, and see whether there’s a new, obvious set of conclusions to draw.
Separately from determining how strong the evidence is, I’ll ‘put it in my brain’ on the strength of only a study or two if it tests a hypothesis I already suspected was true, or if it makes too much sense in hindsight (i.e. high priors); otherwise I’ll store it with a ‘probably untrue but something to watch out for’ tag.
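(For what it’s worth, the ‘quick and dirty meta-analysis’ mentioned above can be as crude as an inverse-variance-weighted average of the reported effects. A minimal sketch, with hypothetical study numbers and the standard fixed-effect formula, ignoring heterogeneity and, crucially, the publication-bias caveat raised above:)

```python
# Quick-and-dirty fixed-effect meta-analysis: inverse-variance-weighted mean.
# The study numbers are made up; this ignores heterogeneity and publication bias.
from math import sqrt

def fixed_effect(estimates, std_errors):
    """Pooled effect and standard error under a fixed-effect model."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, sqrt(1.0 / sum(weights))

estimates = [0.30, 0.10, -0.05]   # three hypothetical small studies
std_errors = [0.15, 0.12, 0.20]
pooled, se = fixed_effect(estimates, std_errors)
print(f"pooled effect = {pooled:.3f}, 95% CI = ({pooled - 1.96*se:.3f}, {pooled + 1.96*se:.3f})")
# If the underlying studies were selected for significance, the pooled
# estimate simply inherits that bias.
```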
How much money do you think it would take to give replications a journal with status on par with the new-studies-only ones?
Or alternately, how much advocacy of what sort? Is there someone in particular to convince?
It’s not something you can simply buy with money. It’s about getting scientists to cite papers in the replications journal.
What about influencing high-status actors (e.g. prominent universities)? I don’t know what the main influence points are for an academic journal, and I don’t know what things it’s considered acceptable for a university to accept money for, but it seems common to endow a professorship or a (quasi-academic) program.
Probably this method would cost many millions of dollars, but it would be interesting to know the order of magnitude required.
We simply do not have a scientific process any more.
This is both unfair to scientists and inaccurate. In 2011 alone we’ve had such novel scientific discoveries as snails that can survive being eaten by birds, we’ve estimated the body temperature of dinosaurs, we’ve captured the most detailed picture ever taken of a dying star, and we’ve made small but significant progress toward resolving P ?= NP. These are but a few of the highlights that happened both to be in my recent memory and to be easy to find links for. I’ve also not included anything that could be argued to be engineering rather than science. There are many achievements just like this.
Why might it seem like we don’t have a scientific process?
First, there’s simple nostalgia. As I write this, the space shuttle is on its very last mission. I suspect that almost everyone here either longs for the days of their youth when humans walked on the moon, or wishes they had lived then to witness it. Thus, normal human nostalgia gets wrapped up with some actual problems of stagnation and lack of funding, creating a potential halo effect for the past.
Second, as the number of scientists increases over time, the number of scientists who are putting out poor science will increase. Similarly, the amount of stuff that gets through peer review even when it shouldn’t will increase as the number of journals and the number of papers submitted goes up. So the amount of bad science will go up.
Third, the internet and similar modern communication technologies let us find out about so-called bad science much faster than we otherwise would. Much of it would once have been buried in obscure journals; instead we have bloggers commenting and respected scientists responding. So as time goes on, even if the amount of bad science stays constant, the perception will be of an increase.
I would go so far as to venture that we might have a more robust and widespread scientific process than at any other time in history. To put the Bem study in perspective, keep in mind that a hundred years ago psychology wasn’t even trying to use statistical methods; look at how Freud’s and Jung’s ideas were received. Areas like sociology and psychology have, if anything, become more scientific over time. From that standpoint, a paper that uses statistics in a flawed fashion is itself a mark of how much progress the soft sciences have made toward being real sciences: one now needs bad statistics, rather than mere anecdote, to get bad ideas through.
To paraphrase someone speaking on a completely different issue, the arc of history is long, but it bends towards science.
That’s not really true. Experimental, quantitative, and even fairly advanced statistical methods were definitely used in psychology a century ago. (As a notable milestone, Spearman’s factor analysis that started the still ongoing controversy over the general factor of intelligence was published in 1904.) My impression is that ever since Wilhelm Wundt’s pioneering experimental work that first separated psychology from philosophy in the late 19th century, psychology has been divided between quantitative work based on experiment and observation, which makes at least some pretense of real science, and quack soft stuff that’s usually presented in a medical or ideological context (or some combination thereof). Major outbursts of the latter have happened fairly recently—remember the awful “recovered memories” trend in the 1980s and 1990s (and somewhat even in the 2000s) and its consequences.
But more importantly, I’m not at all sure that the mathematization of soft fields has made them more scientific. One could argue that the contemporary standards for using statistics in soft fields only streamline the production of plausible-looking nonsense. Even worse, sometimes mathematization leads to pseudoscience that has no more connection to reality than mere verbal speculations and sophistries, but looks so impressive and learned that a common-sense criticism can be effectively met with scorn and stonewalling. As the clearest example, it appears evident that macroeconomics is almost complete quackery despite all the abstruse statistics and math used in it, and I see no evidence that the situation in other wannabe-exact soft fields is much better. Or to take another example, at one point I got intensely interested in IQ-related controversies and read a large amount of academic literature in the area—eventually finding that the standards of statistics (and quantitative reasoning in general) on all sides in the controversy are just depressingly bad, often hiding awful lapses of reasoning that would be unimaginable in a real hard science behind a veneer of seeming rigor.
(And ultimately, I notice that your examples of recent discoveries are from biology, astronomy/physics, and math—fields whose basic soundness has never been in doubt. But what non-trivial, correct, and useful insight has come from all these mathematized soft fields?)
This is a very good point. You make a compelling case that the use of careful statistics is not a recent trend in psychology. In that regard, my penultimate paragraph is clearly just deeply and irrecoverably wrong.
Well, I was responding to Eliezer’s claim about a general lack of a scientific process. So the specific question becomes whether one can give examples of “non-trivial, correct, and useful” psychological results from the last year or so. There’s a steady output of decent psychology results. While the early work on cognitive biases was done by Kahneman and Tversky in the 1970s and 1980s, a lot of work has occurred in the decades since. But I agree that the rate of output is slow enough that I can’t point off the top of my head to easy, impressive studies from the last few months, the way I can for other areas of research. Sharon Bertsch and Bryan Pesta’s investigation of different explanations for the negative correlation between IQ and religion came out in 2009 and 2010, which isn’t even this year.
However, at the same time, I’m not sure that this is a strike against psychology. Psychology has a comparatively small field of study. Astronomy gets to investigate most of the universe. Math gets to investigate every interesting axiomatic system one can imagine. Biology gets to investigate millions of species. Psychology just gets to investigate one species, and only certain aspects of that species. When psychology does investigate other intelligent species it is often categorized as belonging to other areas. So we shouldn’t be that surprised if psychology doesn’t have as high a production rate. On the other hand, this argument isn’t very good because one could make up for it by lumping all the classical soft sciences together into one area, and one would still have this problem. So overall, your point seems valid in regards to psychology.
I didn’t have in mind just psychology; I was responding to your comment about soft and wannabe-hard fields in general. In particular, this struck me as unwarranted optimism:
That is true if these sciences are nowadays overwhelmingly based on sound math and statistics, and these bad stats papers are just occasional exceptions. The pessimistic scenario I have in mind is the emergence of bogus fields in which bad formalism is the standard—i.e., in which verbal bad reasoning of the sort seen in, say, old-school Freudianism is replaced by standardized templates of bad formalism. (These are most often, but not always, in the form of bad statistics.)
This, in my opinion, results in an even worse situation. Instead of bad verbal reasoning, which can be criticized convincingly in a straightforward way, as an outside critic you’re now faced with an abstruse bad formalism. This not only makes it more difficult to spot the holes in the logic, but even if you identify them correctly, the “experts” can sneer at you and dismiss you as a crackpot, which will sound convincing to people who haven’t taken the trouble to work through the bad formalism themselves.
Unless you believe that such bogus fields don’t exist (and I think many examples are fairly obvious), they are clear counterexamples to your above remark. Their “mathematization” has resulted in bullshit being produced in even greater quantities, and shielded against criticism far more strongly than if they were still limited to verbal sophistry.
Another important point, which I think you’re missing, concerns your comment about problematic fields having a relatively small, and arguably less important scope relative to the (mostly) healthy hard fields. The trouble is, the output of some of the most problematic fields is used to direct the decisions and actions of the government and other powerful institutions. From miscarriages of justice due to pseudoscience used in courts to catastrophic economic crises, all kinds of calamities can directly follow from this.
No substantial disagreement with most of your comment. I will just note that most of your points (which do show that I was being overly optimistic) don’t as a whole substantially undermine the basic point being made about Eliezer’s claim.
I think your point about small fields being able to do damage is an interesting one (and one I’ve never seen before) and raises all sorts of issues that I’ll need to think about.
(...)
Have these results been replicated? Are you sure they’re correct? Merely citing cool-looking results isn’t evidence that the scientific process is working.
Remember, “the scientific process not working” doesn’t look like “cool results stop showing up”; it looks like “cool results keep showing up, except they no longer correspond to reality”. If you have no independent way of verifying the results in question, it’s hard to tell the two scenarios apart.
Bertsch and Pesta’s work has been replicated. The dinosaur temperature estimate is close to estimates made by other techniques—the main interesting thing here is that this is a direct estimate made using the fossil remains rather than working off of metabolic knowledge, body size, and the like. So the dinosaur temperature estimate is in some sense the replication by another technique of strongly suspected results. The snail result is very new; I’m not aware of anything that replicates it.
Hearing one probably bad thing and deciding we fell from grace and should shake our heads in bitter nostalgia? That’s what the villains do. We, the ingroup of Truth and Right and Apple Pie, are dutifully skeptical of claims of increasing stupidity. You taught me that.
Even if true that wouldn’t be simple.
What do you mean by that?
To-the-point responses = encouraging to me that you’re busy :)
It’s clear that the incentives for journals are terrible, and we should be looking to fix this. We seem to have a Goodhart’s Law problem, where credibility is measured in citations, but refutations count in the wrong direction. Right now there are a bunch of web sites that collect abstracts and metadata about citations, but none of them include commenting, voting, or any sort of explicit reputation system. As a result, discussion of papers ends up on blogs like this one, where academics are unlikely ever to see it.
Suppose we make an abstracts-and-metadata archive, along the lines of CiteSeer, but with comments and voting. This would give credibility scores, similar to impact ratings, but also accounting for votes. The reputation system could be refined somewhat beyond that (track author credibility by field and use it to weight votes, collect metadata about what’s a replication or refutation, etc.)
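(A minimal sketch of the kind of scoring described above; the field names, weights, and penalty for failed replications are my own assumptions, not part of the proposal.)

```python
# Sketch of a credibility score that combines citations with credibility-weighted
# votes and lets refutations count in the right direction. All weights are assumptions.
from dataclasses import dataclass, field

@dataclass
class Vote:
    value: int                 # +1 or -1
    voter_credibility: float   # the voter's own score, perhaps tracked per field

@dataclass
class Paper:
    citations: int = 0
    successful_replications: int = 0
    failed_replications: int = 0
    votes: list = field(default_factory=list)

def credibility(paper: Paper) -> float:
    vote_score = sum(v.value * v.voter_credibility for v in paper.votes)
    return (paper.citations
            + 2.0 * paper.successful_replications
            - 5.0 * paper.failed_replications   # refutations subtract instead of adding
            + vote_score)

p = Paper(citations=40, failed_replications=3,
          votes=[Vote(+1, 0.5), Vote(-1, 2.0), Vote(-1, 1.5)])
print(credibility(p))  # 40 - 15 + (0.5 - 2.0 - 1.5) = 22.0
```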
Academics know, or at least ought to know, that most new publications are either wrong or completely uninteresting. This is the logical side effect of the publish-or-perish policy in academia and the growing number of PhD students worldwide. An estimated 1.346 million papers per year are published in journals alone [1]. If humanity produced interesting papers at that rate, scientific progress would go a lot quicker!
So if it’s true that most publications are uninteresting, and if it’s true that most academics have to publish at a high rate to protect their careers and send the right signals, then we don’t want to punish and humiliate academics for publishing stupid ideas or badly executed experiments. Publishing a paper that demonstrates the other party did a terrible job does exactly that. The signal-to-noise ratio in academic journals wouldn’t increase by much, but suddenly academics could reach their paper quota simply by picking other academics’ ideas apart. You’d get an even more poisonous environment as a result!
In our current academic environment (or at least my part of it) most papers without a large number of citations are ignored. A paper without any citations is generally considered such a bad source that it’s only one step up from Wikipedia. You can cite it, if you must, but you had better not base your research on it. So in practice I don’t think it’s a big deal that mistakes aren’t corrected and that academics typically aren’t expected to publicly admit they were wrong. It’s just not necessary.
Suppose the paper supposedly proves something that lots of people wish was true. Surely it is likely to get an immense number of citations.
For example, the paper supposedly proves that America always had strict gun control, or that the world is doomed unless government transfers trillions of dollars from group A to group B by restricting the usage of evil substance X, where group A tends to have rather few academics and group B tends to have rather a lot of academics.
Surely it’s better to have academics picking apart crap than producing crap.
Not necessarily. Ignoring crap may be a better strategy than picking it apart.
Cooperation is also easier when different groups in the same research area don’t try too hard to invalidate each other’s claims. If the problem in question is interesting you’re much better off writing your own paper on it with your own claims and results. You can dismiss the other paper with a single paragraph: “Contrary to the findings of I.C. Wiener in [2] we observe that...” and leave it at that.
The system is entirely broken but I don’t see an easy way to make it better.
If this were true, how would anyone ever get the first citation?
(Incidentally in my own field, there are a lot of papers that don’t get cited. It isn’t because the papers are wrong (although some very small fraction of them have that problem) but that they just aren’t interesting. But math is very different from most other fields.)
Some papers (those written by high status authors) are ones that everyone knows will get citations soon after they are published, and so they feel safe in citing them since others are soon to do so. Self-fulfilling prophecy.
Because the policy wasn’t applied until after a cutoff date, so the recursion bottoms out at an author from before the cutoff. Obviously. Edit: Non-obviously. Edit2: HOW AM I SUPPOSED TO END THIS COMMENT FOR YOU MEATBAGS NOT TO VOTE ME DOWN???
I don’t know, but I’m pretty sure that’s not it.
I think your comment is getting voted down because it doesn’t actually answer the issue in question. It does allow there to be a set of citable papers, but it doesn’t deal with the actual question which is how any given paper would ever get its first citation.
Yes, it does, because paper B, from after the cutoff, cites a cite-less paper A, from before the cutoff. Then a paper C can cite B (or A), as B cites a previous paper, and A is from a time for which the standard today is not applied. (Perhaps I wasn’t clear that the cutoff also applies to citable papers—papers from before the cutoff don’t themselves need citations in them to be citable.)
Edit: Also, papers from before the cutoff cited other prior papers.
It’s not citing but being cited, I think. So if A and B are both before the cutoff, and A cites B, then C from after the cutoff can cite B (but not necessarily A).
Personally I thought it was a good comment even before the edit.
Zed didn’t say you should never cite a previously uncited paper, only that you shouldn’t invest time and effort into work that depends on the assumption that its conclusions are sound. There are many possible reasons why you might nevertheless want to cite it, and perhaps even give it some lip service.
Especially if it’s your own.
Self-citations are usually counted separately (both for formal purposes and in informal assessments of this sort).
I’m confused. I parsed this as “papers which contain no citations are considered bad sources,” but it seems that everyone else is parsing it as “papers which have not been cited are considered bad sources.” Am I making a mistake here? The latter doesn’t make much sense to me, but Zed hasn’t stepped in to correct that interpretation.
Look at the context of the first two paragraphs and the comment that Zed was replying to. The discussion was about how many papers never get cited at all. In that context, he seems to be talking about people not citing papers unless they have already been cited.
It’s not clear to me that he was talking about studies being ignored because they’re not interesting enough to cite, rather than studies being ignored because they’re not trustworthy enough to cite.
In any case, I think both are dubious safety mechanisms. John Ioannidis found that even the most commonly cited studies in medical research are highly likely to be false. If researchers base their trust in studies on the rate at which they’re cited, they’re likely to be subject to information cascades, double-counting the information that led other researchers to cite the same study.
As for citing only papers that themselves contain numerous citations: this helps when the citations are redundant, demonstrating that the factual claims have been replicated; but if a paper relies on many distinct uncertain findings, its own uncertainty is multiplied. The conclusion is at most as strong as its weakest link.
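(A back-of-the-envelope illustration, with made-up probabilities: if a conclusion requires each of its k cited findings to hold, and they are independent, the reliabilities multiply, and the product can never exceed the weakest factor.)

```python
# Illustrative only: a conclusion that needs every one of k independent findings,
# each holding with probability 0.8, is only as reliable as their product.
p_each = 0.8
for k in (1, 3, 5, 10):
    print(f"{k:2d} findings -> conclusion holds with p ~ {p_each ** k:.2f}")
# 1 -> 0.80, 3 -> 0.51, 5 -> 0.33, 10 -> 0.11; with unequal probabilities the
# product is still bounded above by the weakest link.
```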
I think that you can cite it if you really think that it’s good (and perhaps this is what Zed meant by “if you must”), but you’d better be citing something more widely accepted too. Then if lots of people think that it’s really good, it will join the canon of widely cited papers.
Also, people outside of the publish-or-perish framework (respected elders, hobbyists outside of the research track, grad students in some fields) can get the ball rolling.
One idea to compensate for such effects: journals dedicated to negative results. Here is one I know of in the computer science field, with links to more.
I got an advertisement for the All Results Journals the other day.
The problem isn’t so much being published as being published in a journal with a decent impact factor. A journal for negative results will likely attract few citations and therefore have a low impact factor.
A better way for JPSP to have handled Bem’s paper would be to invite comments from reputable scholars before publishing it, and then print the comments (and replies from Bem et al.) in the same issue. Other journals have followed this strategy with particularly controversial articles.
As it stands now, JPSP (the premier social psych journal) just looks ridiculous.
I believe they did publish a rebuttal in the same issue, but that didn’t allow the time needed for replications.
As I cynically comment on the DNB ML: ‘Summary: Bem proves ESP using standard psychology methods; prestigious journal refuses, in a particularly rude fashion, to publish both a failure and a successful replication. You get what you pay for; what are you paying for here?’
This is short and linky but I suspect it belongs in Main for increased audience. Upvoted, as it should be.
Thanks for the reminder to upvote; it didn’t occur to me to do so, because this news (about refusal to publish replications) annoyed me, and upvoting is associated with positive affect. Oops!
A former professor and co-author of mine has a paper about publication bias in PLoS Medicine: http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0050201
He has a number of suggestions for fixing things, but the main thrust appears to be that in a digital world there is no longer any reason for journals to publish only papers that are “interesting” as well as methodologically decent, and so they have no excuse not to adopt a policy of publishing all papers that appear correct. He makes several other suggestions as well.