I sort of side with Mitchel on this.

A mentor of mine once told me that replication is useful, but not the most useful thing you could be doing because it’s often better to do a followup experiment that rests on the premises established by the initial experiment. If the first experiment was wrong, the second experiment will end up wrong too. Science should not go even slower than it already does—just update and move on, don’t obsess.
It’s kind of like how some of the landmark studies on priming failed to replicate, but there are so many followup studies which are explained by priming really well that it seems a bit silly to throw out the notion of priming just because of that.
Keep in mind, while you are unlikely to hit statistical significance where there is no real result, it’s not statistically unlikely to have a real result that doesn’t hit significance the next time you do it. Significance tests are tuned to produce false negatives more often than false positives.
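As a rough illustration of that asymmetry, here is a minimal simulation sketch; the per-group sample size (n = 20) and effect size (d = 0.4) are assumed purely for illustration, not taken from any particular study:

```python
# Minimal sketch: at alpha = 0.05 a null effect crosses the significance line
# about 5% of the time, while a modest real effect studied with a small sample
# crosses it only a minority of the time -- so a failed replication of a true
# result is far more likely than a spurious success on a false one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d, alpha, trials = 20, 0.4, 0.05, 10_000  # assumed, illustrative values

def significant_fraction(true_effect):
    hits = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_effect, 1.0, n)
        if stats.ttest_ind(control, treated).pvalue < alpha:
            hits += 1
    return hits / trials

print("null effect, fraction significant (false positive rate):", significant_fraction(0.0))
print("real effect d = 0.4, fraction significant (power):      ", significant_fraction(d))
```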
Emotionally though… when you get a positive result in breast cancer screening even when you’re not at risk, you don’t just shrug and say “probably a false positive” even though it is. Instead, you irrationally do more screenings and possibly get a needless operation. Similarly, when the experiment fails to replicate, people don’t shrug and say “probably a false negative”, even though that is, in fact, very likely. Instead, they start questioning the reputation of the experimenter. Understandably, this whole process is nerve-wracking for the original experimenter. Which I think is what Mitchel was—admittedly clumsily—groping towards with the talk of “impugning scientific integrity”.
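To make the “probably a false positive” arithmetic explicit, here is a minimal sketch using assumed round numbers in the style of the standard textbook screening example (not real clinical figures):

```python
# Sketch of the base-rate effect with assumed round numbers: even with a fairly
# accurate test, a positive result in a low-prevalence group is most likely a
# false positive.
prevalence = 0.01            # assumed: 1% of screened low-risk patients have cancer
sensitivity = 0.80           # assumed: P(positive | cancer)
false_positive_rate = 0.10   # assumed: P(positive | no cancer)

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_cancer_given_positive = sensitivity * prevalence / p_positive

print(f"P(cancer | positive screen) = {p_cancer_given_positive:.1%}")  # about 7.5%
```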
If the first experiment was wrong, the second experiment will end up wrong too.
I guess the context is important here. If the first experiment was wrong, and the second experiment is wrong, will you publish the failure of the second experiment? Will you also publish your suspicion that the first experiment was wrong? How likely are people to believe that your results prove the first experiment was wrong, if you did something else?
Here is what the selection bias will do otherwise:
20 people will try 20 “second experiments” with p = 0.05. 19 of them will fail; one will succeed and publish the results of their successful second experiment. Then, using the same strategy, 20 people will try 20 “third experiments”, and again, one of them will succeed… Ten years later, you can have a dozen experiments examining and confirming the theory from a dozen different angles, so the theory seems completely solid.
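A toy version of that process, assuming the underlying theory is simply false (so every p-value is drawn under the null) and that only significant results get written up, might look like this:

```python
# Sketch of the selection effect described above: every experiment tests a
# premise that is actually false, only "significant" results get published,
# and each generation builds on the last. The published record still
# accumulates confirmations while the failures stay invisible.
import random

random.seed(0)
alpha, labs_per_generation, generations = 0.05, 20, 10  # assumed, as in the example

published, unpublished = 0, 0
for generation in range(generations):
    for lab in range(labs_per_generation):
        p_value = random.random()  # null is true, so p ~ Uniform(0, 1)
        if p_value < alpha:
            published += 1         # a "successful followup" enters the literature
        else:
            unpublished += 1       # a failure, quietly dropped

print(f"published confirmations of a false theory: {published}")    # ~10 expected
print(f"failed attempts nobody ever sees:          {unpublished}")  # ~190 expected
```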
It’s kind of like how some of the landmark studies on priming failed to replicate, but there are so many followup studies which are explained by priming really well that it seems a bit silly to throw out the notion of priming just because of that.
Is there a chance that the process I described was responsible for this?
I guess the context is important here. If the first experiment was wrong, and the second experiment is wrong, will you publish the failure of the second experiment? Will you also publish your suspicion that the first experiment was wrong? How likely are people to believe that your results prove the first experiment was wrong, if you did something else?
In practice, individual scientists like to be able to say “my work causes updates”. If you do something that rests on someone else’s work and the experiment doesn’t come out, you have an incentive to say “Someonewrongonthenet’s hypothesis X implies A and B. Someonewrongonthenet showed A [citation], but I tried B and that means X isn’t completely right.”
Cue further investigation which eventually tosses out X. Whether or not A was a false positive is less important than whether or not X is right.
Is there a chance that the process I described was responsible for this?
Yes, that’s possible. I’m not sure direct replication actually solves that issue, though—you’d just shift over to favoring false negatives instead of false positives. The existing mechanism that works against this is the incentive to overturn other people’s work.
A mentor of mine once told me that replication is useful, but not the most useful thing you could be doing because it’s often better to do a followup experiment that rests on the premises established by the initial experiment. If the first experiment was wrong, the second experiment will end up wrong too. Science should not go even slower than it already does—just update and move on, don’t obsess.
Tell me, does anyone actually do what you think they should do? That is, based on a long chain of ideas A->B->C->D, none of which have been replicated, upon experimenting and learning ~Z, do they ever reject the bogus theory D? (Or wait, was it C that should be rejected, or maybe the ~Z should be rejected as maybe the experiment just wasn’t powered enough to be meaningful as almost all studies are underpowered or, can you really say that Z logically entailed A...D? Maybe some other factor interfered with Z and so we can ‘save the appearances’ of A..Z! Yes, that’s definitely it!) “Theory-testing in psychology and physics: a methodological paradox”, Meehl 1967, puts it nicely (and this is as true as the day he wrote it half a century ago):
This last methodological sin is especially tempting in the “soft” fields of (personality and social) psychology, where the profession highly rewards a kind of “cuteness” or “cleverness” in experimental design, such as a hitherto untried method for inducing a desired emotional state, or a particularly “subtle” gimmick for detecting its influence upon behavioral output. The methodological price paid for this highly-valued “cuteness” is, of course, (d) an unusual ease of escape from modus tollens refutation. For, the logical structure of the “cute” component typically involves use of complex and rather dubious auxiliary assumptions, which are required to mediate the original prediction and are therefore readily available as (genuinely) plausible “outs” when the prediction fails. It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc in order to avoid the latter’s modus tollens impact on the theory) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program”, without ever once refuting or corroborating so much as a single strand of the network.
To give a concrete example of why your advice is absurd and impractical and dangerous...
One of the things I am most proud of is my work on dual n-back not increasing IQ; the core researchers, in particular, the founder Jaeggi, are well aware that their results have not replicated very well and that the results are almost entirely explained by bad control groups, and this is in part thanks to increased sample size from various followup studies which tried to repeat the finding while doing something else like an fMRI study or trying an emotional processing variant. So, what are they doing now, the Buschkuehl lab and the new Jaeggi lab? Have they abandoned DNB/IQ, reasoning that since “the first experiment was wrong, the second experiment will end up wrong too”? Have they taken your advice to “just update and move on, don’t obsess”? Maybe taken serious stock of their methods and other results involving benefits to working memory training in general?
No. They are now busily investigating whether individual personality differences can explain transfer to IQ or not, whether other tasks can transfer, whether manipulating motivation can moderate transfer to IQ, and so on and so forth, and reaching p<0.05 and publishing papers just like they were before; but I suppose that’s all OK, because after all, “there are so many followup studies which are explained by [dual n-back transferring] really well that it seems a bit silly to throw out the notion of [dual n-back increasing IQ] just because of that”.
Wait, I’m not sure we’re talking about the same thing. I’m saying direct replication isn’t the most useful way to spend time. You’re talking about systematic experiment design flaws.
According to your writing, the failures in this example stem from methodological issues (not using an active control group). A direct replication of the n-back-IQ transfer would have just hit p<.05 again, as it would have had the same methodological issues. Of course, if the methodological issue is not repaired, all subsequent findings will suffer from the same issues.
I’m strictly saying that direct replication isn’t useful. Rigorous checking of methods and doing it over again correctly where there is a failure in the documented methodology is always a good idea.
But the Jaeggi cluster also sometimes use active control groups, with various kinds of differences in the intervention, metrics, and interpretations. In fact, Jaeggi was co-author on a new dual n-back meta-analysis released this month*; the meta-analysis finds the passive-active difference I did, and you know what their interpretation is? That it’s due to the correlated classification of US vs international laboratories conducting particular experiments. (It never even occurred to me to classify the studies this way.) They note that sometimes psychology experiments reach different conclusions in other cultures/countries—which they do—so perhaps the lower results in American studies using active control groups are because Americans gain less from n-back training. The kindest thing I can say about this claim is that I may be able to falsify it with my larger collection of studies (they threw out or missed a lot).
So, after performing these conceptual extensions of their results—as you suggest—they continue to
...slowly wend [their] way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program”, without ever once refuting or corroborating so much as a single strand of the network.
So it goes.

* http://www.gwern.net/docs/dnb/2014-au.pdf / https://pdf.yt/d/VMPWmd0jpDYvZIjm / https://dl.dropboxusercontent.com/u/85192141/2014-au.pdf ; initial comments on it: https://groups.google.com/forum/#!topic/brain-training/GYqqSyfqffA

The first sentence in your dual-n-back article is:

If you believe that there’s a net gain of medium effect size then why do you think we should throw dual n-back under the bus?

You should probably have read part of the second sentence: “active vs passive control groups criticism: found, and it accounts for most of the net effect size”.
If the first experiment was wrong, the second experiment will end up wrong too
This is not good, and I guess it is not what he meant.
You design the second experiment so that it aims to find something assuming the first was right, but if the first was wrong, it can expose that too. Basically, it has to be a stronger experiment than the first one.

Agreed, that is a better way to say what I was trying to say.
A mentor of mine once told me that replication is useful, but not the most useful thing you could be doing because it’s often better to do a followup experiment that rests on the premises established by the initial experiment. If the first experiment was wrong, the second experiment will end up wrong too. Science should not go even slower than it already does—just update and move on, don’t obsess.
If you’re concerned about the velocity of scientific progress, you should also be concerned about wrong turns. A Type 1 Error (establishing a wrong result by incorrectly rejecting a null hypothesis) is, IMHO, far more damaging to science than failure to establish a correct result—possibly due to an insufficient experimental setup.
Yeah, there’s definitely an “exploration / rigor” trade-off here (or maybe “speed / accuracy”) and I’m not sure it’s clear which side we are erring on right now. I’m not terribly surprised that LW favors rigor, just due to the general personality profile of the users here, and that my favoring of exploration at the cost of being wrong a few times is in the minority.
I definitely think a rational agent would be more exploratory than science currently is, but on the other hand we’ve got systematic biases to contend with and rigor might offset that.
Emotionally though… when you get a positive result in breast cancer screening even when you’re not at risk, you don’t just shrug and say “probably a false positive” even though it is. Instead, you irrationally do more screenings and possibly get a needless operation.
If you get a positive result, you run another test. If you keep getting positive results, you probably have breast cancer.
Similarly, if an experiment fails to replicate, you try again. If it replicates this time, then it’s probably fine. If it keeps failing to replicate, then there’s a problem.
At the very least, you need to try to replicate a random sample of studies, just to make sure there aren’t more false studies than you’ve been assuming.
Not an expert on cancer, but I don’t think it works that way. I think the cancer test accurately measures a variable which is a proxy for cancer risk. So a patient who doesn’t have cancer but tests positive will continue testing positive, because the variable that the cancer test measures as a proxy for cancer is elevated in that patient.
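Here is a small sketch of that correlation point, with all numbers assumed for illustration: a stable patient-specific proxy plus a little measurement noise, compared with a test based on an unrelated proxy:

```python
# Sketch (assumed numbers): if a test keys off a stable proxy variable, a
# healthy patient with an elevated proxy keeps testing positive, so repeating
# the *same* test barely updates anything; a test based on an independent
# proxy behaves much more like a fresh piece of evidence.
import numpy as np

rng = np.random.default_rng(0)
patients = 100_000
threshold = 2.0  # assumed cutoff for a "positive" result

proxy = rng.normal(0, 1, patients)             # stable, patient-specific proxy level
noise = lambda: rng.normal(0, 0.2, patients)   # small per-measurement error

first_positive = proxy + noise() > threshold
same_test_again = proxy + noise() > threshold                        # same proxy reused
independent_test = rng.normal(0, 1, patients) + noise() > threshold  # unrelated proxy

print("P(2nd positive | 1st positive), same test:       ",
      same_test_again[first_positive].mean())   # high: repeat results are correlated
print("P(2nd positive | 1st positive), independent test:",
      independent_test[first_positive].mean())  # near the low base rate
```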
Experiments do work that way, but I’m not arguing against that. I’m only arguing that direct replication isn’t a better use of resources than just going on to a followup experiment with a different methodology (unless direct replication is really easy and you can just have some students do it or something).
Is there only one kind of test? Couldn’t they find another proxy?
I’m only arguing that direct replication isn’t a better use of resources than just going on to a followup experiment with a different methodology
If the followup is testing the same thing with a different methodology, then the metaphor works. If you run followup experiments just to find more detail, it would be like someone testing positive for cancer and then running a test for what kind of cancer. You’re assuming they have cancer when you run the second test, so the results could be misleading.
If the followup is testing the same thing with a different methodology, then the metaphor works.
Generally an idea is considered well supported when multiple methodologies support it, yes. In the psychology lab I used to work in, at least, we never try to replicate, but we do try to show the same thing in multiple different ways. There are maybe 15 different experiments a year, but they’re generally all centered around proving or disproving a cluster of 2 or 3 broad, conceptually linked hypotheses.
Biology labs I’ve worked with do often do the whole “okay, the results are in and this is established now, let’s find additional detail” thing, but that’s because they were usually looking at much simpler systems, like a single protein or something, so they could afford to take liberties and not be so paranoid about experimental methods.
...and now you have two problems X-)

Science should not go even slower than it already does—just update and move on, don’t obsess.

It’s not a matter of speed, it’s a matter of velocity. Going fast in the wrong direction is (much) worse than useless.

while you are unlikely to hit statistical significance where there is no real result

You are quite likely. You start with a 5% chance under ideal circumstances and that chance only climbs from there. P-hacking is very widespread.

8-0 You think getting additional screenings after testing positive for cancer is “irrational”??

The process of screening itself involves risks, not to mention the misplaced stress and possibility of unnecessary surgery.

This is true for e.g. any visit to the doctor. Are you saying that it’s irrational to go for medical checkups?

In the cancer screening case, what do you think the cost-benefit analysis says?

It would be irrational to go for medical checkups when they aren’t necessary—if you did it every 3 days, for example.
I’m looking at this from a bird’s-eye view. A lot of people get unnecessary screenings, which give them information that is not worth acting upon whether the result is positive or negative, and then start worrying and getting unnecessary testing and treatment. Information is only useful to the extent that you can act upon it.
And from up there you take it upon yourself to judge whether personal decisions are rational or not? I think you’re way too far away for that.
A lot of people get unnecessary screenings
That’s a different issue. In a post upstream you made a rather amazing claim that additional tests after testing positive for cancer on a screening would be irrational. Do you stand by that claim?
And from up there you take it upon yourself to judge whether personal decisions are rational or not? I think you’re way too far away for that.
Er...I think that’s a little harsh of you. Overscreening is recognized as a problem among epidemiologists. When I say overscreening is a problem, I’m mostly just trusting expert consensus on the matter.
That’s a different issue. In a post upstream you made a rather amazing claim that additional tests after testing positive for cancer on a screening would be irrational. Do you stand by that claim?
I stand by that a lot of smart people who study this issue believe that in actual medical practice, these screenings are either a problem in themselves, or that the information from the screenings can lead people to irrational behavior, and I do trust them.
But really, that was just an illustrative example used to steelman Mitchel. You don’t have to accept the actual example, just the general concept that this sort of thing can happen.
Overscreening is recognized as a problem among epidemiologists.
Rationality does not specify values. I rather suspect that the cost-benefit analysis that epidemiologists look at is quite different from the cost-benefit analysis that individuals look at.
these screenings are either a problem in themselves, or that the information from the screenings can lead people to irrational behavior
LOL. Don’t bother your pretty little head with too much information. No, you don’t need to know that. No, you can’t decide what you need to know and what you don’t need to know. X-/
Scientists, as a community of humans, should expect their research to return false positives sometimes, because that is what is going to happen, and they should publish those results. Scientists should also expect experiments to demonstrate that some of their hypotheses are just plain wrong. It seems to me replication is only not very useful if the replications of the experiment are likely prone to all the same crap that currently makes original experiments from social psychology not all that reliable. I don’t have experience, or practical knowledge of the field, though, so I wouldn’t know.