I never got this. Surely, the two researchers’ data, while ostensibly the same, is in fact drawn from two different distributions? Let’s make the example a bit more brutal. Two researchers are given, in turn, the same coin. The first one, by coincidence, gets 100 heads. The second one has staked his career on the coin being weighted and silently discards tails results, of which he gets plenty. The two report the same evidence, but surely, once we learn about the two scientists and their predilections, we would evaluate their evidence differently? I mean, the second scientist’s results are most certainly not evidence for or against the coin being weighted, right? Similarly, if a scientist runs 500 different trials and only reports those that are statistically significant and in favor of his point, we have a higher expectation of finding that his results support his point, independent of whether his point is actually valid, no? How is the one-retrofitted-trial version any different?
The difference is that in your example we got different sets of data and simply discarded some of the data from one of them to make them look the same. In the original, we got the same set of data by the same method: everything that happened in the real world was the same, and the only difference was in counterfactual scenarios.
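To make the coin case concrete, here is a rough worked comparison. It assumes (this is not stated in the thread) that the second researcher keeps flipping until he has 100 heads to report, and it uses an illustrative weighted-coin hypothesis of $\theta = 0.9$ against a fair coin at $\theta = 0.5$, where $\theta$ is the probability of heads.

```latex
% Honest researcher: 100 flips, all reported, all heads.
\frac{P(\text{report} \mid \theta = 0.9)}{P(\text{report} \mid \theta = 0.5)}
  = \frac{0.9^{100}}{0.5^{100}} = 1.8^{100} \approx 3 \times 10^{25}

% Researcher who discards tails and flips until he has 100 heads:
% for any 0 < \theta \le 1 he ends up reporting "100 heads" with probability 1.
\frac{P(\text{report} \mid \theta = 0.9)}{P(\text{report} \mid \theta = 0.5)}
  = \frac{1}{1} = 1
```

Under these assumptions the first report shifts the odds enormously and the second does not shift them at all, because the discarded flips are part of what actually happened.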
Yeah, but our perception is the same, no? Besides, in a sense, the original researcher also discards bits of data: he discards all possible stopping points that do not confirm his hypothesis, and all those after his hypothesis has been “confirmed”.
He does not discard anything that actually happened.
This is the key difference. We are evaluating the effectiveness of the drug by looking at what the drug actually did, not what it could have done.
I can give a much more precise mathematical proof if you want.
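Not the proof offered above, but a minimal numerical sketch of the usual argument, under assumptions of my own: trials are independent yes/no outcomes with success probability `theta`, and the stopping rule looks only at the results so far. The function names, the rule "stop once there are at least 10 trials and a success rate of at least 70%", and the observed result of 14 successes in 20 trials are all hypothetical. The code computes the probability of that exact result under a fixed-size design and under optional stopping, then compares how strongly each favours `theta = 0.7` over `theta = 0.5`.

```python
from math import comb, isclose


def prob_fixed(n, k, theta):
    """P(k successes) when the design is: run exactly n trials."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def prob_optional(n, k, theta, should_stop):
    """P(the experiment stops at exactly n trials with k successes)
    when should_stop(trials, successes) is checked after every trial.
    Assumes n >= 1."""
    if not should_stop(n, k):
        return 0.0  # the researcher would not have stopped here
    # alive[s] = probability of having s successes so far without stopping yet
    alive = {0: 1.0}
    for t in range(1, n + 1):
        reached = {}
        for s, p in alive.items():
            reached[s + 1] = reached.get(s + 1, 0.0) + p * theta
            reached[s] = reached.get(s, 0.0) + p * (1 - theta)
        if t == n:
            return reached.get(k, 0.0)
        # drop every state in which the experiment would already have stopped
        alive = {s: p for s, p in reached.items() if not should_stop(t, s)}
    return 0.0


def rule(t, s):
    """Hypothetical stopping rule: at least 10 trials and >= 70% successes."""
    return t >= 10 and 10 * s >= 7 * t


n, k = 20, 14
ratio_fixed = prob_fixed(n, k, 0.7) / prob_fixed(n, k, 0.5)
ratio_optional = prob_optional(n, k, 0.7, rule) / prob_optional(n, k, 0.5, rule)
print(ratio_fixed, ratio_optional, isclose(ratio_fixed, ratio_optional))
```

The two designs give different probabilities for the result, since fewer sequences reach 14 out of 20 without stopping earlier, but that difference is a path count that does not depend on `theta`. It cancels in the ratio, so the data favour one hypothesis over the other by the same factor either way.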
Let’s imagine a scientist did 500 tests. Then he started discarding tests, from the end, until the remaining data supported some hypothesis (or he ran out of tests). Is this to be treated as evidence of the same strength as it would be if he had precommitted to only doing that many tests?
I may be wrong here because I’m tired, but I think the way the maths comes out is that this would be as strong if he only removed tests from the end, whereas if he removed them from anywhere he chose depending on how they came out it would not be as strong.