The difference is that in your example we got different sets of data and simply discarded some of the data from one of them to make them look the same. In the original we got the same set of data by the same method: everything that happened in the real world was the same, and the only difference was in counterfactual scenarios.
Yeah, but our perception is the same, no? Besides, in a sense, the original researcher also discards bits of data: he discards all possible stopping points that do not confirm his hypothesis, and all those after his hypothesis has been “confirmed”.
He does not discard anything that actually happened.
This is the key difference. We are evaluating the effectiveness of the drug by looking at what the drug actually did, not what it could have done.
I can give a much more precise mathematical proof if you want.
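A rough sketch of how such a proof might go, under stated assumptions: the notation here ($x_{1:n}$ for the first $n$ results, $\theta$ for the drug's effectiveness, and a stop indicator $s_k$) is introduced purely for illustration and is not from the thread. It assumes the decision to stop after test $k$ depends only on the results already observed. In that case the stopping rule contributes a factor that does not involve $\theta$, so it cancels from any likelihood ratio.

```latex
% s_k(x_{1:k}) \in \{0,1\} equals 1 if the rule says "stop after test k"
% given the results so far (an assumed, deterministic stopping rule).
% Probability of observing exactly x_{1:n} and stopping at n:
P(x_{1:n}, \text{stop at } n \mid \theta)
  = \Bigl[\prod_{k=1}^{n-1} \bigl(1 - s_k(x_{1:k})\bigr)\Bigr] \, s_n(x_{1:n})
    \prod_{i=1}^{n} p(x_i \mid x_{1:i-1}, \theta)

% The stopping factors do not depend on \theta, so for two hypotheses
% \theta_1 and \theta_2 (and data the rule actually allows) they cancel:
\frac{P(x_{1:n}, \text{stop at } n \mid \theta_1)}
     {P(x_{1:n}, \text{stop at } n \mid \theta_2)}
  = \frac{\prod_{i=1}^{n} p(x_i \mid x_{1:i-1}, \theta_1)}
         {\prod_{i=1}^{n} p(x_i \mid x_{1:i-1}, \theta_2)}
```

The right-hand side is the same ratio a researcher would get from the same $n$ results after precommitting to exactly $n$ tests, which is the sense in which only what actually happened matters.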
Let’s imagine a scientist did 500 tests. Then he started discarding tests, from the end, until the remaining data supported some hypothesis (or he ran out of tests). Is this to be treated as evidence of the same strength as it would be if he had precommitted to only doing that many tests?
I may be wrong here because I’m tired, but I think the way the maths comes out is that this would be just as strong if he only removed tests from the end, whereas if he removed them from anywhere he chose, depending on how they came out, it would not be as strong.