Registering my predictions now, before going to investigate this issue further (I have MacKay's book within hand's reach and would also like to run some Monte Carlo simulations to check the results; I'll post the resolution later):
a) It seems that we ought to treat the results differently, because the second researcher in effect admits to p-hacking his results. b) But on the other hand, what if we modify the scenario slightly: suppose we get the results from both researchers one patient at a time. Surely we ought to update our beliefs by the same amount each time? And so by the time we get the 100th individual result from each researcher, the posteriors should be the same, even if we then find out that the researchers had different stopping criteria.
My prediction is that argument a) turns out to be right and argument b) contains some subtle mistake.
Update: a) is just wrong and b) is right, though unsatisfying, because it doesn't address the underlying intuition that the stopping criterion ought to matter. I'm very glad that I decided to investigate this issue in full detail and run my own simulations instead of just accepting some general principle from either side.
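(For concreteness, here is the bookkeeping behind b), in my own gloss and assuming a uniform prior: each recovery multiplies the posterior density by $x$ and each failure by $(1-x)$, so after 70 recoveries and 30 failures, in any order, both researchers' data leave us with the same posterior,

$$p(x \mid \text{data}) \propto x^{70}(1-x)^{30},$$

no matter why either researcher stopped at patient 100.)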
MacKay presents it as a conflict between frequentism and Bayesianism and argues that frequentism is wrong. But I started out with a Bayesian model and still felt that motivated stopping should have some influence. So I'm going to try to articulate the best argument for why the stopping criterion must matter, and then explain why it fails.
First of all, the scenario doesn't describe exactly what the stopping criterion was, so I made one up: the (second) researcher treats patients and gets the results one at a time. He has some particular threshold for the probability that the treatment is >60% effective, and he is going to stop and report the results the moment that probability reaches the threshold. He derives the probability by calculating a beta distribution for the data and integrating it from 0.6 to 1 (for those who are unfamiliar with the beta distribution, I recommend this excellent video by 3Blue1Brown). In this case the likelihood of seeing the data given an underlying success probability $x$ is $f(x)=\binom{100}{70}x^{70}(1-x)^{30}$, and the probability that the treatment is >60% effective is $\alpha := \int_{0.6}^{1} f(x)\,dx \,\big/ \int_{0}^{1} f(x)\,dx$ (with $f$ normalized like this, it is just the $\mathrm{Beta}(71,31)$ posterior under a uniform prior).
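Here is a minimal sketch of that stopping rule in code (my own; the post only fixes the criterion informally, so the uniform prior, the 100-patient cap, and the function names are assumptions):

```python
# A minimal sketch of the stopping rule above. Assumes a uniform prior, so
# after s successes and f failures the posterior on the effectiveness x is
# Beta(s + 1, f + 1).
import numpy as np
from scipy.stats import beta

def prob_effective(successes, failures, cutoff=0.6):
    """P(x > cutoff) under the Beta(successes + 1, failures + 1) posterior."""
    return beta.sf(cutoff, successes + 1, failures + 1)  # sf = 1 - cdf

def run_trial(true_x, threshold, max_patients=100, seed=None):
    """Treat patients one at a time; stop as soon as P(x > 0.6) >= threshold."""
    rng = np.random.default_rng(seed)
    s = f = 0
    for n in range(1, max_patients + 1):
        if rng.random() < true_x:
            s += 1
        else:
            f += 1
        if prob_effective(s, f) >= threshold:
            break  # the researcher stops and publishes whatever he has
    return n, s, f

print(prob_effective(70, 30))  # the alpha reached at exactly 70/30, ~0.98
```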
Now the argument: motivated stopping ensures that we don't just get 70 successes and 30 failures. We have the additional constraint that after each of the first 99 outcomes the probability is strictly $<\alpha$, and only after the 100th patient does it reach $\alpha$. Surely then, we must modify $f(x)$ to reflect this constraint. And if the true probability really was >60%, then surely there are many Everett branches where the probability reaches $\alpha$ before we ever get to the 100th patient. If it really took this long, it must be because it's actually less likely that the true probability is >60%.
And indeed, the likelihood of seeing 70 successes and 30 failures under such a stopping criterion is less than the value initially given by $f(x)$. BUT! The constraint is independent of the probability $x$! It is purely about the order in which the outcomes appear. In other words, it changes the constant $\binom{100}{70}$, which originally counted all the different ways to order 70 positive and 30 negative outcomes. And this constant reduces the likelihood for every probability equally! It doesn't reduce it more in universes where $x>0.6$ than in universes where $x\le 0.6$. This means that the shape of the original distribution stays the same; only the amplitude changes. But because we condition on seeing 70 successes and 30 failures anyway, the area under the curve must still equal 1. So we have to re-normalize $f(x)$, and we end up with the same distribution, proportional to $x^{70}(1-x)^{30}$, again!
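A quick Monte Carlo check of that claim (my own sketch; I assume the threshold is exactly the $\alpha$ reached at 70/30, and I exploit the fact that, conditioned on the counts, every ordering of the outcomes is equally likely whatever $x$ is):

```python
# Monte Carlo sketch of the argument above. Conditioned on 70 successes and
# 30 failures, every ordering of the outcomes is equally likely no matter
# what x is, so the fraction c of orderings satisfying the stopping
# constraint is one x-independent constant, and the constrained likelihood
# is c * C(100,70) * x**70 * (1-x)**30 for EVERY x.
import numpy as np
from scipy.stats import beta

ALPHA = beta.sf(0.6, 71, 31)  # P(x > 0.6) after 70 successes, 30 failures

def never_stops_early(seq):
    """True iff the running P(x > 0.6) stays strictly below ALPHA before patient 100."""
    s = np.cumsum(seq)
    f = np.arange(1, len(seq) + 1) - s
    running = beta.sf(0.6, s + 1, f + 1)
    return bool(np.all(running[:-1] < ALPHA))

rng = np.random.default_rng(0)
outcomes = np.array([1] * 70 + [0] * 30)

trials = 20_000
hits = sum(never_stops_early(rng.permutation(outcomes)) for _ in range(trials))
print(f"constrained fraction c ~ {hits / trials:.3f}")
# Multiplying every x's likelihood by the same c changes nothing after
# re-normalization: the posterior stays proportional to x**70 * (1-x)**30.
```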
Another way to think about it is that the stopping criterion is not entangled with the actual underlying probability in a given universe: there is zero mutual information between the stopping criterion and $x$. And yes, if this were not the case, if for example the researcher had decided that he would also treat one more patient after reaching the threshold $\alpha$ and only publish the results if this patient recovered (without mentioning him in the report), then it would absolutely affect the results, because a positive outcome for that patient is more likely in universes where $x>0.6$. But then it also wouldn't be purely about the researcher's state of mind; we would have an additional, hidden data point.
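To make the contrast concrete, here is the biased variant in the same style (again my own toy model): $P(\text{publish} \mid x) = x$, so conditioning on seeing the report multiplies the likelihood by an extra factor of $x$:

```python
# Sketch of the biased variant above: the researcher treats a hidden 101st
# patient and publishes the 70/30 report only if that patient recovered.
# Then P(publish | x) = x, so the published report's likelihood is
# proportional to x**71 * (1-x)**30, i.e. a Beta(72, 31) posterior instead
# of Beta(71, 31).
from scipy.stats import beta

naive = beta.sf(0.6, 71, 31)  # reading the report at face value
aware = beta.sf(0.6, 72, 31)  # accounting for the publication filter
print(f"P(x > 0.6): naive {naive:.4f}, filter-aware {aware:.4f}")
# Unlike the pure stopping rule, this filter is entangled with x (it favors
# universes where x is large), so it genuinely shifts the posterior.
```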