Emil, thanks, fixed.
Doug, your analogy is not valid, because a biased reporting method has a different likelihood function to the possible prior states than an unbiased one does. In this case, the single, fixed dataset that we see has a different likelihood to the possible prior states, depending on the reporting method.
If a researcher who happens to be thinking biased thoughts carries out a fixed sequence of experimental actions, the resulting dataset we see does not have a different likelihood function to the possible prior states. All that a Bayesian needs to know is the experimental actions that were actually carried out and the data that was actually observed—not what the researcher was thinking at the time, or what other actions the researcher might have performed if things had gone differently, or what other dataset might then have been observed. We need only consider the actual experimental results.
Londenio, see Ron’s comment—it’s not a strawperson.
Just a note here: the fact that a dataset has the same likelihood function regardless of the procedure that produced it is actually NOT a trivial statement—the way I see it, it is a somewhat deep result which follows from the optional stopping theorem and the fact that the likelihood function is bounded. Not trying to nitpick, just pointing out that there is something to think about here. According to my initial intuitions, this was actually rather surprising—I didn’t expect experimental results constructed using a biased procedure (in the sense of a non-fixed stopping time) to end up yielding unbiased results, even with full disclosure of all data.
It’s worth revising your intuitions if you found it surprising that a fixed physical act had the same likelihood to data regardless of researcher thoughts. It is indeed possible to see the mathematical result as “obvious at a glance”.
That’s not quite what I meant. It is not the experimenter’s thoughts that I am uncomfortable with; it is the collection of possible experimental outcomes.
I will try to illustrate with an example. Let us say that I toss a coin either (i) two times, or (ii) until it comes up heads. In the first case, the possible outcomes are HH, HT, TH, or TT; in the second case, they are H, TH, TTH, TTTH, TTTTH, etc. It isn’t obvious to me that a TH outcome has the same meaning in both cases. If, for instance, we were not talking about likelihood and instead decided to measure something else, e.g. the proportion of tosses landing on heads, this wouldn’t be the case: in scenario (i), the expected proportion of tosses landing on heads is 1/4 + (1/2)/4 + (1/2)/4 + 0/4 = 0.5, but in scenario (ii), it would be 1/2 + (1/2)/4 + (1/3)/8 + (1/4)/16 + … = log(2), i.e. a little under 0.7.
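(For anyone who wants to check both halves of this numerically, here is a minimal Python sketch. The function name and the truncation of the series are mine, purely for illustration; the numbers are the ones from the example above.)

```python
from fractions import Fraction

def likelihood_TH(p):
    """Probability of the exact sequence T, H given heads-probability p.
    It is (1 - p) * p under either stopping rule, so as a function of p
    the likelihood of the TH data is identical in scenarios (i) and (ii)."""
    return (1 - p) * p

# Scenario (i): exactly two tosses. Average the proportion of heads over
# the four equally likely outcomes HH, HT, TH, TT.
exp_prop_fixed = Fraction(1, 4) * (Fraction(1) + Fraction(1, 2) + Fraction(1, 2) + Fraction(0))
print(exp_prop_fixed)            # 1/2

# Scenario (ii): toss until the first head. The outcome T^(n-1) H has
# probability 2**-n and proportion of heads 1/n; the series converges to
# ln(2) ~ 0.693, so truncating it is enough for illustration.
exp_prop_until_head = sum((1 / n) * 0.5 ** n for n in range(1, 200))
print(exp_prop_until_head)       # ~0.6931

# The likelihood of the specific observed data "TH" is the same either way.
print(likelihood_TH(0.5))        # 0.25 under both stopping rules
```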
The TH outcome tells you the same thing about the coin, because the coin does not know what your plans were.
I’m convinced. Having thought about this a little more, I think I see the model you are working under, and it does make a good deal of intuitive sense.
Does the publication of the result tell you the same thing, since the fact that it was published is a result of the plans?
I think in this case, we are assuming total and honest reporting of results (including publication); otherwise, we would be back to the story of filtered evidence. Therefore, the publication is not a result of the plans—it was going to happen in either case.
Thanks, I understood the mathematical point, but I was wondering whether it has any practical significance, since it seems that in the real world we cannot make such an assumption, and that we should trust the results of the two researchers differently: the first researcher likely would have published no matter what, whereas the second probably only published the experiments which came out favorably (even if he didn’t publish false information). What is the practical import of this idea? In the real world, with all of people’s biases, shouldn’t we distinguish between the two researchers as a general heuristic for good research standards?
(If this is addressed in a different post on this site feel free to point me there since I have not read the majority of the site)
You can claim that it should have the same likelihood either way, but you have to put the discrepancy somewhere. Knowing the choice of stopping rule is evidence about the experimenter’s state of knowledge about the efficacy. You can say that it should be treated as a separate piece of evidence, or that knowing about the stopping rule should change your prior, but if you don’t bring it in somewhere, you’re ignoring critical information.
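(A minimal sketch, in Python, of where that information could enter a Bayesian update. Every number here is hypothetical, chosen only to show the mechanics, and it assumes for simplicity that the rule choice and the data are independent given the hypothesis.)

```python
# Hypothetical illustration of folding the stopping-rule choice into the update.
prior_effective = 0.5

# Likelihood of the observed data under each hypothesis -- by the argument
# above, this is the same whichever stopping rule produced it.
p_data_given_effective = 0.30
p_data_given_ineffective = 0.05

# Made-up extra piece of evidence: experimenters who doubt the treatment are
# assumed (for illustration only) to be more likely to pick the
# "stop when the ratio looks good" rule.
p_rule_given_effective = 0.2
p_rule_given_ineffective = 0.5

def posterior(include_rule: bool) -> float:
    """Posterior that the treatment is effective, with or without treating the
    rule choice as an additional observation."""
    like_e = p_data_given_effective * (p_rule_given_effective if include_rule else 1.0)
    like_i = p_data_given_ineffective * (p_rule_given_ineffective if include_rule else 1.0)
    num = like_e * prior_effective
    return num / (num + like_i * (1 - prior_effective))

print(posterior(include_rule=False))  # ~0.857: data alone
print(posterior(include_rule=True))   # ~0.706: data plus what the rule choice suggests
```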
No, in practical terms it’s negligible. There’s a reason that double-blind trials are the gold standard—it’s because doctors are as prone to cognitive biases as anyone else.
Let me put it this way: recently a pair of doctors looked at the available evidence and concluded (foolishly!) that putting fecal bacteria in the brains of brain cancer patients was such a promising experimental treatment that they did an end-run around the ethics review process—and after leaving that job under a cloud, one of them was still considered a “star free agent”. Well, perhaps so—but I think this little episode illustrates very well that a doctor’s unsupported opinion about the efficacy of his or her novel experimental treatment isn’t worth the shit s/he wants to place inside your skull.
Hold on: aren’t you saying the choice of experimental rule is VERY important (e.g. double-blind vs. not double-blind, etc.)?
If so, you are agreeing with VAuroch. You have to include the details of the experiment somewhere. The data does not speak for itself.
Of course experimental design is very important in general. But VAuroch and I agree that when two designs give rise to the same likelihood function, the information that comes in from the data is equivalent. We disagree about the weight to give to the information that comes in from what the choice of experimental design tells us about the experimenter’s prior state of knowledge.
Double-blind trials aren’t the gold standard; they’re the best available standard. They still fail to replicate far too often, because they don’t remove bias (and I’m not just referring to publication bias). Which is why, when considering how to interpret a study, you look at the history of what scientific positions the experimenter has supported in the past, and then update away from that to compensate for bias which you have good reason to think will show up in their data.
In the example, past results suggest that, even if the trial was double-blind, someone who is committed to achieving a good result for the treatment will get more favorable data than some other experimenter with no involvement.
And that’s on top of the trivial fact that someone with an interest in getting a successful trial is more likely to use a directionally-slanted stopping rule if they have doubts about the efficacy than if they are confident it will work, which is not explicitly relevant in Eliezer’s example.
I can’t say I disagree.
I think I figured out where the source of confusion is. From the wording of the problem I assume that:
The first researcher is going to publish anyway once he reaches 100 patients, no matter what the results are.
The second researcher will continue as long as he doesn’t meet his desired ratio, and had he not reached these results, he would have continued forever without publishing and we’d never even have heard of his experiment.
For the first researcher, a failure would update our belief in the treatment’s effectiveness downward and a success would update it upward. For the second researcher, a failure will not update our belief—because we wouldn’t even know the research existed—so for a success to update our belief upward would violate the Conservation of Expected Evidence.
But—if we do know about the second researcher’s experiment, we can interpret the fact that he didn’t publish as a failure to reach a sufficient ratio of success, and update our belief down—which makes it meaningful to update our belief up when he publishes the results.
So—it’s not about state of mind—it’s about the researchers’ actions in other Everett branches where their experiments failed.
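(Here is a tiny numerical check of that conservation argument, in Python. The probabilities of reaching the desired ratio under each hypothesis are made-up illustration values, not anything from the example.)

```python
# Check that knowing about the second researcher's experiment makes the
# updates balance out, as Conservation of Expected Evidence requires.
prior_effective = 0.5
p_publish_given_effective = 0.8    # hypothetical: effective treatments usually hit the ratio
p_publish_given_ineffective = 0.2  # hypothetical: ineffective ones rarely do

p_publish = (p_publish_given_effective * prior_effective
             + p_publish_given_ineffective * (1 - prior_effective))

post_if_published = (p_publish_given_effective * prior_effective) / p_publish
post_if_silent = ((1 - p_publish_given_effective) * prior_effective) / (1 - p_publish)

print(post_if_published)  # 0.8 -> belief goes up when he publishes
print(post_if_silent)     # 0.2 -> belief goes down when he stays silent

# The expected posterior equals the prior, so the upward update on publication
# is exactly paid for by the downward update on silence.
print(p_publish * post_if_published + (1 - p_publish) * post_if_silent)  # 0.5
```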
Great point but I worry that people will point to this post and say “See? Publication bias/questionable study design/corporate funding/varying peer review processes don’t matter!”
In other words, it’s good to strive for a fixed experimental process but reality is rarely that tidy.