John, I consider myself a ‘Bayesian wannabe’ and my favorite author thereon is E. T. Jaynes. As such, I follow Jaynes in vehemently denying that the posterior probability following an experiment should depend on “whether Alice decided ahead of time to conduct 12 trials or decided to conduct trials until 3 successes were achieved”. See Jaynes’s Probability Theory: The Logic of Science.
The 0.05 significance level is not just “arbitrary”, it is demonstrably too high—in some fields the actual majority of “statistically significant” results fail to replicate, but the failures to replicate don’t get into the prestigious journals, and are not talked about and remembered.
I follow Jaynes in vehemently denying that the posterior probability following an experiment should depend on “whether Alice decided ahead of time to conduct 12 trials or decided to conduct trials until 3 successes were achieved”.
I’m sorry, that seems just wrong. The statistics work if there’s an unbiased process that determines which events you observe. If Alice conducts trials until 3 successes are achieved, that’s a biased process that’s sure to ensure that the data ends with at least one success.
Surely you accept that if Alice conducts 100 trials and only gives you the successes, you’ll get the wrong result no matter the statistical procedure used, so you can’t say that biased data collection is irrelevant. You have to either claim that continuing until 3 successes were achieved is an unbiased process, or retreat from the claim that that procedure for collecting the data does not influence the correct interpretation of the results.
The universe doesn’t care about Alice’s intentions. The trials give information and that information would have been the same even if the trials were run because a rock fell on Alice’s keyboard when she wasn’t watching.
Surely you accept that if Alice conducts 100 trials and only gives you the successes, you’ll get the wrong result no matter the statistical procedure used
so you can’t say that biased data collection is irrelevant.
Here is where the mistake starts creeping in. You are setting up “biased data collection” to mean selective reporting: cherry-picking the trials that succeed while discarding the trials that do not. But in Alice’s case all of the evidence is being considered.
You have to either claim that continuing until 3 successes were achieved is an unbiased process, or retreat from the claim that that procedure for collecting the data does not influence the correct interpretation of the results.
The necessary claim is “continuing until 3 successes are achieved does not produce biased data”, which is true.
This is a question that is empirically testable. Run a simulation of agents that try to guess, say, which of a set of weighted dice are in use. Pit your ‘care what Alice thinks’ agents against the Bayesian agent. Let them bet among themselves. See which one ends up with all the money.
I thought the exact same thing, and wrote a program to test it. Program is below:
from random import random
p_success = 0.10
def twelve_trials(p_success=0.25):
    # Runs twelve trials, counts the successes
    success_count = 0
    for i in range(12):
        if random() < p_success:
            success_count += 1
    return success_count

def trials_until_3(p_success=0.25):
    # Runs trials until it hits three successes, counts the trials
    success_count = 0
    num_trials = 0
    while success_count < 3:
        if random() < p_success:
            success_count += 1
        num_trials += 1
    return num_trials

for i in range(100):
    num_tests = 10000

    twelve_trials_successes = 0
    for j in range(num_tests):
        # See how often there are at least 3 successes in 12 trials
        twelve_trials_successes += (twelve_trials(p_success) >= 3)

    trials_until_3_successes = 0
    for j in range(num_tests):
        # See how often 3 successes happen in 12 trials or fewer
        trials_until_3_successes += (trials_until_3(p_success) <= 12)

    print('{0}\t{1}'.format(twelve_trials_successes, trials_until_3_successes))
Turns out they actually are equivalent. I tested with all manner of probabilities of success. Obviously, if what you’re actually doing is running a set number of trials in one case and running trials until you reach significance or give up in the second case, you will come up with different results. However, if you have a set number of trials and a success threshold fixed beforehand, it doesn’t matter whether you run all the trials or just run until you hit the success threshold (which actually seems fairly obvious in retrospect).
Edit: formatting sucks
Actually, it’s quite interesting what happens if you run trials until you reach significance. Turns out that if you want a fraction p of all trials you do to end up positive, but each trial only ends up positive with probability q<p, then with some positive probability (a function of p and q) you will have to keep going forever.
(This is a well-known result if p=1/2. Then you can think of the trials as a biased random walk on the number line, in which you go left with probability q<1/2 and right otherwise, and you want to return to the place you started. The probability that you’ll ever return to the origin is 2q, which is less than 1.)
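The 2q figure is easy to check by simulation. Here is a quick sketch (my own, not from the thread); the function name and parameter choices are just illustrative, and truncating each walk at a finite number of steps slightly undercounts returns, though with these settings the effect is negligible.

from random import random

def returns_to_origin(q, max_steps=1000):
    # Step down with probability q, up otherwise; report whether the walk
    # ever gets back to its starting point within max_steps steps.
    position = 0
    for _ in range(max_steps):
        position += -1 if random() < q else 1
        if position == 0:
            return True
    return False

q = 0.3
n_walks = 10000
estimate = sum(returns_to_origin(q) for _ in range(n_walks)) / float(n_walks)
print('estimated return probability: {0:.3f} (2q = {1:.3f})'.format(estimate, 2 * q))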
Ah, but that’s not what it means to run until significance—in my interpretation in any case. A significant result would mean that you run until you have either p < 0.005 that your hypothesis is correct, or p < 0.005 that it’s incorrect. Doing the experiment in this way would actually validate it for “proof” in conventional Science.
Since he mentions “running until you’re bored”, his interpretation may be closer to yours though.
Obviously, if what you’re actually doing is running a set number of trials in one case and running trials until you reach significance or give up in the second case, you will come up with different results.
I don’t believe this is true. Every individual trial is individual Bayesian evidence, unrelated to the rest of the trials except in the fact that your priors are different. If you run until significance you will have updated to a certain probability, and if you run until you’re bored you’ll also have updated to a certain probability.
Sure, if you run a different number of trials, you may end up with a different probability. At worst, if you keep going until you’re bored, you may end up with results insignificant by the strict rules of “proof” in Science. But as long as you use Bayesian updating, neither method produces some form of invalid results.
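To make the “each trial is its own evidence” point concrete, here is a minimal sketch (mine, using a uniform Beta(1, 1) prior chosen purely for illustration). The posterior over the success probability depends only on how many successes and failures were observed, so any two stopping rules that could have produced the same outcome sequence, for instance “run exactly 12 trials” and “stop at the third success”, leave you with exactly the same posterior.

def update(prior, outcome):
    # prior is a pair (a, b) of Beta parameters; outcome is True for a success
    a, b = prior
    return (a + 1, b) if outcome else (a, b + 1)

# One possible outcome sequence: 3 successes in 12 trials, with the last trial
# a success, so it could have come from either stopping rule.
data = [False, True, False, False, True, False,
        False, False, False, False, False, True]

posterior = (1, 1)   # uniform Beta(1, 1) prior, purely for illustration
for outcome in data:
    posterior = update(posterior, outcome)

print(posterior)     # (4, 10), i.e. Beta(4, 10): 3 successes and 9 failures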
which actually seems fairly obvious in retrospect
Ding ding ding! That’s my hindsight-bias-reminder-heuristic going off. It tells me when I need to check myself for hindsight bias, and goes off on thoughts like “That seems obvious in retrospect” and “I knew that all along.” At the risk of doing your thinking for you, I’d say this is a case of hindsight bias: It wasn’t obvious beforehand, since otherwise you wouldn’t have felt the need to do the test. This means it’s not an obvious concept in the first place, and only becomes clear when you consider it more closely, which you did. Then saying that “it’s obvious in retrospect” has no value, and actually devalues the time you put in.
formatting sucks
Try this:
To make a paragraph where your indentation is preserved and no characters are treated specially, precede each line with (at least) four spaces. This is commonly used for computer program source code.
(From the Comment Formatting Help)
I don’t believe this is true. Every individual trial is individual Bayesian evidence, unrelated to the rest of the trials except in the fact that your priors are different. If you run until significance you will have updated to a certain probability, and if you run until you’re bored you’ll also have updated to a certain probability.
You have to be very careful you’re actually asking the same question in both cases. In the case I tested above, I was asking exactly the same question (my intuition said very strongly that I wasn’t, but that’s because I was thinking of the very similar but subtly different question below). The “fairly obvious in retrospect” refers to that particular phrasing of the problem (I would have immediately understood that the probabilities had to be equal if I had phrased it that way, but since I didn’t, that insight was a little harder-earned).
The question I was actually thinking of is as follows.
Scenario A: You run 12 trials, then check whether your odds ratio reaches significance and report your results.
Scenario B: You run trials until either your odds ratio reaches significance or you hit 12 trials, then report your results.
I think scenario A is different from scenario B, and that’s the one I was thinking of (it’s the “run subjects until you hit significance or run out of funding” model).
A new program confirms my intuition about the question I had been thinking of when I decided to test it. I agree with Eliezer that it shouldn’t matter whether the researcher goes to a certain number of trials or a certain number of positive results, but I disagree with the implication that the same dataset always gives you the same information.
The program is here; you can fiddle with the parameters if you want to look at the result yourself.
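Since the linked program isn’t reproduced here, the sketch below is my own rough reconstruction of the scenario A versus scenario B comparison, not the original code. The “significance” rule is an arbitrary one picked for illustration: a one-sided binomial test of the successes seen so far against a null of p0 = 0.5 at the 0.05 level.

from random import random
from math import factorial

def binom_tail(k, n, p0=0.5):
    # P(at least k successes in n trials) under the null success rate p0
    total = 0.0
    for j in range(k, n + 1):
        total += factorial(n) // (factorial(j) * factorial(n - j)) * p0**j * (1 - p0)**(n - j)
    return total

def significant(successes, trials):
    return trials > 0 and binom_tail(successes, trials) < 0.05

def scenario_a(p, max_trials=12):
    # Run all 12 trials, check significance once at the end
    successes = sum(random() < p for _ in range(max_trials))
    return significant(successes, max_trials)

def scenario_b(p, max_trials=12):
    # Check significance after every trial and stop as soon as it is reached
    successes = 0
    for t in range(1, max_trials + 1):
        successes += random() < p
        if significant(successes, t):
            return True
    return False

p_true = 0.5   # the null is actually true, so "significant" means a false positive
n_runs = 20000
print('{0}\t{1}'.format(sum(scenario_a(p_true) for _ in range(n_runs)),
                        sum(scenario_b(p_true) for _ in range(n_runs))))

With these made-up settings, scenario B reports “significant” noticeably more often than scenario A even though the null is true, which illustrates the sense in which the two scenarios differ.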
formatting sucks
Try this:
I did. It didn’t indent properly. I tried again, and it still doesn’t.
If Alice decides to conduct 12 trials, then the sampling distribution of the data is the binomial distribution. If Alice decides to sample until 3 successes are achieved, then the sampling distribution of the data is the negative binomial distribution. These two distributions are proportional when considered as functions of the parameter p (i.e., as likelihood functions). So in this specific case, from a Bayesian point of view the sampling mechanism does not influence the conclusions. (This is in contradistinction to inference based on p-values.)
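For concreteness, here is a small numerical check of that proportionality (my own sketch; the function names are just illustrative). For 3 successes in 12 trials, the binomial likelihood is C(12,3) p^3 (1-p)^9 and the negative binomial likelihood is C(11,2) p^3 (1-p)^9, so their ratio is the constant 220/55 = 4 no matter what p is.

from math import factorial

def choose(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

def binomial_likelihood(p, n=12, k=3):
    # Probability of exactly k successes in a fixed n trials
    return choose(n, k) * p**k * (1 - p)**(n - k)

def neg_binomial_likelihood(p, n=12, k=3):
    # Probability that the k-th success arrives on trial n
    return choose(n - 1, k - 1) * p**k * (1 - p)**(n - k)

for p in (0.1, 0.25, 0.5, 0.9):
    # The ratio is 4.0 for every p, so the two likelihoods carry the same
    # information about p and give the same posterior under any prior.
    print('{0}\t{1}'.format(p, binomial_likelihood(p) / neg_binomial_likelihood(p)))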
In general, you are correct to say that biased data collection is not irrelevant; this idea is given a complete treatment in Chapter 6 (or 7, I forget which) of Gelman et al.’s Bayesian Data Analysis, 2nd ed.