Well, perhaps a bit too simple. Consider this. You set your confidence level at 95% and start flipping a coin. You observe 100 tails out of 100. You publish a report saying “the coin has tails on both sides at a 95% confidence level” because that’s what you chose during design. Then 99 other researchers repeat your experiment with the same coin, arriving at the same 95%-confidence conclusion. But you would expect to see about 5 reports claiming otherwise! The paradox is resolved when somebody comes up with a trick using a mirror to observe both sides of the coin at once, finally concluding that the coin is two-tailed with 100% confidence.
What was the mistake?
I don’t know if the original post was changed, but it explicitly addresses this point:
The actual situation is described this way:
I have a coin which I claim is fair: that is, there is equal chance that it lands on heads and tails, and each flip is independent of every other flip.
But when we look at 60 trials of the coin flipped 5 times (that is, 300 total flips), we see that there are no trials in which either 0 heads were flipped or 5 heads were flipped. Every time, it’s 1 to 4 heads.
This is odd: for a fair coin, there’s a 6.25% chance that we would see 5 tails in a row or 5 heads in a row in a set of 5 flips. To not see that even once in 60 trials has a probability of only 2.1%, which is rather unlikely! We can state with some confidence that this coin does not look fair; there is some structure to it that suggests the flips are not independent of each other.
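Those two numbers are easy to check. A minimal sketch in Python (my own illustration, not part of the quoted post) reproduces them:

```python
# Chance that 5 flips of a fair coin come up all heads or all tails
p_extreme = 2 * 0.5 ** 5            # 2/32 = 0.0625, i.e. 6.25%

# Chance of seeing no all-heads and no all-tails trial in 60 independent trials
p_never_in_60 = (1 - p_extreme) ** 60
print(p_extreme)                    # 0.0625
print(round(p_never_in_60, 3))      # 0.021, i.e. about 2.1%
```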
One mistake is treating 95% as the chance of the study indicating two-tailed coins, given that they were two-tailed coins. More likely it was meant as the chance of the study not indicating two-tailed coins, given that they were not two-tailed coins.
Try this:
You want to test if a coin is biased towards heads. You flip it 5 times, and consider 5 heads as a positive result, 4 heads or fewer as negative. You’re aiming for 95% confidence but end up with 31/32 = 96.875%. Counting 4 heads as a positive result as well wouldn’t work either, as that would leave you with less than 95% confidence.
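To make those thresholds concrete, here is a small Python check (my own sketch, assuming the fair-coin null described above):

```python
from math import comb

per_sequence = 0.5 ** 5   # probability of any particular sequence of 5 fair flips

p_exactly_5_heads = comb(5, 5) * per_sequence                       # 1/32 = 0.03125
p_4_or_more_heads = sum(comb(5, k) for k in (4, 5)) * per_sequence  # 6/32 = 0.1875

print(1 - p_exactly_5_heads)   # 0.96875 -> rejecting only on 5 heads gives 96.875% confidence
print(1 - p_4_or_more_heads)   # 0.8125  -> also rejecting on 4 heads drops you to 81.25%
```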
This doesn’t seem like a good analogy to any real-world situation. The null hypothesis (“the coin really has two tails”) predicts the exact same outcome every time, so every experiment should get a p-value of 1, unless the null hypothesis is false, in which case someone will eventually get a p-value of 0. This is a bit of a pathological case which bears little resemblance to real statistical studies.
While the situation admittedly is oversimplified, it does seem to have the advantage that anyone can replicate it exactly at a very moderate expense (a two-headed coin will also do, with a minimum amount of caution). In that respect it may actually be more relevant to the real world than any vaccine/autism study.
Indeed, every experiment should get a pretty strong p-value (though never exactly 1), but what gets reported is not the actual p, only whether the confidence clears .95 (an arbitrary threshold once proposed by Fisher, who never intended it to play the role it currently plays in science, but meant it merely as a rule of thumb for judging whether a hypothesis is worth a follow-up at all). But even exact p-values refer to only one possible type of error, and the probability of the other is generally not (1-p), much less (1-alpha).
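To illustrate that last point with numbers (my own sketch, reusing the 5-flip heads-bias test proposed earlier in the thread, with purely hypothetical true biases):

```python
# Test: flip 5 times, declare "heads-biased" only on 5 heads.

# Type I error rate: probability of wrongly flagging a fair coin
alpha = 0.5 ** 5                         # 1/32, about 0.031

# Type II error rate: probability of missing a genuinely biased coin;
# it depends entirely on the (unknown) true bias, not on alpha or p
beta_if_heads_prob_075 = 1 - 0.75 ** 5   # about 0.763
beta_if_heads_prob_095 = 1 - 0.95 ** 5   # about 0.226

print(alpha, beta_if_heads_prob_075, beta_if_heads_prob_095)
```

Neither miss rate is anywhere near (1-p) or (1-alpha); controlling one type of error says very little about the other.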
I don’t see a paradox. After 100 experiments one can conclude that either (1) the confidence level was set too low, or (2) the papers are all biased toward two-tailed coins. But which is it?
(1) is obvious, of course, in hindsight; however, changing your confidence level after the observation is generally advised against. But (2) seems to confuse Type I and Type II error rates.
On another level, I suppose it can be said that of course they are all biased! But by the actual two-tailed coin, rather than by researchers’ prejudice against normal coins.
Neglecting all of the hypotheses that would produce the mirrored observation without the coin being two-tailed. The mistake in your question is the “the”: the final overconfidence is the least of the mistakes in the story.
Mistakes more relevant to practical empiricism: Treating “>= 95%” as “= 95%” is a reasoning error, resulting in overtly wrong beliefs. Choosing to abandon all information apart from the single boolean is a (less serious) efficiency error. Listeners can still be subjectively-objectively ‘correct’, but they will be less informed.
Hence my question in another thread: Was that “exactly 95% confidence” or “at least 95% confidence”? However, when researchers say “at a 95% confidence level” they typically mean “p < 0.05”, and reporting the actual p-values is often even explicitly discouraged (let’s not digress into whether that is justified).
Yet the mistake I had in mind (as opposed to other, less relevant, merely “a” mistakes) involves Type I and Type II error rates. Just because you are 95% (or more) confident of not making one type of error doesn’t guarantee you an automatic 5% chance of getting the other.