Let's do a check. Assume a worst-case scenario in which nobody publishes negative results at all.
To get three p < 0.05 studies when the hypothesis is false requires on average 3 ÷ 0.05 = 60 experiments. That is a lot, but it is within the realm of possibility if the issue is one many people are interested in, so there are still grounds for scepticism about this result.
To get one p < 0.001 study when the hypothesis is false requires on average 1 ÷ 0.001 = 1000 experiments. This is pretty implausible, so I would be much happier to treat this result as an indisputable fact, even in a field with many vested interests (assuming everything else about the experiment is sound).
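To spell out the arithmetic (my sketch, not part of the thread): the waiting time for k false positives at significance level α has mean k/α, which is where the 60 and the 1000 come from. A minimal Python check, assuming each null experiment independently clears the threshold with probability α:

```python
import random

def expected_experiments(k, alpha):
    """Expected number of null experiments needed to collect k false
    positives at level alpha (mean of a negative binomial distribution)."""
    return k / alpha

def simulate(k, alpha, trials=10_000):
    """Monte Carlo check: keep running null experiments until k of them
    come up 'significant' by chance; average over many repetitions."""
    total = 0
    for _ in range(trials):
        hits = experiments = 0
        while hits < k:
            experiments += 1
            if random.random() < alpha:  # a true-null experiment clears p < alpha
                hits += 1
        total += experiments
    return total / trials

print(expected_experiments(3, 0.05))   # 60.0   -> three p < 0.05 studies
print(expected_experiments(1, 0.001))  # 1000.0 -> one p < 0.001 study
print(round(simulate(3, 0.05)))        # ~60, agreeing with the closed form
```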
Running “1000 experiments” when you don’t have to publish negative results can mean just slicing a data set until you find something. Someone with a large enough data set can do this essentially 100% of the time.
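To make the slicing point concrete, here is a hypothetical sketch (the data set and subgroup sizes are invented for illustration): generate pure noise, test a hundred arbitrary subgroups, and count how often at least one clears p < 0.05:

```python
import math
import random

def slice_p_value(values):
    """Two-sided z-test that a slice of standard-normal noise has nonzero mean."""
    m = len(values)
    z = (sum(values) / m) * math.sqrt(m)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def finds_an_effect(n=1000, n_slices=100, slice_size=100, alpha=0.05):
    """Generate pure noise, test n_slices arbitrary subgroups against zero,
    and report whether any of them comes out 'significant'."""
    data = [random.gauss(0, 1) for _ in range(n)]
    return any(
        slice_p_value(random.sample(data, slice_size)) < alpha
        for _ in range(n_slices)
    )

hits = sum(finds_an_effect() for _ in range(200))
print(f"'found something' in {100 * hits / 200:.0f}% of pure-noise data sets")
```

The hundred tests are correlated because the slices overlap, but the “discovery” rate still comes out close to 100%, which is the point above.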
A replication is more informative, because it’s not subject to nearly as much “find something new and publish it” bias.
This assumes proper methodology and statistics, so that the p-value actually matches the chance of the result arising by coincidence. In practice, even your best judgment of the methodology cannot make you certain the experiment was sound, so I would say that a p-value of 0.001 constitutes considerably less than 10 bits of evidence: the odds that something was wrong with the experiment are better than the odds that the results were coincidental. Multiple experiments with a lower cumulative p-value can still be stronger evidence if they each adjust for different possible sources of error.
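One toy model for the “considerably less than 10 bits” point (my framing, with a made-up 1% flaw probability): if with probability q the experiment is flawed in a way that produces the positive result regardless of the truth, the likelihood ratio you can extract is capped near 1/q, no matter how small the reported p-value is:

```python
import math

def bits_of_evidence(p, flaw_prob=0.0):
    """Evidence in bits from a positive result, under a toy model where a
    flawed experiment yields the positive result whether or not the hypothesis
    is true: likelihood ratio ~ 1 / (flaw_prob + (1 - flaw_prob) * p)."""
    return math.log2(1 / (flaw_prob + (1 - flaw_prob) * p))

print(f"{bits_of_evidence(0.001):.1f}")           # 10.0 bits, the naive reading
print(f"{bits_of_evidence(0.001, 0.01):.1f}")     # 6.5 bits with a 1% flaw chance
print(f"{bits_of_evidence(0.05, 0.01):.1f}")      # 4.1 bits for one p < 0.05 study
print(f"{3 * bits_of_evidence(0.05, 0.01):.1f}")  # 12.2 bits for three such studies
```

Summing the three studies’ bits assumes their potential flaws are independent, which is exactly why the last sentence above requires each experiment to adjust for different sources of error.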
One too many zeros in the p-value there. The 1,000 figure matches p < 0.001, which is also what Anders mentioned. (So your point is fine.)
Thanks