The realities of scientific publishing are unfortunate (and yes, I know of efforts to ameliorate the problem in medical research). If people published all their research (“We did 50 runs with the following parameters, all failed, sure #39 showed statistical significance but we don’t believe it”) I would have zero problems with it. But that’s not how the world currently works.
That would be a better world. But in this world, it would still be true that there is no universal, absolute, optimal percentage of experiments failing to replicate, and the optimal percentage is set by decision-theoretic/economic concerns.
Experiments that fail to replicate at percentages greater than those expected from published confidence values (say, posterior probabilities) are evidence that the published confidence values are wrong.
A research process that consistently produces wrong confidence values has serious problems.
Experiments that fail to replicate at percentages greater than those expected from published confidence values (say, posterior probabilities) are evidence that the published confidence values are wrong.
How would you know? People do not produce posterior probabilities or credible intervals; they produce confidence intervals and p-values.
I don’t see how this point helps you. Either the p-values in the papers are worthless in the sense of not reflecting the probability that the observed effect is real, in which case the issue in the parent post stands.
Or the p-values, while not perfect, do reflect the probability the effect is real, in which case they are falsified by the replication rates, and the issue in the parent post again stands.
Either the p-values in the papers are worthless in the sense of not reflecting the probability that the observed effect is real
p-values do not reflect the probability that the observed effect is real but something closer to the inverse (the probability of data this extreme if there were no real effect), and no one has ever claimed that they do, so we can safely dismiss this entire line of thought.
Or the p-values, while not perfect, do reflect the probability the effect is real
p-values can, with some assumptions and choices, be used to calculate other things like positive predictive value/PPV, which are more meaningful. However, the issue still stands. Suppose a field’s studies have a PPV of 20%. Is this good or bad? I don’t know—it depends on the uses you intend to put it to and the loss function on the results.
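To make that concrete, here is a minimal sketch in Python (with invented numbers for the base rate of true hypotheses and for power) of the standard PPV calculation: the share of ‘significant’ findings that are real is driven by the field’s base rate, its power, and its significance threshold, not by the p-values themselves, and whether the resulting figure is acceptable is still a separate decision-theoretic question.

```python
# Minimal sketch: how a field-wide PPV falls out of base rate, power, and alpha.
# Uses the standard formula (e.g. Ioannidis 2005):
#   PPV = (power * prior) / (power * prior + alpha * (1 - prior))
# The specific numbers below are illustrative assumptions, not data.

def ppv(prior: float, power: float, alpha: float = 0.05) -> float:
    """Fraction of 'significant' findings that reflect a real effect."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# A field testing mostly long-shot hypotheses with modest power:
print(ppv(prior=0.05, power=0.35))   # ~0.27: most published hits are noise
# A field testing well-motivated hypotheses with decent power:
print(ppv(prior=0.50, power=0.80))   # ~0.94: most published hits are real
```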
Maybe it would be helpful if I put this in Bayesian terms, where the quantities are more meaningful & easier to understand. Suppose an experiment turns in a posterior with 80% of the distribution >0. Subsequent experiments or additional data collection will agree with and ‘replicate’ this result the obvious amount.
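As a sketch of what ‘the obvious amount’ means, assuming a normal posterior and a same-design replication (all numbers below are illustrative, not from any real study): the posterior predictive distribution says how often the next estimate should land on the same side of zero, which here comes out a little below the 80% posterior probability once the replication’s own sampling noise is included.

```python
# Minimal sketch (illustrative assumptions throughout): if the posterior for
# the effect is Normal and puts 80% of its mass above 0, the posterior
# predictive distribution says how often a same-design replication's estimate
# should land above 0 too.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

post_sd = 1.0
post_mean = norm.ppf(0.80) * post_sd   # ~0.84, so P(effect > 0) = 0.80
sigma, n = 3.0, 9                      # assumed per-observation noise and replication sample size

# Monte Carlo over the posterior: draw a plausible true effect, then simulate a replication estimate.
true_effects = rng.normal(post_mean, post_sd, size=200_000)
rep_estimates = rng.normal(true_effects, sigma / np.sqrt(n))
print("simulated agreement rate:", (rep_estimates > 0).mean())

# Closed form for the same quantity, P(replication estimate > 0); both come out near 0.72 here,
# a bit under the 80% posterior because the replication adds its own sampling noise.
print("predicted agreement rate:", norm.cdf(post_mean / np.hypot(post_sd, sigma / np.sqrt(n))))
```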
Now, was this experiment ‘underpowered’ (it collected too little data and is bad) or ‘overpowered’ (too much and inefficient/unethical) or just right? Was this field too tolerant of shoddy research practices in producing that result?
Well, if the associated loss function puts a high penalty on true values being <0 (because the cancer drugs have nasty side-effects, are expensive, and only somewhat improve on the other drugs), then it was probably underpowered; if the loss function is small (because it was a website A/B test and you lose little by picking a worse variant), then it was probably overpowered, because you spent more traffic/samples than you needed to choose a variant.
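A toy version of that trade-off, with invented loss numbers: under the usual expected-loss rule you act only when the posterior probability exceeds loss(false positive) / (loss(false positive) + loss(false negative)), so the same 80% posterior is inadequate for the cancer-drug case and more than enough for the A/B test.

```python
# Minimal sketch of the decision-theoretic reading (losses are made-up units):
# with P(effect > 0) = 0.80 in hand, whether the study was under- or over-powered
# depends on the loss of acting on a false positive vs. the loss of passing up a true positive.

def act_threshold(loss_false_positive: float, loss_false_negative: float) -> float:
    """Posterior probability above which acting has lower expected loss than not acting."""
    return loss_false_positive / (loss_false_positive + loss_false_negative)

p = 0.80  # posterior probability the effect is > 0

# Cancer drug: acting on a non-effect is very costly (side-effects, expense).
print(act_threshold(20, 1))   # ~0.95 > 0.80 -> not enough evidence yet: underpowered
# Website A/B test: shipping the worse variant costs about as much as skipping the better one.
print(act_threshold(1, 1))    # 0.50 < 0.80 -> already decidable: arguably overpowered
```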
The ‘replication crises’ are a ‘crisis’ in part because people are basing meaningful decisions on the results to an extent that cannot be justified if one were to explicitly go through a Bayesian & decision theory analysis with informative data. eg pharmacorps probably should not be spending millions of dollars to buy and do preliminary trials on research which is not much distinguishable from noise, as they have learned to their intense frustration & financial cost, to say nothing of diet research. If the results did not matter to anyone, then it would not be a big deal if the PPV were 5% rather than 50%: the researchers would cope, and other people would not make costly suboptimal decisions.
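For instance (all figures invented), a one-line expected-value check shows why the same PPV gap is a shrug for idle readers but real money for a pharmacorp deciding whether to buy and follow up on a published finding:

```python
# Minimal sketch with invented numbers: why a low PPV only becomes a "crisis"
# when someone pays to act on the findings.

def expected_value_of_followup(ppv: float, followup_cost: float, payoff_if_real: float) -> float:
    """Expected net value of buying/running a follow-up trial on one published finding."""
    return ppv * payoff_if_real - followup_cost

cost, payoff = 5.0, 30.0   # assumed: $5M follow-up, $30M payoff if the effect is real

print(expected_value_of_followup(0.05, cost, payoff))  # -3.5: at PPV 5%, following up burns money
print(expected_value_of_followup(0.50, cost, payoff))  # +10.0: at PPV 50%, following up pays
```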
There is no single replication rate which is ideal for cancer trials and GWASes and individual differences psychology research and taxonomy and ecology and schizophrenia trials and...