the readers of scientific papers are expected to understand that results significant to p=0.05 will be wrong around 5% of the times, more or less
And this is the base rate neglect. It’s not “results significant to p=0.05 will be wrong about 5% of the time”. It’s “wrong results will be significant to p=0.05 about 5% of the time”. And most people confuse these two things.
It’s like when people confuse “A ⇒ B” with “B ⇒ A”, only this time it is “A ⇒ B (p=0.05)” versus “B ⇒ A (p=0.05)”. It is “if wrong, then significant in 5% of cases”. It is not “if significant, then wrong in 5% of cases”.
Notice how you had to do a great deal of handwaving in establishing your prior (aka the base rate).
Yes, you are right. Establishing the prior is pretty difficult, perhaps impossible. (But that does not make “A ⇒ B” equal to “B ⇒ A”.) Probably the reasonable thing to do would be simply to impose strict limits in areas where many results were proved wrong.
Probably the reasonable thing to do would be simply to impose strict limits in areas where many results were proved wrong.
Um, what “strict limits” are you talking about, what will they look like, and who will be doing the imposing?
To get back to my example, let’s say I’m running experiments to check if the tincture made from the bark of a certain tree helps with acne—what strict limits would you like?
p = 0.001, and if at the end of the year too many results fail to replicate, keep decreasing it. (Let’s say that “fail to replicate” in this context means that the replication attempt cannot confirm the result even at p = 0.05 -- we don’t want to make replications too expensive, just a simple sanity check.)
let’s say I’m running experiments to check if the tincture made from the bark of a certain tree helps with acne—what strict limits would you like?
a long answer would involve a lot of handwaving again (it depends on why you believe the bark is helpful; in other words, what other evidence you already have)

a short answer: for example, p = 0.001
Well, and what’s magical about this particular number? Why not p=0.01? why not p=0.0001? Confidence thresholds are arbitrary, do you have a compelling argument why any particular one is better than the rest?
Besides, you’re forgetting the costs. Assume that the reported p-values are true (and not the result of selection bias, etc.). Take a hundred papers which claim results at p=0.05. At the asymptote about 95 of them will turn out to be correct and about 5 will turn out to be false. By your strict criteria you’re rejecting all of them—you’re rejecting 95 correct papers. There is a cost to that, is there not?
Lumifer, please update that at this moment you don’t grok the difference between “A ⇒ B (p=0.05)” and “B ⇒ A (p = 0.05)”, which is why you don’t understand what p-value really means, which is why you don’t understand the difference between selection bias and base rate neglect, which is probably why the emphasis on using Bayes theorem in scientific process does not make sense to you. You made a mistake, that happens to all of us. Just stop it already, please.
And don’t feel bad about it. Until recently I didn’t understand it either, and I had a gold medal from the International Mathematical Olympiad. Somehow it is not explained correctly at most schools, perhaps because the teachers don’t get it themselves, or maybe they just underestimate the difficulty of proper understanding and the high chance of getting it wrong. So please don’t contribute to the confusion.
Imagine that there are 1000 possible hypotheses, among which 999 are wrong, and 1 is correct. (That’s just a random example to illustrate the concept. The numbers in real life can be different.) You have an experiment that says “yes” to 5% of the wrong hypotheses (this is what p=0.05 means), and also to the correct hypothesis. So at the end, you have 50 wrong hypotheses and 1 correct hypothesis confirmed by the experiment. So in the journal, 98% of the published articles would be wrong, not 5%. It is “wrong ⇒ confirmed (p=0.05)”, not “confirmed ⇒ wrong (p=0.05)”.
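Viliam_Bur’s 1000-hypotheses example is easy to check with a quick simulation. The numbers here are the toy assumptions from his comment (999 wrong hypotheses, 1 true one, and a test that always detects the true one):

```python
import random

random.seed(0)

N_WRONG, ALPHA = 999, 0.05

# Each of the 999 wrong hypotheses passes the test with probability ALPHA
# (that is what p = 0.05 buys you); assume, optimistically, that the one
# true hypothesis always passes.
false_positives = sum(random.random() < ALPHA for _ in range(N_WRONG))
true_positives = 1

published = false_positives + true_positives
share_wrong = false_positives / published
print(f"confirmed: {false_positives} wrong, {true_positives} correct")
print(f"share of confirmed results that are wrong: {share_wrong:.0%}")
```

Around 50 of the 999 wrong hypotheses get “confirmed”, so nearly all of the published results are wrong, even though every single test honestly used p = 0.05.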
LOL. Yeah, yeah, mea culpa, I had a brain fart and expressed myself very poorly.
I do understand what p-value really means. The issue was that I had in mind a specific scenario (where in effect you’re trying to see if the difference in means between two groups is significant) but neglected to mention it in the post :-)
Lumifer, please update that at this moment you don’t grok the difference between “A ⇒ B (p=0.05)” and “B ⇒ A (p = 0.05)”, which is why you don’t understand what p-value really means, which is why you don’t understand the difference between selection bias and base rate neglect, which is probably why the emphasis on using Bayes theorem in scientific process does not make sense to you. You made a mistake, that happens to all of us. Just stop it already, please.
I feel like this could use a bit longer explanation, especially since I think you’re not hearing Lumifer’s point, so let me give it a shot. (I’m not sure I see a meaningful difference between base rate neglect and selection bias in this circumstance.)
The word “grok” in Viliam_Bur’s comment is really important. This part of the grandparent is true:
Assume that the reported p-values are true (and not the result of selection bias, etc.). Take a hundred papers which claim results at p=0.05. At the asymptote about 95 of them will turn out to be correct and about 5 will turn out to be false.
But it’s like saying “well, assume the diagnosis is correct. Then the treatment will make the patient better with high probability.” While true, it’s totally out of touch with reality- we can’t assume the diagnosis is correct, and a huge part of being a doctor is responding correctly to that uncertainty.
Earlier, Lumifer said this, which is an almost correct explanation of using Bayes in this situation:
But that all is fine—the readers of scientific papers are expected to understand that results significant to p=0.05 will be wrong around 5% of the times, more or less (not exactly because the usual test measures P(D|H), the probability of the observed data given the (null) hypothesis while you really want P(H|D), the probability of the hypothesis given the data).
The part that makes it the “almost” is the “5% of the times, more or less.” This implies that it’s centered around 5%, with random chance determining what this instance is. But selection bias means it will almost certainly be more, and generally much more. In fields that study phenomena that don’t exist, 100% of the papers published will be false results that were significant by chance. In many real fields, rates of failure to replicate are around 30%. Describing 30% as “5%, more or less” seems odd, to say the least.
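How far the share of wrong significant results drifts from 5% is a one-line Bayes calculation. In this sketch ALPHA is the usual threshold and POWER is an assumed, illustrative detection probability for real effects; only the base rate varies:

```python
ALPHA = 0.05   # significance threshold
POWER = 0.80   # assumed chance of detecting a real effect (illustrative)

def false_finding_rate(true_fraction: float) -> float:
    """Expected fraction of significant results that are false positives."""
    false_pos = ALPHA * (1 - true_fraction)   # wrong hypotheses that pass
    true_pos = POWER * true_fraction          # right hypotheses that pass
    return false_pos / (false_pos + true_pos)

for pi in (0.0, 0.1, 0.5):
    print(f"{pi:.0%} of tested hypotheses true -> "
          f"{false_finding_rate(pi):.0%} of significant results wrong")
```

With no true effects in the field, 100% of significant results are wrong; with a 10% base rate of true hypotheses the figure is about a third, which is roughly the failure-to-replicate rate seen in practice.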
But the proposal to reduce the p value doesn’t solve the underlying problem (which was Lumifer’s response). If we set the p value threshold lower, at .01 or .001 or wherever, we reduce the risk of false positives at the cost of increasing the risk of false negatives. A study design which needs to determine an effect at the .001 level is much more expensive than one which needs to determine an effect at the .05 level, and so we will have many fewer studies attempted, and far fewer published studies.
Better to drop p entirely. Notice that stricter p thresholds go in the opposite direction as the publication of negative results, which is the real solution to the problem of selection bias. By calling for stricter p thresholds, you implicitly assume that p is a worthwhile metric, when what we really want is publication of negative results and more replications.
But it’s like saying “well, assume the diagnosis is correct. Then the treatment will make the patient better with high probability.” While true, it’s totally out of touch with reality
My grandparent post was stupid, but what I had in mind was basically a phase-2 (or -3) drug trial situation. You have declared (at least to the FDA) that you’re running a trial, so selection bias does not apply at this stage. You have two groups, one receives the experimental drug, one receives a placebo. Assume a double-blind randomized scenario and assume there is a measurable metric of improvement at the end of the trial.
After the trial you have two groups with two empirical distributions of the metric of choice. The question is how confident you are that these two distributions are different.
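For this two-group question, a permutation test is a simple way to quantify that confidence without any distributional machinery. The improvement scores below are made-up numbers purely for illustration, not real trial data:

```python
import random

random.seed(1)

# Hypothetical improvement scores -- invented for illustration.
drug    = [5.1, 4.8, 6.2, 5.5, 4.9, 6.0, 5.7, 5.3]
placebo = [4.2, 4.6, 4.1, 4.9, 4.4, 4.0, 4.7, 4.3]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(drug) - mean(placebo)

# Under the null hypothesis the group labels are exchangeable, so we
# reshuffle them and count how often a difference this large appears.
pooled = drug + placebo
n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    if abs(mean(pooled[:8]) - mean(pooled[8:])) >= abs(observed):
        extreme += 1

p_value = extreme / n_perm
print(f"observed difference {observed:.2f}, permutation p = {p_value:.4f}")
```

The p-value here is just the fraction of random relabelings that separate the groups as strongly as the actual labels do, which is exactly the “how confident are you that these two distributions differ” question.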
Better to drop p entirely.
Well, as usual, it’s complicated. Yes, the p-test is suboptimal in most situations where it’s used in reality. However, it fulfils a need, and if you drop the test entirely you need a replacement, because the need won’t go away.
Assume that the reported p-values are true (and not the result of selection bias, etc.). Take a hundred papers which claim results at p=0.05. At the asymptote about 95 of them will turn out to be correct...
That’s not how p-values work. p=0.05 doesn’t mean that the hypothesis is 95% likely to be correct, even in principle; it means that there’s a 5% chance of seeing a correlation at least that strong if the null hypothesis is true. Pull a hundred independent data sets and we’d normally expect to find a p=0.05 correlation or better in at least five or so of them, no matter whether we’re testing, say, an association of cancer risk with smoking or with overuse of the word “muskellunge”.
This distinction’s especially important to keep in mind in an environment where running replications is relatively low-status or where negative results tend to be quietly shelved—both of which, as it happens, hold true in large chunks of academia. But even if this weren’t the case, we’d normally expect replication rates to be less than one minus the claimed p-value, simply because there are many more promising ideas than true ones and some of those will turn up false positives.
Take a hundred papers which claim results at p=0.05. At the asymptote about 95 of them will turn out to be correct and about 5 will turn out to be false.
No, they won’t. You’re committing base rate neglect. It’s entirely possible for people to publish 2000 papers in a field where there’s no hope of finding a true result, and get 100 false results with p < 0.05.