“Not Even Scientists Can Easily Explain P-values”

It’s not their fault, said Steven Goodman, co-director of METRICS. Even after spending his “entire career” thinking about p-values, he said he could tell me the definition, “but I cannot tell you what it means, and almost nobody can.” Scientists regularly get it wrong, and so do most textbooks, he said. When Goodman speaks to large audiences of scientists, he often presents correct and incorrect definitions of the p-value, and they “very confidently” raise their hand for the wrong answer. “Almost all of them think it gives some direct information about how likely they are to be wrong, and that’s definitely not what a p-value does,” Goodman said.
Okay, stupid question :-/

“Almost all of them think it gives some direct information about how likely they are to be wrong, and that’s definitely not what a p-value does...”

But

“...the technical definition of a p-value — the probability of getting results at least as extreme as the ones you observed, given that the null hypothesis is correct...”

Aren’t these basically the same? Can’t you paraphrase them both as “the probability that you would get this result if your hypothesis was wrong”? Am I failing to understand what they mean by ‘direct information’? Or am I being overly binary in assuming that the hypothesis and the null hypothesis are the only two possibilities?
What p-values actually mean:
How likely is it that you’d get a result this impressive just by chance if the effect you’re looking for isn’t actually there?
What they’re commonly taken to mean:
How likely is it, given the impressiveness of the result, that the effect you’re looking for is actually there?
That is, p-values measure Pr(observations | null hypothesis) whereas what you want is more like Pr(alternative hypothesis | observations).
(Actually, what you want is more like a probability distribution for the size of the effect—that’s the “overly binary” thing—but never mind that for now.)
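Here’s the gap as a quick simulation sketch (the base rate, power, and significance cutoff below are numbers I’ve invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented numbers, for illustration only: suppose 1 in 100 tested
# effects is real, tests run at the 0.05 level, and a real effect is
# detected 80% of the time.
base_rate = 0.01   # Pr(effect is real) before seeing any data
alpha = 0.05       # Pr(result at least this extreme | null): the p-value cutoff
power = 0.80       # Pr(significant result | effect is real)

n = 1_000_000
real = rng.random(n) < base_rate                 # which tested effects are real
significant = np.where(real,
                       rng.random(n) < power,    # real effects detected at rate `power`
                       rng.random(n) < alpha)    # nulls cross the cutoff at rate `alpha`

# By construction Pr(significant | null) is 0.05, but the thing people
# actually want, Pr(effect is real | significant), is nowhere near 0.95:
print(real[significant].mean())   # ~0.14 with these made-up numbers
```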
So what are the relevant differences between these?
If your null hypothesis and alternative hypothesis are one another’s negations (as they’re supposed to be) then you’re looking at the relationship between Pr(A|B) and Pr(B|A). These are famously related by Bayes’ theorem, but they are certainly not the same thing. We have Pr(A|B) = Pr(A&B)/Pr(B) and Pr(B|A) = Pr(A&B)/Pr(A) so the ratio between the two is the ratio of probabilities of A and B. So, e.g., suppose you are interested in ESP and you do a study on precognition or something whose result has a p-value of 0.05. If your priors are like mine, your estimate of Pr(precognition) will still be extremely small because precognition is (in advance of the experimental evidence) much more unlikely than just randomly getting however many correct guesses it takes to get a p-value of 0.05.
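In code, the ESP arithmetic looks like this (a sketch: the 0.05 comes from the example, but the prior and the likelihood under precognition are arbitrary stand-ins):

```python
# Bayes' theorem for the ESP example. A = "a result at least this extreme",
# B = "precognition is real".
prior = 1e-20             # Pr(B): a skeptic's prior, chosen arbitrarily
pr_A_given_null = 0.05    # Pr(A | not B): what the p-value reports
pr_A_given_precog = 0.5   # Pr(A | B): assumed, for illustration

pr_A = pr_A_given_precog * prior + pr_A_given_null * (1 - prior)
posterior = pr_A_given_precog * prior / pr_A   # Pr(B | A), by Bayes' theorem
print(posterior)          # ~1e-19: the prior ratio swamps the p = 0.05
```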
In practice, the null hypothesis is usually something like “X = Y” or “X ≤ Y”, and the alternative is then “X ≠ Y” or “X > Y”. But what you actually care about is that X and Y are substantially unequal, or that X is substantially bigger than Y, and that’s probably the alternative you actually have in mind even if you’re doing statistical tests that just accept or reject the null hypothesis. So a small p-value may come from a very carefully measured difference that’s too small to care about. E.g., suppose that before you do your precognition study you think (for whatever reason) that precog is about as likely to be real as not. Then after the study results come in, you should in fact think it’s probably real. But if you then think “aha, time to book my flight to Las Vegas” you may be making a terrible mistake even if you’re right about precognition being real. Because maybe your study looked at someone predicting a million die rolls and they got 500 more right than you’d expect by chance; that would be very exciting scientifically, but probably useless for casino gambling because it’s not enough to outweigh the house’s advantage.
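Plugging those made-up die-roll numbers in (again a sketch; the “typical house edge” figure is my assumption, not part of the example):

```python
import math

n = 1_000_000       # predictions of a fair six-sided die
p_chance = 1 / 6    # hit rate from pure guessing
excess = 500        # extra correct guesses beyond the chance expectation

sd = math.sqrt(n * p_chance * (1 - p_chance))   # binomial standard deviation
print(f"standard deviations above chance: {excess / sd:.2f}")   # ~1.34

# The gambler's edge this implies is tiny: 500 extra hits in a million
# guesses is 0.05 percentage points of extra accuracy, versus a typical
# house edge on the order of 1% (assumed figure).
print(f"implied edge: {excess / n:.2%}")         # 0.05%
```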
[EDITED to fix a typo and clarify a bit.]
Thank you—I get it now.
Why not? Most people misunderstand it, but in the frequentist framework its actual meaning is quite straightforward.

Not at all. To quote Andrew Gelman,

“The p-value is a strange nonlinear transformation of data that is only interpretable under the null hypothesis. Once you abandon the null (as we do when we observe something with a very low p-value), the p-value itself becomes irrelevant.”

Also see more of Gelman on the same topic.
A definition is not a meaning, in the same way the meaning of a hammer is not ‘a long piece of metal with a round bit at one end’.
Everyone with a working memory can define a p-value, as indeed Goodman and the others can, but what does it mean?
What kind of answer, other than philosophical deepities, would you expect in response to “...but what does it mean”? Meaning almost entirely depends on the subject and the context.
Is the meaning of a hammer describing its role and use, as opposed to a mere definition describing some physical characteristics, really a ‘philosophical deepity’?
When you mumble some jargon about ‘the frequency of a class of outcomes in sampling from a particular distribution’, you may have defined a p-value, but you have not given a meaning. It is numerology if left there, some gematriya played with distributions. You have not given any reason to care whatsoever about this particular arbitrary construct or explained what a p=0.04 vs a 0.06 means or why any of this is important or what you should do upon seeing one p-value rather than another or explained what other people value about it or how it affects beliefs about anything. (Maybe you should go back and reread the Sequences, particularly the ones about words.)
Just like you don’t accept the definition as an adequate substitute for meaning, I don’t see why “role and use” would be an adequate substitute either.
As I mentioned, meaning critically depends on the subject and the context. Sometimes the meaning of the p-value boils down to “We can publish that”. Or maybe “There doesn’t seem to be anything here worth investigating further”. But in the general case it depends, and that is fine. That context dependence is not a special property of the p-value, though.
I’ll again refer you to the Sequences. I think Eliezer did an excellent job explaining why definitions are so inadequate and why role and use are the adequate substitutes.
And if these experts, who (unusually) are entirely familiar with the brute definition and don’t misinterpret it as something it is not, cannot explain any use of p-values without resorting to shockingly crude and unacceptable contextual explanations like ‘we need this numerology to get published’, then it’s time to consider whether p-values should be used at all for any purpose—much less their current use as the arbiters of scientific truth.
Which is much the point of that quote, and of all the citations I have so exhaustively compiled in this post.
I think we’re talking past each other.
Tap.