“Not Even Scientists Can Easily Explain P-values”

It’s not their fault, said Steven Goodman, co-director of METRICS. Even after spending his “entire career” thinking about p-values, he said he could tell me the definition, “but I cannot tell you what it means, and almost nobody can.” Scientists regularly get it wrong, and so do most textbooks, he said. When Goodman speaks to large audiences of scientists, he often presents correct and incorrect definitions of the p-value, and they “very confidently” raise their hand for the wrong answer. “Almost all of them think it gives some direct information about how likely they are to be wrong, and that’s definitely not what a p-value does,” Goodman said.
Okay, stupid question :-/

“Almost all of them think it gives some direct information about how likely they are to be wrong, and that’s definitely not what a p-value does...”

But

“...the technical definition of a p-value — the probability of getting results at least as extreme as the ones you observed, given that the null hypothesis is correct...”

Aren’t these basically the same? Can’t you paraphrase them both as “the probability that you would get this result if your hypothesis was wrong”? Am I failing to understand what they mean by ‘direct information’? Or am I being overly binary in assuming that the hypothesis and the null hypothesis are the only two possibilities?
What p-values actually mean:
How likely is it that you’d get a result this impressive just by chance if the effect you’re looking for isn’t actually there?
What they’re commonly taken to mean:
How likely is it, given the impressiveness of the result, that the effect you’re looking for is actually there?
That is, p-values measure Pr(observations | null hypothesis) whereas what you want is more like Pr(alternative hypothesis | observations).
(Actually, what you want is more like a probability distribution for the size of the effect—that’s the “overly binary” thing—but never mind that for now.)
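Here’s the gap as a quick simulation sketch (the base rate, power, and significance cutoff below are numbers I’ve invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented numbers, for illustration only: suppose 1 in 100 tested
# effects is real, tests run at the 0.05 level, and a real effect is
# detected 80% of the time.
base_rate = 0.01   # Pr(effect is real) before seeing any data
alpha = 0.05       # Pr(result at least this extreme | null): the p-value cutoff
power = 0.80       # Pr(significant result | effect is real)

n = 1_000_000
real = rng.random(n) < base_rate                 # which tested effects are real
significant = np.where(real,
                       rng.random(n) < power,    # real effects detected at rate `power`
                       rng.random(n) < alpha)    # nulls cross the cutoff at rate `alpha`

# By construction Pr(significant | null) is 0.05, but the thing people
# actually want, Pr(effect is real | significant), is nowhere near 0.95:
print(real[significant].mean())   # ~0.14 with these made-up numbers
```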
So what are the relevant differences between these?
If your null hypothesis and alternative hypothesis are one another’s negations (as they’re supposed to be) then you’re looking at the relationship between Pr(A|B) and Pr(B|A). These are famously related by Bayes’ theorem, but they are certainly not the same thing. We have Pr(A|B) = Pr(A&B)/Pr(B) and Pr(B|A) = Pr(A&B)/Pr(A) so the ratio between the two is the ratio of probabilities of A and B. So, e.g., suppose you are interested in ESP and you do a study on precognition or something whose result has a p-value of 0.05. If your priors are like mine, your estimate of Pr(precognition) will still be extremely small because precognition is (in advance of the experimental evidence) much more unlikely than just randomly getting however many correct guesses it takes to get a p-value of 0.05.
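In code, the ESP arithmetic looks like this (a sketch: the 0.05 comes from the example, but the prior and the likelihood under precognition are arbitrary stand-ins):

```python
# Bayes' theorem for the ESP example. A = "a result at least this extreme",
# B = "precognition is real".
prior = 1e-20             # Pr(B): a skeptic's prior, chosen arbitrarily
pr_A_given_null = 0.05    # Pr(A | not B): what the p-value reports
pr_A_given_precog = 0.5   # Pr(A | B): assumed, for illustration

pr_A = pr_A_given_precog * prior + pr_A_given_null * (1 - prior)
posterior = pr_A_given_precog * prior / pr_A   # Pr(B | A), by Bayes' theorem
print(posterior)          # ~1e-19: the prior ratio swamps the p = 0.05
```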
In practice, the null hypothesis is usually something like “X = Y” or “X ≤ Y”, and the alternative is then “X ≠ Y” or “X > Y”. But what you actually care about is that X and Y are substantially unequal, or that X is substantially bigger than Y, and that’s probably the alternative you actually have in mind even if you’re doing statistical tests that just accept or reject the null hypothesis. So a small p-value may come from a very carefully measured difference that’s too small to care about. E.g., suppose that before you do your precognition study you think (for whatever reason) that precog is about as likely to be real as not. Then after the study results come in, you should in fact think it’s probably real. But if you then think “aha, time to book my flight to Las Vegas” you may be making a terrible mistake even if you’re right about precognition being real. Because maybe your study looked at someone predicting a million die rolls and they got 500 more right than you’d expect by chance; that would be very exciting scientifically, but probably useless for casino gambling because it’s not enough to outweigh the house’s advantage.
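Plugging those made-up die-roll numbers in (again a sketch; the “typical house edge” figure is my assumption, not part of the example):

```python
import math

n = 1_000_000       # predictions of a fair six-sided die
p_chance = 1 / 6    # hit rate from pure guessing
excess = 500        # extra correct guesses beyond the chance expectation

sd = math.sqrt(n * p_chance * (1 - p_chance))   # binomial standard deviation
print(f"standard deviations above chance: {excess / sd:.2f}")   # ~1.34

# The gambler's edge this implies is tiny: 500 extra hits in a million
# guesses is 0.05 percentage points of extra accuracy, versus a typical
# house edge on the order of 1% (assumed figure).
print(f"implied edge: {excess / n:.2%}")         # 0.05%
```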
[EDITED to fix a typo and clarify a bit.]
Thank you—I get it now.
Why not? Most people misunderstand it, but in the frequentist framework its actual meaning is quite straightforward.

Not at all. To quote Andrew Gelman,

“The p-value is a strange nonlinear transformation of data that is only interpretable under the null hypothesis. Once you abandon the null (as we do when we observe something with a very low p-value), the p-value itself becomes irrelevant.”

Also see more of Gelman on the same topic.
A definition is not a meaning, in the same way the meaning of a hammer is not ‘a long piece of metal with a round bit at one end’.
Everyone with a working memory can define a p-value, as indeed Goodman and the others can, but what does it mean?
What kind of answer, other than philosophical deepities, would you expect in response to “...but what does it mean”? Meaning almost entirely depends on the subject and the context.
Is the meaning of a hammer describing its role and use, as opposed to a mere definition describing some physical characteristics, really a ‘philosophical deepity’?
When you mumble some jargon about ‘the frequency of a class of outcomes in sampling from a particular distribution’, you may have defined a p-value, but you have not given a meaning. It is numerology if left there, some gematriya played with distributions. You have not given any reason to care whatsoever about this particular arbitrary construct or explained what a p=0.04 vs a 0.06 means or why any of this is important or what you should do upon seeing one p-value rather than another or explained what other people value about it or how it affects beliefs about anything. (Maybe you should go back and reread the Sequences, particularly the ones about words.)
Just like you don’t accept the definition as an adequate substitute for meaning, I don’t see why “role and use” would be an adequate substitute either.
As I mentioned, meaning critically depends on the subject and the context. Sometimes the meaning of the p-value boils down to “We can publish that”. Or maybe “There doesn’t seem to be anything here worth investigating further”. But in the general case it depends, and that is fine. That context dependence is not a special property of the p-value, though.
I’ll again refer you to the Sequences. I think Eliezer did an excellent job explaining why definitions are so inadequate and why role and use are the adequate substitutes.
And if these experts, who (unusually) are entirely familiar with the brute definition and don’t misinterpret it as something it is not, cannot explain any use of p-values without resorting to shockingly crude and unacceptable contextual explanations like ‘we need this numerology to get published’, then it’s time to consider whether p-values should be used at all for any purpose—much less their current use as the arbiters of scientific truth.
Which is much the point of that quote, and of all the citations I have so exhaustively compiled in this post.
I think we’re talking past each other.
Tap.