Also, different studies have different statistical power, so it may not be OK to simply add up their evidence with equal weights.
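One standard way to take that into account is to weight each study by the inverse of its variance rather than equally; a toy sketch in Python, with all the numbers invented purely for illustration:

    import numpy as np

    # Toy example: three studies estimate the same effect with different precision.
    effects = np.array([0.10, 0.25, 0.02])    # invented point estimates
    std_errs = np.array([0.20, 0.05, 0.10])   # invented standard errors

    equal_weight = effects.mean()
    w = 1.0 / std_errs**2                     # inverse-variance weights
    precision_weighted = (w * effects).sum() / w.sum()

    print(round(equal_weight, 3))             # 0.123
    print(round(precision_weighted, 3))       # about 0.199, pulled toward the most precise study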
p-values are supposed to be distributed uniformly from 0 to 1 conditional on the null hypothesis being true.
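A quick simulated check of that claim (z-tests with a true null and known SD; the sample size and seed are arbitrary):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, trials = 50, 100_000

    # Many studies in which the null hypothesis (mean = 0, SD = 1) really is true.
    samples = rng.normal(loc=0.0, scale=1.0, size=(trials, n))
    z = samples.mean(axis=1) * np.sqrt(n)        # z-statistic for each study
    p = 2 * (1 - norm.cdf(np.abs(z)))            # two-sided p-values

    # Under the null the p-values should be roughly uniform on [0, 1]:
    # each bin of width 0.05 should hold about 5% of them.
    hist, _ = np.histogram(p, bins=20, range=(0.0, 1.0))
    print(hist / trials)        # each entry close to 0.05
    print((p < 0.05).mean())    # and about 5% of studies reject at the 0.05 level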
No; it’s standard to set the threshold for your statistical test for 95% confidence. Studies with larger samples can detect smaller differences between groups with that same statistical power.
“Power” is a statistical term of art, and its technical meaning is neither 1 - alpha nor 1 - p.
Oops; you’re right. Careless of me; fixed.
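For what it’s worth, here is a rough sketch of what “power” does mean in the term-of-art sense (the probability of rejecting when a real effect exists), and of the claim that larger samples detect smaller effects; the effect size and sample sizes are invented for illustration:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    alpha = 0.05
    true_effect = 0.3                    # a smallish true difference, in units of the (known) SD
    z_crit = norm.ppf(1 - alpha / 2)     # two-sided 5% threshold, about 1.96

    def simulated_power(n, trials=50_000):
        """Fraction of simulated studies of size n that reject the null
        when the true effect is true_effect; that fraction is the power."""
        data = rng.normal(loc=true_effect, scale=1.0, size=(trials, n))
        z = data.mean(axis=1) * np.sqrt(n)      # z-statistic with known SD = 1
        return (np.abs(z) > z_crit).mean()

    for n in (20, 50, 100, 200):
        print(n, round(simulated_power(n), 2))
    # Power rises with n at the same alpha = 0.05: bigger studies detect the same
    # small effect more reliably. Note that power is neither 1 - alpha nor 1 - p;
    # it depends on the true effect size and the sample size.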
It’s times like this that I wish Doctor Seuss was a mathematician (or statistician in this case). If they were willing to make up new words, we’d be able to talk without accidentally using jargon that has technical meaning we didn’t intend.
I’m confused about how this works.
Suppose the standard were to use 80% confidence. Would it still be surprising to see 60 of 60 studies agree that A and B were not linked? Suppose the standard were to use 99% confidence. Would it still be surprising to see 60 of 60 studies agree that A and B were not linked?
Also, doesn’t the prior plausibility of the connection being tested matter for attempts to detect experimenter bias this way? E.g., for any given convention about confidence intervals, shouldn’t we be quicker to infer experimenter bias when a set of studies conclude (1) that there is no link between eating lithium batteries and suffering brain damage vs. when a set of studies conclude (2) that there is no link between eating carrots and suffering brain damage?
“95% confidence” means “I am testing whether X is linked to Y. I know that the data might randomly conspire against me to make it look as if X is linked to Y. I’m going to look for an effect so large that, if there is no link between X and Y, the data will conspire against me only 5% of the time to look as if there is. If I don’t see an effect at least that large, I’ll say that I failed to show a link between X and Y.”
If you went for 80% confidence instead, you’d be looking for an effect that wasn’t quite as big. You’d be able to detect smaller clinical effects—for instance, a drug that has a small but reliable effect—but if there were no effect, you’d be fooled by the data 20% of the time into thinking that there was.
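To put rough numbers on the “60 of 60” question, under the simplifying assumptions that there really is no link and that the studies are independent, the chance that every one of 60 studies fails to (falsely) find a link is just (1 - alpha) raised to the 60th power:

    # Probability that all 60 independent studies report "no link"
    # when there really is no link, at a few conventional thresholds.
    for confidence, alpha in ((0.80, 0.20), (0.95, 0.05), (0.99, 0.01)):
        p_all_agree = (1 - alpha) ** 60
        print(f"{confidence:.0%} confidence: P(60/60 find nothing) = {p_all_agree:.2g}")

    # 80% confidence: about 1.5e-06 -- unanimous agreement would be astonishing
    # 95% confidence: about 0.046   -- mildly surprising; ~3 false positives expected
    # 99% confidence: about 0.55    -- not surprising at all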
Also, doesn’t the prior plausibility of the connection being tested matter for attempts to detect experimenter bias this way?

It would if the papers claimed to find a connection. When they claim not to find a connection, I think not. Suppose people decided to test the hypothesis that stock market crashes are caused by the Earth’s distance from Mars. They would gather data on Earth’s distance from Mars, and on movements in the stock market, and look for a correlation.
If there is no relationship, there should be zero correlation, on average. That (approximately) means that half of all studies will show a negative correlation, and half will show a positive correlation.
They need to pick a number, and say that if they find a positive correlation above that number, they’ve proven that Mars causes stock market crashes. And they pick that number by finding the correlation just exactly large enough that, if there is no relationship, it happens 5% of the time by chance.
If the proposition is very very unlikely, somebody might insist on a 99% confidence interval instead of a 95% confidence interval. That’s how prior plausibility would affect it. Adopting a standard of 95% confidence is really a way of saying we agree not to haggle over priors.
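As a rough sketch of how that cutoff gets picked in the Mars example (simulation only; the number of paired observations is invented):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 120            # say, 120 months of (Mars distance, market move) pairs
    trials = 100_000

    # Many simulated "worlds" in which there is truly no relationship:
    # record the sample correlation that shows up purely by chance.
    x = rng.normal(size=(trials, n))
    y = rng.normal(size=(trials, n))
    x_c = x - x.mean(axis=1, keepdims=True)
    y_c = y - y.mean(axis=1, keepdims=True)
    r = (x_c * y_c).sum(axis=1) / np.sqrt((x_c**2).sum(axis=1) * (y_c**2).sum(axis=1))

    # The cutoff is the correlation exceeded only 5% of the time by pure chance.
    cutoff = np.quantile(r, 0.95)
    print(round(cutoff, 3))       # roughly 0.15 for n = 120
    print((r > cutoff).mean())    # about 0.05 by construction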
I think it is “only at most 5% of the time”.
No, we are choosing the effect size before we do the study. We choose it so that if the true effect is zero, we will have a false positive exactly 5% of the time.
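A minimal sketch of that pre-study choice, assuming (purely for illustration) a one-sample comparison with a known standard deviation of 1:

    from math import sqrt
    from scipy.stats import norm

    alpha = 0.05
    z_crit = norm.ppf(1 - alpha / 2)    # about 1.96 for a two-sided 5% test

    def critical_effect(n, sigma=1.0):
        """Smallest observed effect that will count as significant, chosen before
        the study so that, if the true effect is zero, the data cross it only
        5% of the time by chance."""
        return z_crit * sigma / sqrt(n)

    for n in (25, 100, 400):
        print(n, round(critical_effect(n), 2))
    # n =  25 -> observed effect must exceed about 0.39
    # n = 100 -> about 0.20
    # n = 400 -> about 0.10  (bigger studies can call smaller effects significant)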
How does this work for a binary quantity?
If your experiment tells you that [x > 45] with 99% confidence, you may in certain cases be able to confidently transform that to [x > 60] with 95% confidence.
For example, if your experiment tells you that the mass of the Q particle is 1.5034(42) with 99% confidence, maybe you can say instead that it’s 1.50344(2) with 95% confidence.
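A rough sketch of that kind of continuous-case transformation, assuming the estimate is approximately normal so that the interval half-width just scales with the normal quantile (the numbers are taken from the example above):

    from scipy.stats import norm

    estimate = 1.5034
    half_width_99 = 0.0042            # reported 99% half-width, i.e. 1.5034(42)

    # For a (roughly) normal estimate, the half-width is quantile * standard error,
    # so switching confidence levels just rescales it.
    z99 = norm.ppf(0.995)             # about 2.576
    z95 = norm.ppf(0.975)             # about 1.960
    half_width_95 = half_width_99 * z95 / z99

    print(round(half_width_95, 4))    # about 0.0032, i.e. 1.5034(32) at 95% confidence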
If your experiment happens to tell you that [particle Q exists] is true with 99% confidence, what kind of transformation can you apply to get 95% confidence instead? Discard some of your evidence? Add noise into your sensor readings?
Roll dice before reporting the answer?
We’re not talking about a binary quantity.
According to Wikipedia:

In statistical significance testing, the p-value is the probability of obtaining a test statistic result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.[1][2] A researcher will often “reject the null hypothesis” when the p-value turns out to be less than a predetermined significance level, often 0.05[3][4] or 0.01.
Quoting authorities without further commentary is a dick thing to do. I am going to spend more words speculating about the intention of the quote than are in the quote, let alone than you bothered to type.
I have no idea what you think is relevant about that passage. It says exactly what I said, except transformed from the effect size scale to the p-value scale. But somehow I doubt that’s why you posted it. The most common problem in the comments on this thread is that people confuse false positive rate with false negative rate, so my best guess is that you are making that mistake and thinking the passage supports that error (though I have no idea why you’re telling me). Another possibility, slightly more relevant to this subthread, is that you’re pointing out that some people use other p-values. But in medicine, they don’t. They almost always use 95%, though sometimes 90%.
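For concreteness, a toy illustration (known SD, invented numbers) of that equivalence between the effect-size scale and the p-value scale:

    from math import sqrt
    from scipy.stats import norm

    alpha, sigma, n = 0.05, 1.0, 100
    se = sigma / sqrt(n)
    observed_effect = 0.23            # invented observed difference

    z = observed_effect / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    critical_effect = norm.ppf(1 - alpha / 2) * se

    # The two lines below are the same decision stated on two scales.
    print(p_value < alpha)                          # p-value scale: True here
    print(abs(observed_effect) > critical_effect)   # effect-size scale: also True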
My confusion is about “at least” vs. “exactly”. See my answer to Cyan.
You want size, not p-value. The difference is that size is a “pre-data” (or “design”) quantity, while the p-value is post-data, i.e., data-dependent.
Thanks.
So if I set size at 5%, collect the data, and run the test, and repeat the whole experiment with fresh data multiple times, should I expect that, if the null hypothesis is true, the test rejects exactly 5% of the time, or at most 5% of the time?
If the null hypothesis is simple (that is, if it picks out a single point in the hypothesis space), and the model assumptions are true blah blah blah, then the test (falsely) rejects the null with exactly 5% probability. If the null is composite (comprises a non-singleton subset of parameter space), and there is no nice reduction to a simple null via mathematical tricks like sufficiency or the availability of a pivot, then the test falsely rejects the null with at most 5% probability.
But that’s all very technical; somewhat less technically, almost always, a bootstrap procedure is available that obviates these questions and gets you to “exactly 5%”… asymptotically. Here “asymptotically” means “if the sample size is big enough”. This just throws the question onto “how big is big enough,” and that’s context-dependent. And all of this is about one million times less important than the question of how well each study addresses systematic biases, which is an issue of real, actual study design and implementation rather than mathematical statistical theory.
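A quick simulation of the simple-vs-composite point (illustrative only: a one-sided z-test of the composite null “mean ≤ 0” with known SD; the rejection rate is 5% at the boundary of the null and well below 5% in its interior):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, trials, alpha = 30, 200_000, 0.05
    z_crit = norm.ppf(1 - alpha)       # one-sided threshold, about 1.645

    def rejection_rate(true_mean):
        """How often a one-sided test of H0: mean <= 0 rejects when the data
        really come from a normal distribution with this mean and SD 1."""
        data = rng.normal(loc=true_mean, scale=1.0, size=(trials, n))
        z = data.mean(axis=1) * np.sqrt(n)
        return (z > z_crit).mean()

    print(rejection_rate(0.0))     # boundary of the composite null: about 0.05 exactly
    print(rejection_rate(-0.5))    # interior of the null: essentially 0, i.e. at most 5%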
How does your choice of threshold (made beforehand) affect your actual data and the information about the actual phenomenon contained therein?