In statistical significance testing, the p-value is the probability of obtaining a test statistic result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.[1][2] A researcher will often “reject the null hypothesis” when the p-value turns out to be less than a predetermined significance level, often 0.05[3][4] or 0.01.
Was that “exactly 95% confidence” or “at least 95% confidence”?
Also, different studies have different statistical power, so it may not be OK to simply add up their evidence with equal weights.
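One standard remedy for the unequal-power problem is inverse-variance weighting, as used in meta-analysis. A minimal sketch with made-up study numbers (none of these estimates come from the thread):

```python
# Hypothetical estimates of the same effect from three studies of
# different precision, with their standard errors (illustrative only).
estimates = [0.10, 0.40, 0.25]
std_errors = [0.05, 0.20, 0.10]

# Weight each study by 1 / SE^2 instead of adding them up equally,
# so more precise studies count for more.
weights = [1 / se ** 2 for se in std_errors]
pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5
print(round(pooled, 3), round(pooled_se, 3))  # 0.143 0.044
```

The pooled estimate is pulled toward the most precise study, and the pooled standard error is smaller than any single study's.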
p-values are supposed to be distributed uniformly from 0 to 1 conditional on the null hypothesis being true.
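That uniformity claim can be checked by simulation. A sketch assuming a two-sided z-test with a simple null, so the test statistic is standard normal when the null is true:

```python
import math
import random

def p_value(z):
    # Two-sided p-value for a standard-normal test statistic:
    # p = 2 * (1 - Phi(|z|)), with Phi built from math.erf.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(0)
# Under the null, the test statistic is just a N(0, 1) draw.
ps = [p_value(random.gauss(0, 1)) for _ in range(100_000)]

# If p ~ Uniform(0, 1), the fraction below any cutoff c should be about c.
frac_below_05 = sum(p < 0.05 for p in ps) / len(ps)
frac_below_50 = sum(p < 0.50 for p in ps) / len(ps)
print(round(frac_below_05, 3), round(frac_below_50, 3))
```

The fraction of null p-values below 0.05 comes out near 0.05, which is just the false positive rate under discussion restated.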
No; it’s standard to set the threshold for your statistical test at 95% confidence. Studies with larger samples can detect smaller differences between groups with that same statistical power.
“Power” is a statistical term of art, and its technical meaning is neither 1 - alpha nor 1 - p.
Oops; you’re right. Careless of me; fixed.
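The larger-samples point can be made concrete with the standard sample-size formula for a two-sample z-test (a sketch; the 80% power figure and unit variance are my assumptions, not anything from the thread):

```python
from statistics import NormalDist

def min_detectable_effect(n_per_group, alpha=0.05, power=0.80):
    # Smallest standardized difference a two-sample z-test with unit
    # variance can detect at significance level alpha with the given
    # power: delta = (z_{1-alpha/2} + z_{power}) * sqrt(2 / n).
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * (2 / n_per_group) ** 0.5

small = min_detectable_effect(100)
large = min_detectable_effect(400)
print(round(small, 3), round(large, 3))
```

The detectable effect scales as 1/sqrt(n), so quadrupling the sample size halves the difference the study can detect at the same threshold.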
It’s times like this that I wish Doctor Seuss was a mathematician (or statistician in this case). If they were willing to make up new words, we’d be able to talk without accidentally using jargon that has technical meaning we didn’t intend.
I’m confused about how this works.
Suppose the standard were to use 80% confidence. Would it still be surprising to see 60 of 60 studies agree that A and B were not linked? Suppose the standard were to use 99% confidence. Would it still be surprising to see 60 of 60 studies agree that A and B were not linked?
Also, doesn’t the prior plausibility of the connection being tested matter for attempts to detect experimenter bias this way? E.g., for any given convention about confidence intervals, shouldn’t we be quicker to infer experimenter bias when a set of studies conclude (1) that there is no link between eating lithium batteries and suffering brain damage vs. when a set of studies conclude (2) that there is no link between eating carrots and suffering brain damage?
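Under the simplifying assumptions that there is truly no link and the 60 studies are independent, the arithmetic behind the first pair of questions is short: each study "agrees" (fails to reject) with probability 1 - alpha, so all 60 agree with probability (1 - alpha)^60.

```python
# Probability that all 60 independent studies of a truly-null effect
# come back non-significant, at three candidate significance levels.
p_all_agree = {alpha: (1 - alpha) ** 60 for alpha in (0.20, 0.05, 0.01)}
for alpha, p in p_all_agree.items():
    print(f"alpha={alpha}: P(60 of 60 agree) = {p:.2g}")
```

So at 80% confidence, 60 of 60 agreeing would be astonishing (about 1.5 in a million); at 95% it happens only about 4.6% of the time; at 99% it is close to a coin flip.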
“95% confidence” means “I am testing whether X is linked to Y. I know that the data might randomly conspire against me to make it look as if X is linked to Y. I’m going to look for an effect so large that, if there is no link between X and Y, the data will conspire against me only 5% of the time to look as if there is. If I don’t see an effect at least that large, I’ll say that I failed to show a link between X and Y.”
If you went for 80% confidence instead, you’d be looking for an effect that wasn’t quite as big. You’d be able to detect smaller clinical effects—for instance, a drug that has a small but reliable effect—but if there were no effect, you’d be fooled by the data 20% of the time into thinking that there was.
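That trade-off is easy to see in a quick simulation (my construction, not the thread's: a one-sided z-test on 25 unit-variance observations, with a true effect of 0.3 standard deviations in the "drug works" runs):

```python
import random
from statistics import NormalDist

def rejects(true_mean, n, alpha, rng):
    # One-sided z-test of "no effect" on n unit-variance observations.
    z_crit = NormalDist().inv_cdf(1 - alpha)
    sample_mean = sum(rng.gauss(true_mean, 1) for _ in range(n)) / n
    return sample_mean * n ** 0.5 > z_crit

rng = random.Random(1)
trials = 2000
results = {}
for alpha in (0.05, 0.20):
    # "Fooled": rejecting when the true effect is zero.
    fooled = sum(rejects(0.0, 25, alpha, rng) for _ in range(trials)) / trials
    # "Caught": rejecting when there is a small real effect.
    caught = sum(rejects(0.3, 25, alpha, rng) for _ in range(trials)) / trials
    results[alpha] = (fooled, caught)
print(results)
```

Loosening to 80% confidence roughly quadruples the false alarm rate but catches the small real effect much more often, which is exactly the bargain described above.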
It would if the papers claimed to find a connection. When they claim not to find a connection, I think not. Suppose people decided to test the hypothesis that stock market crashes are caused by the Earth’s distance from Mars. They would gather data on Earth’s distance from Mars, and on movements in the stock market, and look for a correlation.
If there is no relationship, there should be zero correlation, on average. That (approximately) means that half of all studies will show a negative correlation, and half a positive one.
They need to pick a number, and say that if they find a positive correlation above that number, they’ve proven that Mars causes stock market crashes. And they pick that number by finding the correlation just exactly large enough that, if there is no relationship, it happens 5% of the time by chance.
If the proposition is very very unlikely, somebody might insist on a 99% confidence interval instead of a 95% confidence interval. That’s how prior plausibility would affect it. Adopting a standard of 95% confidence is really a way of saying we agree not to haggle over priors.
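The threshold-picking procedure two paragraphs up can be sketched with simulated, truly-unrelated series standing in for the Mars-distance and stock-market data (all numbers here are made up):

```python
import random

def pearson(xs, ys):
    # Sample Pearson correlation, computed from first principles.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
n_obs, n_studies = 50, 4000
# Many "studies" of two independent series: the null is true by construction.
corrs = sorted(
    pearson([rng.gauss(0, 1) for _ in range(n_obs)],
            [rng.gauss(0, 1) for _ in range(n_obs)])
    for _ in range(n_studies)
)
# About half the null correlations come out negative, as claimed above.
negative_half = sum(c < 0 for c in corrs) / n_studies
# The threshold is the correlation exceeded only 5% of the time by chance.
threshold = corrs[int(0.95 * n_studies)]
print(round(negative_half, 2), round(threshold, 3))
```

With 50 observations per study the chance threshold lands around 0.24; a study reporting a correlation above that would "prove" the Mars link at 95% confidence even though none exists.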
I think it is “only at most 5% of the time”.
No, we are choosing the effect size before we do the study. We choose it so that if the true effect is zero, we will have a false positive exactly 5% of the time.
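One way to make the "chosen before the study, exactly 5%" point concrete (a sketch, assuming a one-sided z-test with known unit variance and n observations):

```python
from statistics import NormalDist

def cutoff(n, alpha=0.05):
    # Pre-data cutoff on the observed mean effect: reject "no effect"
    # when the sample mean exceeds this value. By construction,
    # P(sample mean > cutoff | true effect is 0) = alpha exactly.
    return NormalDist().inv_cdf(1 - alpha) / n ** 0.5

print(round(cutoff(100), 4))
```

The cutoff depends only on alpha and the planned sample size, not on any data, which is what makes the false positive rate exactly 5% rather than at most 5% in this simple-null setting.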
How does this work for a binary quantity?
If your experiment tells you that [x > 45] with 99% confidence, you may in certain cases be able to confidently transform that to [x > 60] with 95% confidence.
For example, if your experiment tells you that the mass of the Q particle is 1.5034(42) with 99% confidence, maybe you can say instead that it’s 1.5034(32) with 95% confidence.
If your experiment happens to tell you that [particle Q exists] is true with 99% confidence, what kind of transformation can you apply to get 95% confidence instead? Discard some of your evidence? Add noise into your sensor readings?
Roll dice before reporting the answer?
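For a continuous estimate, the [x > 45] to [x > 60] style transformation is mechanical, assuming the estimate is approximately normal (all numbers below are made up, not the thread's): both bounds come from the same point estimate and standard error, just with different quantiles.

```python
from statistics import NormalDist

z99 = NormalDist().inv_cdf(0.99)   # one-sided 99% quantile, ~2.326
z95 = NormalDist().inv_cdf(0.95)   # one-sided 95% quantile, ~1.645

# Suppose an experiment reports "x > 45 with 99% confidence", meaning
# estimate - z99 * se = 45, and the point estimate happens to be 80.
estimate = 80.0
se = (estimate - 45.0) / z99       # standard error implied by the report
bound_95 = estimate - z95 * se     # the tighter 95% lower bound
print(round(bound_95, 1))
```

The 95% bound is strictly higher than the 99% one, because less confidence buys a tighter claim. Nothing analogous exists for a yes/no proposition like "particle Q exists": there is no standard error to rescale.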
We’re not talking about a binary quantity.
According to Wikipedia:
Quoting authorities without further commentary is a dick thing to do. I am going to spend more words speculating about the intention of the quote than are in the quote, let alone that you bothered to type.
I have no idea what you think is relevant about that passage. It says exactly what I said, except transformed from the effect size scale to the p-value scale. But somehow I doubt that’s why you posted it. The most common problem in the comments on this thread is that people confuse false positive rate with false negative rate, so my best guess is that you are making that mistake and thinking the passage supports that error (though I have no idea why you’re telling me). Another possibility, slightly more relevant to this subthread, is that you’re pointing out that some people use other p-values. But in medicine, they don’t. They almost always use 95%, though sometimes 90%.
My confusion is about “at least” vs. “exactly”. See my answer to Cyan.
You want size, not p-value. The difference is that size is a “pre-data” (or “design”) quantity, while the p-value is post-data, i.e., data-dependent.
Thanks.
So if I set size at 5%, collect the data, and run the test, and repeat the whole experiment with fresh data multiple times, should I expect that, if the null hypothesis is true, the test rejects exactly 5% of times, or at most 5% of times?
If the null hypothesis is simple (that is, if it picks out a single point in the hypothesis space), and the model assumptions are true blah blah blah, then the test (falsely) rejects the null with exactly 5% probability. If the null is composite (comprises a non-singleton subset of parameter space), and there is no nice reduction to a simple null via mathematical tricks like sufficiency or the availability of a pivot, then the test falsely rejects the null with at most 5% probability.
But that’s all very technical; somewhat less technically, almost always, a bootstrap procedure is available that obviates these questions and gets you to “exactly 5%”… asymptotically. Here “asymptotically” means “if the sample size is big enough”. This just throws the question onto “how big is big enough,” and that’s context-dependent. And all of this is about one million times less important than the question of how well each study addresses systematic biases, which is an issue of real, actual study design and implementation rather than mathematical statistical theory.
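The exactly-vs-at-most distinction above can be seen in a small simulation (a sketch, not from the thread: a one-sided z-test whose composite null is "true mean ≤ 0", sized so that rejection happens exactly 5% of the time at the boundary point, mean = 0):

```python
import random
from statistics import NormalDist

def reject_rate(true_mean, n=25, alpha=0.05, trials=4000, seed=0):
    # Fraction of repeated experiments in which a one-sided z-test
    # rejects "true mean <= 0" at level alpha.
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.gauss(true_mean, 1) for _ in range(n)) / n
        hits += mean * n ** 0.5 > z_crit
    return hits / trials

at_boundary = reject_rate(0.0)    # ~0.05: the "exactly 5%" case
inside_null = reject_rate(-0.5)   # far below 0.05: the "at most 5%" case
print(at_boundary, inside_null)
```

At the boundary of the composite null the false rejection rate is the full 5%; at points strictly inside the null it is much lower, which is why the honest statement for a composite null is "at most 5%".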
How does your choice of threshold (made beforehand) affect your actual data and the information about the actual phenomenon contained therein?