Or, for instance in the case of particle physics, it means the probability you are just looking at background. You are painting with an overly broad brush. Sure, p-values are overused, but there are situations where the p-value IS the right thing to look at.
Or, for instance in the case of particle physics, it means the probability you are just looking at background.
No, it’s the probability that you’d see a result that extreme (or more extreme) conditioned on just looking at background. Frequentists can’t evaluate unconditional probabilities, and ‘probability that I see noise given that I see X’ (if that’s what you had in mind) is quite different from ‘probability that I see X given that I see noise’.
(Incidentally, the fact that this kind of conflation is so common is one of the strongest arguments against defaulting to p-values.)
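To make the direction of conditioning concrete, here is a minimal numeric sketch; every input number below is invented purely for illustration, not taken from any experiment:

```python
# Minimal sketch, with made-up numbers, of why P(result | background) differs
# from P(background | result). Suppose an excess at least this extreme comes
# either from pure background or from a real signal.
p_excess_given_background = 0.001  # the p-value-style quantity
p_excess_given_signal = 0.5        # assumed chance of such an excess if a real effect exists
prior_signal = 0.01                # assumed prior that a real effect is there at all

prior_background = 1 - prior_signal
p_excess = (p_excess_given_background * prior_background
            + p_excess_given_signal * prior_signal)
p_background_given_excess = (p_excess_given_background * prior_background) / p_excess

print(p_excess_given_background)            # 0.001
print(round(p_background_given_excess, 3))  # ~0.165: a very different number
```

With these (arbitrary) inputs, a 0.001 tail probability under background coexists with a roughly 17% posterior probability that it is background after all, which is exactly the gap the comment above is pointing at.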
Keep in mind that he and other physicists do not generally consider “probability that it is noise, given an observation X” to even be a statement about the world (it’s a statement about one’s personal beliefs, after all, one’s confidence in the engineering of an experimental apparatus, and so on and so forth), so they are perhaps conflating much less than it would appear under a very literal reading. This is why I like the idea of using the word “plausibility” to describe beliefs, and “probability” to describe things such as the probability of an event rigorously calculated using a specific model.
edit: note, by the way, that physicists can consider a very strong result (e.g. those superluminal neutrinos) extremely implausible on the basis of a prior, and correctly conclude that there is most likely a problem with their machinery, on the basis of the ratio of the likelihood of seeing that via noise to the likelihood of seeing it via a hardware fault. How is that even possible without actually performing Bayesian inference?
edit2: also note that there is a fundamental difference: with plausibilities you have to be careful to avoid vicious cycles in the collective reasoning. A plausibility, if it is to be combined with other plausibilities, is not a bare real number; it is a real number with an attached description of how exactly it was arrived at, so that evidence is not double-counted. The number by itself is of little use for communication, for this reason.
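A rough sketch of the three-hypothesis reasoning in the first edit above; the priors and likelihoods are invented for illustration only, and the point is just the mechanics (posterior ∝ prior × likelihood):

```python
# Rough sketch, with invented numbers: an apparently superluminal result is
# weighed against three hypotheses. All priors/likelihoods are assumptions.
priors = {
    "real superluminal signal": 1e-9,   # assumed: extremely implausible a priori
    "statistical fluke":        0.9,    # assumed prior that nothing odd is going on
    "hardware/timing fault":    0.1,    # assumed prior for some systematic problem
}
# Assumed likelihood of seeing this strong, persistent anomaly under each hypothesis
likelihoods = {
    "real superluminal signal": 1.0,
    "statistical fluke":        1e-12,  # far too strong to be noise
    "hardware/timing fault":    1e-2,
}

unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
total = sum(unnormalised.values())
posteriors = {h: v / total for h, v in unnormalised.items()}
for h, p in posteriors.items():
    print(f"{h:26s} {p:.6f}")
# The hardware-fault hypothesis dominates, even though the data are
# overwhelmingly inconsistent with pure noise.
```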
Keep in mind that he and other physicists do not generally consider “probability that it is noise, given an observation X” to even be a statement about the world (it’s a statement about one’s personal beliefs, after all, one’s confidence in the engineering of an experimental apparatus, and so on and so forth)
It’s about the probability that there is an effect which will cause this deviation from background to become more and more supported by additional data rather than simply regress to the mean (or with your wording, the other way around). That seems fairly based-in-the-world to me.
The actual reality either has this effect, or it does not. You can quantify your uncertainty with a number, but that would require you to assign some a priori probability, which you’ll have to choose arbitrarily.
You can contrast this to a die roll, which scrambles the initial phase space, mapping (approximately, but very nearly) 1⁄6 of any physically small region of it to each number on the die; the 1⁄6 is an objective property of how symmetrical dice bounce.
Such statements are about the world, in a framework of probability.
They are specific to your idiosyncratic choice of prior; I am not interested in hearing them (in the context of science), unlike statements about the world.
That knowledge is subjective doesn’t mean that such statements are not about the world. Furthermore, such statements can (and sometimes do) have arguments for the priors...
By this standard, any ‘statement about the world’ ignores all of the uncertainty that actually applies. Science doesn’t require you to sweep your ignorance under the rug.
Or, for instance in the case of particle physics, it means the probability you are just looking at background.
Well, technically, it’s the probability that you will end up with such a result given that you are just looking at background. I.e. the probability that, after the experiment, you will end up looking at background while thinking it is not background*, assuming it is all background.
* if it is used as the threshold for such thinking
It’s really awkward to describe that in English, though, and I just assume that this is what you mean (while Bayesians assume that you are conflating the two).
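A small simulation of that reading, under the stated assumption that the threshold is used as the rejection rule: when the data really are all background, the long-run fraction of experiments in which you “end up thinking it is not background” matches the threshold.

```python
# Pure-background experiments, rejected whenever the test statistic exceeds
# the one-sided threshold for alpha ~= 0.05.
import random

def one_background_experiment():
    z = random.gauss(0.0, 1.0)  # the data are just noise
    return z > 1.645            # one-sided z threshold for alpha ~= 0.05

random.seed(0)
trials = 100_000
false_alarms = sum(one_background_experiment() for _ in range(trials))
print(false_alarms / trials)    # ~0.05: rate of "thinking it is not background"
```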
Or, for instance in the case of particle physics, it means the probability you are just looking at background. You are painting with an overly broad brush. Sure, p-values are overused, but there are situations where the p-value IS the right thing to look at.
Note that the ‘brush’ I am using is essentially painting the picture “0.05 is for sissies”, not a rejection of p-values (which I may do elsewhere but with less contempt). The physics reference was to illustrate the contrast of standards between fields and why physics papers can be trusted more than medical papers.
That’s what multiple testing correction is for.
With the thresholds from physics, we’d still be figuring out if penicillin really, actually kills certain bacteria (somewhat hyperbolic, 5 sigma ~ 1 in 3.5 million).
0.05 is a practical tradeoff; for supposed Bayesians, it is still much too strict, not too lax.
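To make the numbers in this exchange concrete, here is a small sketch converting sigma thresholds to one-sided tail probabilities, plus an illustrative Bonferroni-style adjustment for the multiple-testing point above (the number of tests m is an arbitrary example):

```python
import math

def one_sided_p(z_sigma):
    """One-sided Gaussian tail probability for a z-sigma excess."""
    return 0.5 * math.erfc(z_sigma / math.sqrt(2))

print(one_sided_p(5))            # ~2.87e-7, i.e. roughly 1 in 3.5 million
print(one_sided_p(1.645))        # ~0.05

# Bonferroni-style correction: with m independent looks/tests, a per-test
# threshold of alpha/m keeps the chance of at least one false alarm near alpha.
m, alpha = 100, 0.05
print(alpha / m)                 # per-test threshold 5e-4
print(1 - (1 - alpha / m) ** m)  # family-wise false-alarm rate ~0.049
```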
I for one think that 0.05 is way too lax (other than for purposes such as deciding whether it is worth conducting a bigger study, and other such value-of-information uses), and 0.05 results require a rather carefully constructed meta-study to interpret correctly. That is because a selection factor of 20 is well within the range attainable by dodgy practices that are almost impossible to prevent, and, even in the absence of dodgy practices, by selection due to your being more likely to hear of something interesting.
I can only imagine considering it too strict if I were unaware of those issues or their importance (Bayesianism or not).
This goes much more so for weaker forms of information, such as “Here’s a plausible looking speculation I came up with”. To get anywhere with that kind of stuff one would need to somehow account for the preference towards specific lines of speculation.
edit: plus, effective cures in medicine are the ones supported by very, very strong evidence, on par with particle physics (e.g. the same penicillin killing bacteria: you have really big sample sizes when you are dealing with bacteria). The weak stuff is things like antidepressants for which we don’t know whether they lower or raise the risk of suicide, and for which we are uncertain whether the effect is an artefact of using, in any way whatsoever, a depression score that includes weight loss and insomnia as symptoms when testing a drug that causes weight gain and sleepiness.
I think it is mostly because priors for finding a strongly effective drug are very low, so when large p-values are involved, you can only find low effect, near-placebo drugs.
edit2: Another issue is that many studies are plagued by at least some un-blinding that can modulate the placebo effect. So I think a threshold on the strength of the effect (not just the p-value) is also necessary: things that are within the potential systematic error margin from the placebo effect may mostly be a result of systematic error.
edit3: By the way, note that for a study of the same size, a stronger effect will result in a much lower p-value, so a higher standard on p-values does not interfere much with the detection of strong effects. When you are testing an antibiotic… well, the chance probability of one bacterium dying in some short timespan may be 0.1, and with the antibiotic at a fairly high concentration, 99.99999…%. Needless to say, a dozen bacteria put you far beyond the standards of particle physics, and a whole poisoned petri dish makes the point moot, with all the remaining uncertainty coming from the possibility of the bacteria having been killed in some other way.
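A back-of-the-envelope check of that edit, under simple assumptions (independent deaths, chance death probability 0.1 in the observation window, all twelve bacteria observed to die):

```python
import math

p_chance_death = 0.1   # assumed chance probability of one bacterium dying anyway
n_bacteria = 12

# Probability that all twelve die by chance alone (the p-value under the null):
p_value = p_chance_death ** n_bacteria
print(p_value)                    # 1e-12

five_sigma = 0.5 * math.erfc(5 / math.sqrt(2))
print(five_sigma)                 # ~2.9e-7
print(p_value < five_sigma)       # True: a dozen bacteria already beat 5 sigma
```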
It probably is too lax. I’d settle for 0.01, but 0.005 or 0.001 would be better for most applications (i.e. where you can get it). We have the whole range of numbers between 1 in 25 and 1 in 3.5 million to choose from, and I’d like to see an actual argument before concluding that the number we picked mostly by historical accident was actually right all along.
Still, a big part of the problem is the ‘p-value’ itself, not the number coming after it. Apart from the statistical issues, it’s far too often mistaken for something else, as RobbBB has pointed out elsewhere in this thread.
0.05 is a practical tradeoff; for supposed Bayesians, it is still much too strict, not too lax.
No, it isn’t. In an environment where the incentive to find a positive result is huge and there are all sorts of flexibilities in which particular results to report and which studies to abandon entirely, 0.05 leaves far too many false positives. It really does begin to look like this. I don’t advocate using the standards from physics, but p=0.01 would be preferable.
Mind you, there is no particularly good reason for equating any particular arbitrary p-value with ‘significance’ anyhow.
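To put rough numbers on “far too many false positives”, here is a standard positive-predictive-value style calculation; the prior, power, and effective alpha values are all invented for illustration:

```python
def false_positive_share(prior_true, power, effective_alpha):
    """Among results that come out 'significant', what fraction are false?"""
    true_hits = prior_true * power
    false_hits = (1 - prior_true) * effective_alpha
    return false_hits / (true_hits + false_hits)

# 10% of tested hypotheses true, decent power, nominal alpha honestly applied:
print(false_positive_share(prior_true=0.1, power=0.8, effective_alpha=0.05))  # ~0.36

# Same field, but flexible reporting inflates the effective alpha several-fold:
print(false_positive_share(prior_true=0.1, power=0.8, effective_alpha=0.20))  # ~0.69

# A stricter nominal threshold claws a lot of that back:
print(false_positive_share(prior_true=0.1, power=0.8, effective_alpha=0.01))  # ~0.10
```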
Well, I would find it really awkward for a Bayesian to condone a modus operandi such as “The p-value of 0.15 indicates it is much more likely that there is a correlation than that the result is due to chance; however, for all intents and purposes the scientific community will treat the correlation as non-existent, since we’re not sufficiently certain of it (even though it likely exists)”.
It’s similar to having a choice of two roads to go down, one of which leads into the forbidden forest, and then saying “while I have decent evidence about which way goes where, because I’m not yet really certain, I’ll just toss a coin.” How many false choices would you make in life using an approach like that? Neglecting your duty to update, so to speak. A p-value of 0.15 is important evidence. A p-value of 0.05 is even stronger evidence. It should not be disregarded, regardless of the perverse incentives in publishing and the false binary choice (if (p<=0.05) correlation=true, else correlation=false). However, for the medical community, a p-value of 0.15 might as well be 0.45, for practical purposes. Not published = not published.
This is especially pertinent given that many important chance discoveries may only barely reach significance initially, not because their effect size is so small, but because in medicine sample sizes often are, with the accompanying low power for discovering new effects. When you’re just a grad student with samples from e.g. 10 patients (no economic incentive yet, not yet a large trial), then unless you’ve found magical ambrosia, p-values will tend to be “insignificant”, even for potentially significant breakthrough drugs.
Better to check out a few false candidates too many than to falsely dismiss important new discoveries. Falsely claiming a promising new substance to have no significant effect due to p-value shenanigans is much worse than not having tested it in the first place, since the “this avenue was fruitless” conclusion can steer research in the wrong direction (information spreads around somewhat even when unpublished, “group abc had no luck with testing substances xyz”).
IOW, I’m more concerned with false negatives (which may never get discovered as such: a lost chance) than with false positives (which get discovered later on, in larger follow-up trials, as being false positives). A sliding p-value scale may make sense, with initial screening tests having a lax barrier that signifies “should be investigated further”, and a stricter standard for the follow-up investigations.
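A rough power sketch for the small-trial situation described above, using a normal approximation to a two-sample comparison; the effect size and arm sizes are illustrative assumptions, not claims about any particular trial:

```python
from statistics import NormalDist

def approx_power(d, n_per_arm, alpha):
    """Approximate power of a two-sided two-sample test (normal approximation)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    noncentrality = d * (n_per_arm / 2) ** 0.5
    return 1 - nd.cdf(z_crit - noncentrality)

# A fairly large effect (d = 0.8) with only 10 patients per arm:
print(approx_power(0.8, 10, 0.05))    # ~0.43: more likely than not to "miss"
print(approx_power(0.8, 10, 0.005))   # ~0.15: a stricter threshold hurts more
# The same effect in a larger follow-up trial:
print(approx_power(0.8, 50, 0.005))   # ~0.88
```

This is the sliding-scale intuition in numbers: a lax screening threshold plus a stricter, better-powered follow-up.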
Well, I would find it really awkward for a Bayesian to condone a modus operandi such as “The p-value of 0.15 indicates it is much more likely that there is a correlation than that the result is due to chance; however, for all intents and purposes the scientific community will treat the correlation as non-existent, since we’re not sufficiently certain of it (even though it likely exists)”.
And this is a really, really great reason not to identify yourself as “Bayesian”. You end up not using effective methods when you can’t derive them from Bayes’ theorem. (Which is to be expected absent very serious training in deriving things.)
Better to check out a few false candidates too many than to falsely dismiss important new discoveries
Where do you think the funds for testing false candidates are going to come from? If you are checking too many false candidates, you are dismissing important new discoveries. You are also robbing time away from any exploration into the unexplored space.
edit: also I think you overestimate the extent to which promising avenues of research are “closed” by a failure to confirm. It is understood that a failure can result from a multitude of causes. Keep in mind also that a stronger effect gives you a much lower p-value at the same sample size (equivalently, the sample size needed to reach a given p-value shrinks roughly as the inverse square of the effect size). You are at much less of a risk of dismissing strong results.
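A sketch of that scaling under a simple one-sample normal model; the effect sizes and sample size are arbitrary illustrative inputs:

```python
from statistics import NormalDist
import math

nd = NormalDist()

def one_sided_p(effect_d, n):
    """Approximate one-sided p-value when the observed standardized effect is d."""
    z = effect_d * math.sqrt(n)
    return 1 - nd.cdf(z)

n = 25
print(one_sided_p(0.5, n))   # ~6.2e-3
print(one_sided_p(1.0, n))   # ~2.9e-7: double the effect, same n, roughly 5 sigma

# Conversely, the sample size needed for a fixed threshold scales ~1/d^2:
def n_needed(effect_d, alpha):
    z_crit = nd.inv_cdf(1 - alpha)
    return (z_crit / effect_d) ** 2

print(n_needed(1.0, 2.87e-7), n_needed(0.5, 2.87e-7))  # ~25 vs ~100
```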
Well, I would find it really awkward for a Bayesian to condone a modus operandi such as “The p-value of 0.15 indicates it is much more likely that there is a correlation than that the result is due to chance; however, for all intents and purposes the scientific community will treat the correlation as non-existent, since we’re not sufficiently certain of it (even though it likely exists)”.
The way statistically significant scientific studies are currently used is not like this. The meaning conveyed, and the practical effect, of official people declaring statistically significant findings is not a simple declaration of the Bayesian evidence implied by the particular statistical test returning less than 0.05. Because of this, I have no qualms about saying that I would prefer a lower threshold than p<0.05 to be used where that standard is currently used. No rejection of Bayesian epistemology is implied.