I’m not sure that follows.
If a given piece of evidence E1 provides Bayesian likelihood for theory T1 over T2, and E2 was generated by an isomorphic process, then we get the likelihood ratio squared, provided that T1 and T2 are single possible worlds and have no parameters being updated by E1 or E2, so that the probability of the evidence is conditionally independent.
Thus sayeth Bayes, so far as I can tell.
As for the frequentists...
Well, logically, we’re allegedly rejecting a null hypothesis. If the “null hypothesis” contains no parameters to be updated and the probability that E1 was generated by the null hypothesis is .05, and E2 was generated by a causally conditionally independent process, the probability that E1+E2 was generated by the null hypothesis ought to be 0.0025.
But of course gwern’s calculation came out differently in the decimals. This could be because some approximation truncated a decimal or two. But it could also be because frequentism actually calculates the probability that E1 is in some amazing class [E] of other data we could’ve observed but didn’t, to be p < 0.05. Who knows what strange class of other data we could’ve seen but didn’t, a given frequentist method will put E1 + E2 into? I mean, you can make up whatever the hell [E] you want, so who says you’ve got to make up one that makes [E+E] have the probability of [E] squared? So if E1 and E2 are exactly equally likely given the null hypothesis, a frequentist method could say that their combined “significance” is the square of E1, less than the square, more than the square, who knows, what the hell, if we obeyed probability theory we’d be Bayesians so let’s just make stuff up. Sorry if I sound a bit polemical here.
See also: http://lesswrong.com/lw/1gc/frequentist_statistics_are_frequently_subjective/
You can’t just multiply p-values together to get the combined p-value for multiple experiments.
A p-value is a statistic that has a uniform(0,1) distribution if the null hypothesis is true. If you take two independent uniform(0,1) variables and multiply them together, the product is not a uniform(0,1) variable—it has more of its distribution near 0 and less near 1. So multiplying two p-values together does not give you a p-value; it gives you a number that is smaller than the p-value that you would get if you went through the appropriate frequentist procedure.
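A quick way to see this at home is to simulate it. The sketch below assumes both p-values are exactly uniform(0,1) under the null (true for continuous test statistics) and that the two experiments are independent; the function name, trial count, and seed are just illustrative choices.

```python
import random

def fraction_at_or_below(threshold, trials=200_000, seed=0):
    """Estimate P(p1 * p2 <= threshold) for independent Uniform(0,1) p1, p2."""
    rng = random.Random(seed)  # fixed seed so reruns give the same estimate
    hits = sum(
        1 for _ in range(trials)
        if rng.random() * rng.random() <= threshold
    )
    return hits / trials

# If the product were itself uniform(0,1), these fractions would be
# about 0.0025 and 0.5 respectively.
frac_small = fraction_at_or_below(0.05 * 0.05)
frac_half = fraction_at_or_below(0.5)

print(f"P(p1*p2 <= 0.0025) is about {frac_small:.4f}")
print(f"P(p1*p2 <= 0.5)    is about {frac_half:.4f}")
```

Under the null, a product as small as 0.0025 turns up several times more often than 0.25% of the time, so reading the raw product as a p-value overstates the evidence.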
In the course of figuring out what the hell the parent comment was talking about and how one was supposed to do the calculation, I found this. p-values are much clearer for me now, thanks for bringing this up.
Don’t get me wrong, this is a good paper, well-written to be clearly understandable and not to be deliberately obtuse like far too many math papers these days, and the author’s heart is clearly in the right place, but I still screamed while reading it.
How can anyone read this, and not bang their head against the wall at how horribly arbitrary this all is… no wonder more than half of published findings are false.
Unfortunately, walls solid enough to sustain the force of the bang I wanted to produce were not to be found within a radius of five meters when I was reading it. I did want to bang my head on my desk, though.
The arbitrariness of all the decisions (who decides the cutoff point to reject the null and on what basis? “Meh, whatever” seems to be the ruling methodology) did strike me as unscientific. Or, well, as un-((Some Term For What I Used To Think “Science” Meant Until I Saw That Most Of It Was About Testing Arbitrary Hypotheses Rather Than Deliberate Cornering Of Facts)) as something actually following the scientific method can get.
I don’t mind the arbitrary cutoff point. That’s like a Bayesian reporting likelihood ratios and leaving the prior up to the reader.
It’s more things like, “And now we’ll multiply all the significances together, and calculate the probability that their product would be equal to or lower than the result, given the null hypothesis” that make me want to scream. Why not take the arithmetic mean of the significances and calculate the probability of that instead, so long as we’re pretending the actual result is part of an arbitrary class of results? It just seems horribly obvious that you just get further and further away from what the likelihood ratios are actually telling you, as you pile arbitrary test on arbitrary test...
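To make the "why not the arithmetic mean" complaint concrete, here is a minimal sketch comparing the two combination rules for exactly two independent p-values. It assumes both are uniform(0,1) under the null. The closed forms are standard results for the product and the sum of two independent uniforms, and the function names are made up for illustration.

```python
import math

def combined_by_product(p1, p2):
    """P(U1 * U2 <= p1 * p2) for independent Uniform(0,1) U1, U2."""
    x = p1 * p2
    return x * (1.0 - math.log(x))

def combined_by_mean(p1, p2):
    """P((U1 + U2) / 2 <= (p1 + p2) / 2), from the triangular law of U1 + U2."""
    s = p1 + p2
    if s <= 1.0:
        return s * s / 2.0
    return 1.0 - (2.0 - s) ** 2 / 2.0

# Two experiments, each with p = 0.05.
by_product = combined_by_product(0.05, 0.05)
by_mean = combined_by_mean(0.05, 0.05)

print(f"product rule: {by_product:.5f}")
print(f"mean rule:    {by_mean:.5f}")
```

Both rules give valid p-values under the null, but they disagree by more than a factor of three on the same pair of results, which is the arbitrariness being complained about.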
That is a really interesting paper.
Also, I found that the function R_k in Section 2 has the slightly-more-closed formula ρ⋅P_k(log(1/ρ)), where P_k(x) is the first k terms of the Taylor series for e^x (and has the formula with factorials and everything). Just in case anyone wants to try this at home.
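For anyone who does want to try it at home, here is a minimal sketch of that formula. I am assuming R_k(ρ) means the probability, under the null, that the product of k independent uniform(0,1) p-values comes out at or below ρ; the function names are mine, not the paper's.

```python
import math

def taylor_exp_partial(x, k):
    """P_k(x): the first k terms of the Taylor series of e^x."""
    return sum(x ** j / math.factorial(j) for j in range(k))

def r_k(rho, k):
    """rho * P_k(log(1/rho)), for 0 < rho <= 1 and k >= 1."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return rho * taylor_exp_partial(math.log(1.0 / rho), k)

# One experiment: the formula hands the p-value back unchanged.
single = r_k(0.05, 1)
# Two experiments with p = 0.05 each: the product is 0.0025.
double = r_k(0.05 * 0.05, 2)

print(f"k=1: {single:.4f}")
print(f"k=2: {double:.5f}")
```

With k = 2 this reduces to ρ(1 + log(1/ρ)), so two results at 0.05 combine to something noticeably larger than 0.0025.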
A more generous way to think about frequentism (which can be justified by some conditional probability sleight-of-hand) is that the significance of some evidence E is actually the probability that the null hypothesis is true, given E and also some prior distribution that is swept under the rug and (mostly) not under the experimenter’s control. Which is bad, yes, but in many cases the prior distribution is at least close to something reasonable. And there are some cases in which we can somewhat change the prior distribution to reflect our real priors: for example, when choosing to conduct a 1-tailed test rather than a 2-tailed one.
Under this interpretation, it is silly to expect significances to multiply. You’d really be saying something like Pr[H|E1+E2] = Pr[H|E1] Pr[H|E2]. And that’s simply not true: you are double-counting the prior probability Pr[H] when you do this. The frequentist approach is a correct way to combine these probabilities, although this isn’t obvious because nobody actually knows what the frequentist Pr[H] is.
But if you read about two experiments with a p-value of 0.05, and think of them as one experiment with a p-value of 0.0025, you are very very very wrong; not just frequentist-wrong but Bayesian-wrong as well.
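A small numeric check of the double-counting point, as a sketch under made-up assumptions: exactly two point hypotheses (H and one alternative), a prior of 0.5 on H, and each piece of evidence having probability 0.05 under H and 0.5 under the alternative. None of these numbers come from the discussion; they are only there to show that posteriors do not multiply.

```python
def posterior(prior_h, lik_h, lik_alt):
    """Pr[H | E] by Bayes' theorem with a single alternative hypothesis."""
    numerator = prior_h * lik_h
    return numerator / (numerator + (1.0 - prior_h) * lik_alt)

PRIOR_H = 0.5    # hypothetical prior on H
LIK_H = 0.05     # hypothetical Pr[E | H]
LIK_ALT = 0.5    # hypothetical Pr[E | alternative]

post_one = posterior(PRIOR_H, LIK_H, LIK_ALT)
# Two conditionally independent pieces of evidence: likelihoods multiply.
post_two = posterior(PRIOR_H, LIK_H ** 2, LIK_ALT ** 2)
# Equivalent route: feed the first posterior back in as the new prior.
post_sequential = posterior(post_one, LIK_H, LIK_ALT)
# The tempting shortcut that counts the prior twice.
naive_product = post_one * post_one

print(f"Pr[H|E1]           = {post_one:.5f}")
print(f"Pr[H|E1,E2]        = {post_two:.5f}")
print(f"sequential update  = {post_sequential:.5f}")
print(f"Pr[H|E1]*Pr[H|E2]  = {naive_product:.5f}")
```

The batch and sequential updates agree with each other, and neither matches the product of the two single-experiment posteriors.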
No frequentist says this. They don’t believe in P(H|E). That’s the explicit basis of the whole philosophy. People who talk about the probability of a hypothesis given the evidence are Bayesians, full stop.
Statistical significance is, albeit in a strange and distorted way, supposed to be about P(E|null hypothesis), and so, yes, two experiments with a p-value of 0.05 should add up to somewhere in the vicinity of p < 0.0025, because it’s about likelihoods, which do multiply, and not posteriors.
While some frequentist methods do use likelihoods, the mapping from likelihood to p-value is non-linear, so multiplying them would still be a mistake, at least as far as I can tell.
I’m not saying that frequentists believe this. I’m saying that the frequentist math (which computes Pr[E|H0]) is equivalent to computing Pr[H0|E] with respect to a prior distribution under which Pr[H0]=Pr[E]. Furthermore, this is a reasonable thing to look at, because from that point of view the way statistical significances combine actually makes sense.
Whaa?
Well, we have, in general, Pr[H0|E] = Pr[E|H0] * Pr[H0]/Pr[E]. Frequentists compute Pr[E|H0] instead of Pr[H0|E], but this turns out not to matter if Pr[H0]/Pr[E] cancels, which happens when the above equality holds.
From a certain point of view, this is just mathematical sleight of hand, of course. Also, the “E” is actually some class of outcomes that are grouped together (e.g. all outcomes in which 8 or more coins, out of 10, came up heads). But if we combine sequences of experimental results in the correct way, then this means that the frequentist and Bayesian result differ only by a constant factor (precisely the factor which we assumed, above, to be 1).
Why the heck would the probability of seeing the evidence, conditional on the mix of all hypotheses being considered, exactly equal the prior probability of the null hypothesis?
It wouldn’t. Probably a better way to explain it would have been to factor their ratio out as a constant.
Anyway, I’ve totally messed up explaining this, so I will fold for now and direct you to a completely different argument made elsewhere in the comments which is more worthy of being considered.
Suppose that our data are coin flips, and consider three hypotheses: H0 = always heads, H1 = fair coin, H2 = heads with probability 25%. Now suppose that the two hypotheses we actually want to test between are H0 and H’ = 0.5(H1+H2). After seeing a single heads, the likelihood of H0 is 1 and the likelihood of H’ is 0.5(0.5+0.25). After seeing two heads, the likelihood of H0 is 1 and the likelihood of H’ is 0.5(0.5^2+0.25^2). In general, the likelihood of H’ after n heads is 0.5(0.5^n+0.25^n), i.e. a mixture of multiple geometric functions. In general if H’ is a mixture of many hypotheses, the likelihood will be a mixture of many geometric functions, and therefore could be more or less arbitrary.
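The coin example is easy to check numerically. This sketch just evaluates the mixture likelihood 0.5(0.5^n + 0.25^n) from the comment and compares the likelihood ratio after two heads with the square of the ratio after one head; the function names are illustrative.

```python
def likelihood_h0(n_heads):
    """H0 = always heads: every run of heads has probability 1."""
    return 1.0

def likelihood_h_prime(n_heads):
    """H' = equal mixture of a fair coin and a 25%-heads coin."""
    return 0.5 * (0.5 ** n_heads + 0.25 ** n_heads)

def ratio_h0_over_h_prime(n_heads):
    return likelihood_h0(n_heads) / likelihood_h_prime(n_heads)

ratio_one = ratio_h0_over_h_prime(1)
ratio_two = ratio_h0_over_h_prime(2)

print(f"ratio after 1 head:  {ratio_one:.4f}")
print(f"ratio after 2 heads: {ratio_two:.4f}")
print(f"(ratio after 1)^2:   {ratio_one ** 2:.4f}")
```

Because H’ is a mixture, the second head carries a different likelihood ratio than the first, so the two-head ratio is not the square of the one-head ratio.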
That’s why I specified single possible worlds / hypotheses with no internal parameters that are being learned.
Oops, missed that; but that specification doesn’t hold in the situation we care about, since rejecting the null hypothesis typically requires us to consider the result of marginalizing over a space of alternative hypotheses (well, assuming we’re being Bayesians, but I know you prefer that anyways =P).
Well, right, assuming we’re Bayesians, but when we’re just “rejecting the null hypothesis” we should mostly be concerned about likelihood from the null hypothesis which has no moving parts, which is why I used the log approximation I did. But at this point we’re mixing frequentism and Bayes to the point where I shan’t defend the point further—it’s certainly true that once a Bayesian considers more than exactly two atomic hypotheses, the update on two independent pieces of evidence doesn’t go as the square of one update (even though the likelihood ratios still go as the square, etc.).
You’re right. That would be true if we did n independent tests, not one test with n-times the subjects.
e.g. probability of 60 or more heads in 100 tosses = .028
probability of 120 or more heads in 200 tosses = .0028
but .028^2 ≈ .0008
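These tail probabilities can be recomputed exactly with integer arithmetic. A minimal sketch, assuming a fair coin as the null hypothesis and a one-sided "this many heads or more" test; the function name is illustrative.

```python
from math import comb

def upper_tail_fair_coin(min_heads, tosses):
    """P(at least min_heads heads in `tosses` flips of a fair coin)."""
    favourable = sum(comb(tosses, h) for h in range(min_heads, tosses + 1))
    return favourable / 2 ** tosses  # exact integers, one final division

p_100 = upper_tail_fair_coin(60, 100)
p_200 = upper_tail_fair_coin(120, 200)

print(f"60+ heads in 100:  {p_100:.4f}")
print(f"120+ heads in 200: {p_200:.4f}")
print(f"square of first:   {p_100 ** 2:.5f}")
```

The doubled experiment's p-value comes out several times larger than the square of the single experiment's p-value, which is the gap the three lines above point at.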
Amazing, innit? Meanwhile in the land of the sane people, the likelihood function from any given propensity to come up heads, to the observed data, is exactly squared for 120 in 200 vs. 60 in 100.
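The squaring claim is easy to verify. This sketch uses the likelihood kernel p^heads (1-p)^tails and leaves out the binomial coefficient, which does not depend on p and cancels in any likelihood ratio; the grid of propensities is an arbitrary choice.

```python
import math

def likelihood_kernel(p, heads, tails):
    """Probability of one specific sequence with this many heads and tails."""
    return p ** heads * (1.0 - p) ** tails

propensities = [0.1, 0.25, 0.5, 0.6, 0.75, 0.9]

all_squared = all(
    math.isclose(
        likelihood_kernel(p, 120, 80),
        likelihood_kernel(p, 60, 40) ** 2,
        rel_tol=1e-9,
    )
    for p in propensities
)

# Likelihood ratio of "60% heads" over "fair coin" for each data set.
lr_100 = likelihood_kernel(0.6, 60, 40) / likelihood_kernel(0.5, 60, 40)
lr_200 = likelihood_kernel(0.6, 120, 80) / likelihood_kernel(0.5, 120, 80)

print(f"kernel squares at every p: {all_squared}")
print(f"LR for 60/100:  {lr_100:.3f}")
print(f"LR for 120/200: {lr_200:.3f}")
```

Unlike the tail probabilities, the likelihood ratio between any two fixed propensities for 120 in 200 is the square of the ratio for 60 in 100.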