As a Bayesian, I’m very happy to see an attempted steelman of hypothesis testing. Too often I see Bayesian criticism of “frequentist” reasoning that no frequentist statistician would ever actually apply. Unfortunately, this is a failed steelman (even granting the first premise): the description of the process of hypothesis testing is wrong, and as a result the actual near-syllogism underlying hypothesis testing is not properly explained.
The first flaw with the description of the process is that it omits the need for some kind of ordering on the set of hypotheses, plus the need for a statistic (a function from the sample space to a totally ordered set) such that more extreme statistic values are more probable (in some sense, e.g., ordered by median or ordered by expected value) the further an alternative is from the null. This is not too restrictive as a mathematical condition, but it often involves throwing away relevant information in the data (basically, any time there isn’t a sufficient statistic).
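To make that ordering requirement concrete, here is a minimal sketch (my own illustration, not anything from the original post) using the textbook one-sided normal-location setup; the specific numbers are assumptions chosen for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: H0: mu = 0 against one-sided alternatives mu > 0,
# with data X_1..X_n ~ Normal(mu, 1). The statistic T = sqrt(n) * mean(X)
# maps each sample to a real number, and its distribution shifts rightward
# as mu moves away from 0 -- the monotonicity the test requires.
n = 25
for mu in (0.0, 0.2, 0.5, 1.0):
    samples = rng.normal(mu, 1.0, size=(100_000, n))
    t = np.sqrt(n) * samples.mean(axis=1)
    print(f"mu = {mu:3.1f}: median of T = {np.median(t):5.2f}")
# The median of T grows with mu, so extreme values of the statistic
# become more probable the further the alternative lies from the null.
```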
The second flaw is that the third and fourth steps of the syllogism should read something like “Under the null distribution, a statistic value as or more extreme than ours is extremely unlikely”. Being able to say this is the point of the orderings discussed in the previous paragraph. Without the orderings, you’re left talking about unlikely samples, which, as gjm pointed out, is not enough on its own to make the move from 4 to 5 even roughly truth-preserving. For example, that move would authorize the rejection of the null hypothesis “no miracle occurred” on these data.
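Continuing the hypothetical normal example above, here is a quick sketch of the corrected steps: the relevant probability is that of the tail event “as or more extreme”, not of the exact sample observed.

```python
from scipy.stats import norm

# Continuing the hypothetical example: under H0 the statistic
# T = sqrt(n) * mean(X) is standard normal, so "a statistic value as or
# more extreme than ours" has tail probability P(T >= t_obs).
t_obs = 2.4                    # assumed realized value of the statistic
p_value = norm.sf(t_obs)       # survival function, i.e., 1 - CDF
print(f"P(T >= {t_obs}) = {p_value:.4f}")  # about 0.0082
# Note this is the probability of the tail *event*, not of the observed
# sample itself; any particular sample from a continuous distribution is
# "unlikely" (indeed has probability zero), which by itself licenses
# nothing.
```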
As to the actual reasoning underlying the hypothesis testing procedure, it’s helpful to think about the kinds of tests students are given in school. An idealized (i.e., impractical) test would deeply probe a student’s understanding of the course material, such that a passing grade would be (in logical terms) both a necessary and a sufficient signifier of an adequate understanding of the material. In practice, it’s only feasible to test a patchwork subset of the course material, which introduces an element of chance. A student whose understanding is just barely inadequate (by some arbitrary standard) might get lucky and be tested mostly on material she understands; and vice versa. The further the student’s understanding lies from the threshold of bare adequacy, the less likely the test is to pass or fail in error.
In a closely analogous fashion, a hypothesis test is a probe for a certain kind of inadequacy in the statistical model. The statistic is the equivalent of the grade, and the threshold of statistical significance is the equivalent of the standard of bare adequacy. And just as the standard of bare adequacy in the above metaphor is notional and arbitrary, the threshold of the hypothesis test need not be set in advance: with the realized value of the statistic in hand, one can consider the entire class of hypothesis tests ex post facto. The p-value is one way of capturing this kind of information. For more on this line of reasoning, see the work of Deborah Mayo.
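As a rough sketch of that ex post facto view (again my own illustration, still using the assumed normal example): each significance level alpha fixes a critical value, and the p-value is the smallest alpha at which the realized statistic would have been rejected.

```python
from scipy.stats import norm

# With t_obs in hand, ask for each level alpha whether the test with
# critical value c = norm.isf(alpha) would have rejected. The p-value
# is the smallest such alpha.
t_obs = 2.4
for alpha in (0.10, 0.05, 0.01, 0.005):
    c = norm.isf(alpha)        # critical value of the level-alpha test
    print(f"alpha = {alpha:5.3f}: c = {c:4.2f}, reject: {t_obs > c}")
print(f"p-value (smallest rejecting alpha) = {norm.sf(t_obs):.4f}")
```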
Thanks for this comment. I was attempting to abstract away from the specific details of NHST and talk about the general idea, since in many particulars there is much to criticize, but it appears that I abstracted too much: the ordering of the hypothesis space (i.e., a monotone likelihood ratio, as in Neyman-Pearson) is definitely necessary.
In a closely analogous fashion, a hypothesis test is a probe for a certain kind of inadequacy in the statistical model. The statistic is the equivalent of the grade, and the threshold of statistical significance is the equivalent of the standard of bare adequacy.
This seems to back up my claim that we can still view NHST as a sort of induction without a detailed theory of induction (though the reasons for and nature of this “thin” induction must be different from what I was thinking about). Do you agree?
I agree that the quote seems to back up the claim, but I don’t agree with the claim. Like all frequentist procedures, NHST does have a detailed theory of induction founded on the notion that one can use just the (model’s) sampling probability of a realized event to generate well-warranted claims about some hypothesis/hypotheses. (Again, see the work of Deborah Mayo.)