I think you’re interpreting the F-test a little more strictly than you should. Isn’t it fairer to say that a null result on an F-test means “It is not the case that for most x, P(x)”, with “most” defined in a particular way?
You’re correct that an F-test is miserable at separating out different classes of responders. (In fact, it should be easy to develop a test that does separate out different classes of responders; I’ll have to think about that. Maybe just fit a GMM with three modes in a way that tries to maximize the distance between the modes?)
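Roughly what I have in mind is the sketch below, using scikit-learn’s GaussianMixture on invented behavior-change scores. To be clear, this is just a plain EM fit with three components, not anything that explicitly pushes the modes apart, and the data, component count, and group sizes are assumptions made up for illustration:

```python
# Sketch: fit a three-component Gaussian mixture to hypothetical
# behavior-change scores and see whether distinct responder classes
# fall out. All numbers below are invented for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
scores = np.concatenate([
    rng.normal(0.0, 1.0, 80),   # non-responders, centered on no change
    rng.normal(2.0, 0.5, 15),   # hypothetical mild responders
    rng.normal(5.0, 0.5, 5),    # hypothetical strong responders
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(scores)
print("component means:  ", gmm.means_.ravel())
print("component weights:", gmm.weights_)
```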
But I think the detail that you suppressed for brevity also makes a significant difference in how the results are interpreted. This paper doesn’t make the mistake of saying “artificial food coloring does not cause hyperactivity in every child, therefore artificial food coloring affects no children.” The paper says “artificial food coloring does not cause hyperactivity in every child whose parents confidently expect them to respond negatively to artificial food coloring, therefore their parents’ expectation is mistaken at the 95% confidence level.”
Now, it could be the case that there are children who do respond negatively to artificial food coloring, but the Feingold Association is terrible at finding them / screening out the children for whom it has no effect. (This is unsurprising from a Hawthorne effect or confirmation-bias perspective.) As well, for small sample sizes, it seems better to use F and t tests than to try to separate out the various classes of responders, because the class sizes will be tiny; if one child responds poorly after being administered artificial food dye, that’s not much to go on, compared to a distinct subpopulation of 20 children in a sample of 1000.
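To put a toy number on that last point (every figure here is a simulation assumption, not anything from the paper): a single strong reaction among ten children barely moves a one-sample t-test on the change scores, while a planted subpopulation of 20 responders out of 1000 moves it far more, though the exact p-values vary from run to run.

```python
# Toy illustration of the sample-size point. "Change scores" are
# simulated: most children show no effect; the responders are planted.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 10 children, one of whom happens to react strongly.
small = np.append(rng.normal(0.0, 1.0, 9), 4.0)

# 1000 children, 20 of whom form a genuine responder subpopulation.
large = np.concatenate([rng.normal(0.0, 1.0, 980),
                        rng.normal(5.0, 0.5, 20)])

# One-sample t-tests against "no change on average".
print("n=10:  ", stats.ttest_1samp(small, 0.0))
print("n=1000:", stats.ttest_1samp(large, 0.0))
```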
The section of the paper where they describe their reference class:
If artificial additives affect only a small proportion of hyperactive children, significant dietary effects are unlikely to be detected in heterogeneous samples of hyperactive children. Therefore, children who had been placed on the Feingold diet by their parents and who were reported by their parents to have derived marked behavioral benefit from the diet and to experience marked deterioration when given artificial food colorings were targeted for this study. This sampling approach, combined with high dosage, was chosen to maximize the likelihood of observing behavioral deterioration with ingestion of artificial colorings.
(I should add that the first sentence is especially worth contemplating, here.)
I think I disagree with both of you here. The failure to reject a null hypothesis is a failure. It doesn’t allow or even encourage you to conclude anything.
Can you conclude that you failed to reject the null hypothesis? And if you expected to reject the null hypothesis, isn’t that failure meaningful? (Note that my language carefully included the confidence value.)
As a general comment, this is why Bayesian statistics is much more amenable to knowledge-generation than frequentist statistics. The statement “the hyperactivity increase in the experimental group was 0.36+/-2.00, and that range solidly includes 0” (with the variance of that estimate pulled out of thin air) is much more meaningful than “we can’t be sure it’s not zero.”
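For concreteness, here’s the contrast on made-up numbers echoing the ones above; the sample size of 30, the simulated normal data, and the plain t interval are all assumptions for illustration:

```python
# Same invented data, two summaries: a bare hypothesis test versus an
# estimate with an interval around it. Numbers are fabricated to echo
# the 0.36 +/- 2.00 example above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
increase = rng.normal(0.36, 2.0, 30)   # hypothetical per-child hyperactivity changes

# Hypothesis-test summary: "we can't be sure it's not zero."
t, p = stats.ttest_1samp(increase, 0.0)
print(f"t = {t:.2f}, p = {p:.2f}")

# Estimation summary: the effect size with an interval around it.
mean, sem = increase.mean(), stats.sem(increase)
lo, hi = stats.t.interval(0.95, df=len(increase) - 1, loc=mean, scale=sem)
print(f"estimated increase {mean:.2f}, 95% interval ({lo:.2f}, {hi:.2f})")
```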
I agree with the second sentence, and the first might be true, but the second isn’t evidence for the first; interval estimation vs. hypothesis testing is an issue independent of Bayesianism vs. frequentism. There are Bayesian hypothesis tests and frequentist interval estimates.
Agreed that both have those tools, and rereading my comment I think “approach” may have been a more precise word than “statistics.” If you think in terms of “my results are certain, reality is uncertain” then the first tool you reach for is “let’s make an interval estimate / put a distribution on reality,” whereas if you think in terms of “reality is certain, my results are uncertain” then the first tool you reach for is hypothesis testing. Such defaults have very important effects on what actually gets used in studies.
And if you expected to reject the null hypothesis, isn’t that failure meaningful?
To me, but not to the theoretical foundations of the method employed.
Hypothesis testing generally works sensibly because people smuggle in intuitions that aren’t part of the foundations of the method. But since they’re only smuggling things in under a deficient theoretical framework, they’re given to mistakes, particularly when they’re applying their intuitions to the theoretical framework and not the base data.
I agree with the later comment on Bayesian statistics, and I’d go further. Scatterplot the labeled data, or show the distribution if you have tons of data. That’s generally much more productive than any particular confidence interval you might construct.
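Something as simple as the sketch below is all I mean; the groups, labels, and effect size are invented, and matplotlib is just one way to draw it:

```python
# Sketch: plot the labeled data directly instead of (or before) testing.
# Groups and effect size are invented for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
placebo = rng.normal(0.0, 1.0, 40)
coloring = rng.normal(0.4, 1.0, 40)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))

# Scatterplot of the labeled observations.
ax1.scatter(np.zeros_like(placebo), placebo, alpha=0.6)
ax1.scatter(np.ones_like(coloring), coloring, alpha=0.6)
ax1.set_xticks([0, 1])
ax1.set_xticklabels(["placebo", "coloring"])
ax1.set_ylabel("behavior change")

# With lots of data, show the distributions instead.
ax2.hist([placebo, coloring], bins=15, label=["placebo", "coloring"])
ax2.legend()

plt.tight_layout()
plt.show()
```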
It would be an interesting study to compare the various statistical tests on the same hypothesis versus the human eyeball. I think the eyeball will hold its own.