Both the t-test and the F-test work by assuming that every subject has the same response function to the intervention:
response = effect + normally distributed error
where the effect is the same for every subject.
The F test and the t test don’t quite say that. They make statements about population averages. More specifically, if you’re comparing the means of two groups, the t or F test asks whether the average response of one group is the same as that of the other group. Heterogeneity just gets captured by the error term. In fact, econometricians define the error term as the difference between the true response and what their model says the mean response is (usually conditional on covariates).
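To spell out the "captured by the error term" step, here is a short derivation. The notation is mine, not the paper's: d_i is 1 if subject i is treated and 0 otherwise, \tau_i is that subject's own effect, and \bar{\tau} is the population average effect. It assumes treatment is assigned independently of \tau_i and \varepsilon_i.

```latex
\begin{align*}
y_i &= \mu + \tau_i d_i + \varepsilon_i \\
    &= \mu + \bar{\tau}\, d_i
       + \underbrace{(\tau_i - \bar{\tau})\, d_i + \varepsilon_i}_{u_i},
\qquad E[u_i \mid d_i] = 0 .
\end{align*}
```

The test on the difference in group means is then a test of \bar{\tau} = 0. The individual deviations \tau_i - \bar{\tau} sit inside u_i, where they inflate the variance of the treated group rather than bias the comparison of means.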
The fact that the authors ignored potential heterogeneity in responses IS a problem for their analysis, but their result is still evidence against heterogeneous responses. If there really are heterogeneous responses we should see that show up in the population average unless:
1. The positive and negative effects cancel each other out exactly once you average across the population (this seems very unlikely).
2. The population average effect size is nonzero but very small, possibly because the effect only occurs in a small subset of the population (even if it’s large when it does occur) or something similar but more complicated. In this case, a large enough sample size would still detect the effect.
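A rough simulation of the two exceptions, as a sketch rather than anything from the paper. Every number here is made up for illustration: the effect sizes, the 5% responder share, the unit-variance noise, and the sample sizes. The helper names `welch_t` and `simulate` are mine.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's two-sample t statistic for the difference in means (a minus b)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def simulate(n, effect_for, rng):
    """Draw n control and n treated responses; effect_for(i) is subject i's own effect."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [effect_for(i) + rng.gauss(0.0, 1.0) for i in range(n)]
    return welch_t(treated, control)


def cancel(i):
    # Exception 1 (hypothetical): half the subjects respond +1, half respond -1.
    return 1.0 if i % 2 == 0 else -1.0


def subset(i):
    # Exception 2 (hypothetical): 5% of subjects respond +2, the rest not at all,
    # so the population average effect is 0.1.
    return 2.0 if i % 20 == 0 else 0.0


rng = random.Random(0)
t_cancel = simulate(20000, cancel, rng)        # average effect is exactly zero
t_subset_small = simulate(100, subset, rng)    # small sample, tiny average effect
t_subset_large = simulate(20000, subset, rng)  # same effect, much larger sample

print(f"exact cancellation, n=20000: t = {t_cancel:.2f}")
print(f"5% responders, n=100:        t = {t_subset_small:.2f}")
print(f"5% responders, n=20000:      t = {t_subset_large:.2f}")
```

Under exact cancellation the t statistic stays near zero no matter how large the sample is, while the small-subset effect is easy to miss at n=100 and becomes hard to miss once n is large.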
Now, it might not be very strong evidence; that depends on sample size and the likely nature of the heterogeneity (or confounders, as Cyan mentions). And in general there is merit in your criticism of their conclusions. But I think you’ve unfairly characterized the methods they used.
The fact that the authors ignored potential heterogeneity in responses IS a problem for their analysis, but their result is still evidence against heterogeneous responses.
Why do you say that? Did you look at the data?
They found F values of 0.77, 2.161, and 1.103. That means they found different behavior in the two groups. But those F-values were lower than the thresholds they had computed assuming homogeneity. They therefore said “We have rejected the hypothesis”, and claimed that the evidence, which might support that hypothesis when interpreted in a Bayesian framework, refuted it.
I didn’t look at the data. I was commenting on your assessment of what they did, which showed that you didn’t know how the F test works. Your post made it seem as if all they did was run an F test that compared the average response of the control and treatment groups and found no difference.
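For calibration, here is a sketch of what an F test comparing two group means computes, and what values it tends to produce when both groups come from the same distribution. This is not the study's analysis: the group size of 30 and the unit-variance normal responses are hypothetical, the study's degrees of freedom are not given in this thread, and `f_statistic` is my own helper.

```python
import random
import statistics


def f_statistic(a, b):
    """One-way ANOVA F statistic for two groups: between / within mean squares."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    grand = (sum(a) + sum(b)) / (n_a + n_b)
    # Between-group sum of squares; one degree of freedom with two groups.
    ss_between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2
    # Within-group sum of squares; n_a + n_b - 2 degrees of freedom.
    ss_within = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    return ss_between / (ss_within / (n_a + n_b - 2))


rng = random.Random(1)
n = 30  # hypothetical group size, not taken from the study
null_fs = []
for _ in range(2000):
    # Both groups are drawn from the same distribution: no effect of any kind.
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    null_fs.append(f_statistic(a, b))

mean_f = statistics.fmean(null_fs)
share_above_one = sum(f > 1.0 for f in null_fs) / len(null_fs)
print(f"mean F with no effect: {mean_f:.2f}")
print(f"share of F values above 1: {share_above_one:.2f}")
```

In this setup, F values around 1 turn up routinely even though the two groups are identical by construction, which is why the statistic is compared against a threshold rather than against zero.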