I was reading some scientific papers and I encountered what looks like fallacious reasoning, but I’m not quite sure what’s wrong with it (if anything). It goes like this:
Alice formulates hypothesis H and publishes an experiment that moderately supports H (p < 0.05 but > 0.01).
Bob does a similar experiment that contradicts H.
People look at the differences in Alice’s and Bob’s studies and formulate a new hypothesis H’: “H is true under certain conditions (as in Alice’s experiment), and false under other conditions (as in Bob’s experiment)”. They look at the two studies and conclude that H’ is probably true because it’s supported by both studies.
This sounds fishy to me (something like post hoc reasoning) but I’m not quite sure how to explain why and I’m not even sure I’m correct.
Yes, it’s definitely fishy.
It’s using the experimental evidence to privilege H’ (a strictly more complex hypothesis than H), and then using the same experimental evidence to support H’. That’s double-counting.
The more possibly relevant differences between the experiments, the worse this is. There are usually a lot of potentially relevant differences, which causes exponential explosion in the hypothesis space from which H’ is privileged.
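To put rough numbers on that explosion, here is a minimal sketch. It assumes (my simplification, not something stated in the studies) that each difference between the two experiments is a binary feature and that a candidate H’ blames the disagreement on some non-empty subset of those features. The function name is made up for illustration.

```python
def count_candidate_explanations(num_differences: int) -> int:
    """Count the non-empty subsets of differences that a post hoc H' could blame.

    Assumes each difference is a binary feature (present in one study, absent
    in the other); this is an illustrative simplification.
    """
    return 2 ** num_differences - 1


# Each extra difference roughly doubles the number of candidate H' hypotheses.
for k in (1, 5, 10, 20):
    print(k, count_candidate_explanations(k))
```

If the prior mass available for "some H’ of this form" were spread evenly over these candidates, each individual H’ would start with a share that shrinks roughly by half for every additional difference.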
What’s worse, Alice’s experiment gave only weak evidence for H against some non-H hypotheses. Since you mention p-value, I expect that it’s only comparing against one other hypothesis. That would make it weak evidence for H even if p < 0.0001 - but it couldn’t even manage that.
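One way to get a feel for how weak a p-value in that range is: the sketch below uses the Vovk-Sellke bound, 1 / (-e · p · ln p), which is an upper limit on the Bayes factor against the null under fairly general assumptions. This is a technique I am swapping in for illustration; it is not something the papers in question used, and the function name is hypothetical.

```python
import math


def max_bayes_factor_against_null(p_value: float) -> float:
    """Vovk-Sellke upper bound on the Bayes factor against the null.

    The bound 1 / (-e * p * ln p) is only valid for 0 < p < 1/e.
    """
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("bound only applies for 0 < p < 1/e")
    return 1.0 / (-math.e * p_value * math.log(p_value))


# The two ends of Alice's reported range, 0.01 < p < 0.05.
for p in (0.05, 0.01):
    print(p, round(max_bayes_factor_against_null(p), 2))
```

Under this bound, a result with p between 0.01 and 0.05 can shift the odds by a factor of somewhere between about 2.5 and about 8 at the very most, and typically less.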
Are there no other hypotheses of comparable or lesser complexity than H’ matching the evidence as well or better? Did those formulating H’ even think for five minutes about whether there were or not?
It sounds to me like a problem of not reasoning according to Occam’s razor and “overfitting” a model to the available data.
Ceteris paribus, H’ isn’t more “fishy” than any other hypothesis, but H’ is a significantly more complex hypothesis than H or ¬H: instead of asserting H or ¬H, it asserts (A=>H) & (B=>¬H), so it should have been commensurately de-weighted in the prior distribution according to its complexity. The fact that Alice’s study supports H and Bob’s contradicts it does, in fact, increase the weight given to H’ in the posterior relative to its weight in the prior; it’s just that H’ is prima facie less likely, according to Occam.
Given all the evidence, the ratio of posteriors is P(H’|E)/P(H|E) = P(E|H’)P(H’)/(P(E|H)P(H)). We know P(E|H’) > P(E|H) (and P(E|H’) > P(E|¬H)), since the results of Alice’s and Bob’s studies together are more likely given H’, but P(H’) < P(H) (and P(H’) < P(¬H)) according to the complexity prior. Whether H’ is more likely than H (or ¬H, respectively) ultimately depends on whether the likelihood ratio P(E|H’)/P(E|H) (or P(E|H’)/P(E|¬H)) is larger or smaller than the inverse prior ratio P(H)/P(H’) (or P(¬H)/P(H’)).
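The comparison can be made concrete with a small sketch. All the numbers below are invented purely for illustration; nothing here is estimated from Alice’s or Bob’s data, and the function names are hypothetical.

```python
def posterior_odds(likelihood_ratio: float, prior_odds: float) -> float:
    """Posterior odds P(H'|E)/P(H|E).

    likelihood_ratio is P(E|H')/P(E|H); prior_odds is P(H')/P(H).
    """
    return likelihood_ratio * prior_odds


def favours_h_prime(likelihood_ratio: float, prior_odds: float) -> bool:
    """True when H' ends up more probable than H after seeing E."""
    return posterior_odds(likelihood_ratio, prior_odds) > 1.0


# Hypothetical inputs: the two studies are 20 times more likely under H'
# than under H, but the complexity prior makes H' 50 times less likely a priori.
likelihood_ratio = 20.0
prior_odds = 1.0 / 50.0

print(posterior_odds(likelihood_ratio, prior_odds))
print(favours_h_prime(likelihood_ratio, prior_odds))
```

With these made-up inputs the posterior odds stay below 1, so H remains more probable than H’ even though H’ fits the evidence better. A smaller complexity penalty or a larger likelihood ratio would flip the conclusion.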
I think it feels fishy because the people formulating H’ added more features (the circumstances of the experiments) to a more complex model in order to account for data they had already observed. In selecting H’ as a hypothesis this way, they seem to be giving it more weight than it deserves under the complexity prior.