FP works well when it is easy to make progress on ideas / questions through armchair reasoning, which you can think of as using the information or evidence you have more efficiently. However, it is often the case that you could get a lot more high-quality evidence that basically settles the question, if you put in many hours of work. As an illustrative example, consider trying to estimate the population of a city via FP, vs. going out and doing a census. In ML, you could say “we should add such-and-such inductive bias to our models so that they learn faster”, and we can debate how much that would help, but if you actually build it in and train the model and see what happens, you just know the answer now.
Hm, you think data soundly beats theory in ML? Why is HARKing a problem then?
HARKing does the right two steps in the wrong order—it first gets the data, and then makes the hypothesis. This is fine for generating hypotheses, but isn’t great for telling whether a hypothesis is true or not, because there are likely many hypotheses that explain the data and it’s not clear why the one you chose should be the right one. It’s much stronger evidence if you first have a hypothesis, and then design a test for it, because then there is only one result out of many possible results that confirms your hypothesis. (This is oversimplified, but captures the broad point.)
I wouldn’t say “data beats theory”, I think theory (in the sense of “some way of predicting which ideas will be good”, not necessarily math) is needed in order to figure out which ideas to bother testing in the first place. But if you are evaluating on “what gives me confidence that <hypothesis> is true”, it’s usually going to be data. Theorems could do it, but it seems pretty rare that there are theorems for actually interesting hypotheses.
I actually think there is an interesting philosophical puzzle around this that has not fully been solved...
If I show you the code I’m going to use to run my experiment, can you be confident in guessing which hypothesis I aim to test?
If yes, then HARKing should be easily detectable. By looking at my code, it should be clear that the hypothesis I am actually testing is not the one that I published.
If no, then the resulting data could be used to prove multiple different hypotheses, and thus doesn’t necessarily constitute stronger evidence for any one of the particular hypotheses it could be used to prove (e.g. the hypothesis I preregistered).
To put it another way, in your first paragraph you say “there are likely many hypotheses that explain the data”, but in the second paragraph, you talk as though there’s a particular set of data such that if we get that data, we know there’s only one hypothesis which it can be used to support! What gives?
My solution to the puzzle: Pre-registration works because it forces researchers to be honest about their prior knowledge. Basically, prior knowledge unencumbered by hindsight bias (“armchair reasoning”) is underrated. Any hypothesis which has only the support of armchair reasoning or data from a single experiment is suspect. You really want both.
In Bayesian terms, you have to look at both the prior and the likelihood. Order shouldn’t matter (multiplication is commutative), but as I said—hindsight bias.
Curious to hear your thoughts.
[There are also cases where given the data, there’s only one plausible hypothesis which could possibly explain it. A well-designed experiment will hopefully produce data like this, but I think it’s a bit orthogonal to the HARKing issue, because we can imagine scenarios where post hoc data analysis suggests there is only one plausible hypothesis for what’s going on… although we should still be suspicious in that case because (presumably) we didn’t have prior beliefs indicating this hypothesis was likely to be true. Note that in both cases we are bottlenecked on the creativity of the experiment designer/data analyst in thinking up alternative hypotheses.]
[BTW, I think “armchair reasoning” might have the same referent as phrases with a more positive connotation: “deconfusion work” or “research distillation”.]
My solution to the puzzle is a bit different (but maybe the same?). Let’s suppose that there’s an experiment we could run that would come out with some result D. Each potential value of D is consistent with N hypotheses. There are 2^N potential hypotheses in total.
Suppose Alice runs the experiment, observes the result D, and then chooses a hypothesis to explain it. This is consistent with Alice having a uniform prior, in which case she has a 1/N chance of having settled on the true hypothesis. (Why not just list all N hypotheses? Because Alice didn’t think of all of them—it’s hard to search the entire space of 2^N hypotheses.)
On the other hand, if Bob chose a hypothesis to test via his priors, ran the experiment, and then D was consistent with that hypothesis… you should infer that Bob’s priors were really good (i.e. not uniform) and the hypothesis is correct. After all, if Bob’s hypothesis was chosen at random, he only had an N/2^N chance of getting a D that was consistent with it.
Put another way: When I see the first scenario, I expect that the evidence gathered from the experiment is primarily serving to locate the hypothesis at all. When I see the second scenario, I expect that Bob has already successfully located the hypothesis before the experiment, and the experiment provides the last little bit of evidence needed to confirm it.
Related: Privileging the hypothesis
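Here is a minimal numerical sketch of that story (the specific numbers, and the idea of modelling Bob as a mixture of an “informed” and an “uninformed” researcher, are assumptions added purely for illustration):

```python
# Hypothesis space: 2^n candidates; any observed result D is consistent with
# exactly n_consistent of them, one of which is the true hypothesis.
n = 20
total = 2 ** n
n_consistent = 100

# Alice picks a hypothesis *after* seeing D, uniformly among the consistent
# ones, so her chance of having picked the true one is the 1/N from above.
alice_posterior = 1 / n_consistent

# Bob pre-registers a hypothesis. Model him as a mixture: with probability
# p_informed his priors are good enough to pick the true hypothesis with
# probability q; otherwise (and in the failure case) he picks uniformly.
p_informed, q = 0.1, 0.5

p_confirm_false = (n_consistent - 1) / (total - 1)   # a false pick confirmed by luck
p_confirm_if_informed = q + (1 - q) * p_confirm_false
p_confirm_if_uninformed = n_consistent / total       # the N/2^N chance from above

# Bayes: P(Bob's hypothesis is true | his prediction was confirmed by D)
p_true_and_confirmed = p_informed * q + (1 - p_informed) / total
p_confirmed = (p_informed * p_confirm_if_informed
               + (1 - p_informed) * p_confirm_if_uninformed)
bob_posterior = p_true_and_confirmed / p_confirmed

print(f"Alice (chose post hoc):           {alice_posterior:.3f}")  # 0.010
print(f"Bob (confirmed pre-registration): {bob_posterior:.3f}")    # ~0.998
# Nearly all of Bob's advantage comes from the confirmation being strong
# evidence that his priors were good in the first place.
```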
If I show you the code I’m going to use to run my experiment, can you be confident in guessing which hypothesis I aim to test?
Under this model, I can’t be confident in guessing which hypothesis you are trying to test.
My solution to the puzzle: Pre-registration works because it forces researchers to be honest about their prior knowledge. Basically, prior knowledge unencumbered by hindsight bias (“armchair reasoning”) is underrated.
It’s possible that Alice herself believes that the hypothesis she settled on was correct, rather than assigning it a 1/N probability. If that were the case, I would say it was due to hindsight bias.
Any hypothesis which has only the support of armchair reasoning or data from a single experiment is suspect.
Yeah, I broadly agree with this.
[BTW, I think “armchair reasoning” might have the same referent as phrases with a more positive connotation: “deconfusion work” or “research distillation”.]
I definitely do not mean research distillation. Deconfusion work feels like a separate thing, which is usually a particular example of armchair reasoning. By armchair reasoning, I mean any sort of reasoning that can be done by just thinking without gathering more data. So for example, solving a thorny algorithms question would involve armchair reasoning.
I don’t mean to include the negative connotations of “armchair reasoning”, but I don’t know another short phrase that means the same thing.
Interesting. I think you’re probably right that our model should have a parameter for “researcher quality”, and if a researcher is able to correctly predict the outcome of an experiment, that should cause an update in the direction of that researcher being more knowledgeable (and their prior judgements should therefore carry more weight—including for this particular experiment!).
But the story you’re telling doesn’t seem entirely compatible with your comment earlier in this thread. Earlier you wrote: “However, it is often the case that you could get a lot more high-quality evidence that basically settles the question, if you put in many hours of work.” But in this recent comment you wrote: “the experiment provides the last little bit of evidence needed to confirm [the hypothesis]”. In the earlier comment, it sounds like you’re talking about a scenario where most of the evidence comes in the form of data; in the later comment, it sounds like you’re talking about a scenario where most of the evidence was necessary “just to think of the correct answer—to promote it to your attention” and the experiment only provides “the last little bit” of evidence.
So I think the philosophical puzzle is still unsolved. A few more things to ponder if someone wants to work on solving it:
If Bob is known to be an excellent researcher, can we trust HARKing if it comes from him? Does the mechanism by which hindsight bias works matter? (Here is one possible mechanism.)
In your simplified model above, there’s no possibility of a result that is “just noise” and not explained by any particular hypothesis. But noise appears to be a pretty big problem (see: the replication crisis). In current scientific practice, the probability that a result at least this extreme could have arisen from noise alone is a number of great interest that’s almost always calculated (the p-value). How should this number be factored in, if at all?
Note that p-values can be used in Bayesian calculations. For example, in a simplified universe where either the null is true or the alternative is true,
p(alternative|data) = p(data|alternative)p(alternative) / (p(data|alternative)p(alternative) + p(data|null)p(null))
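As a quick illustration with made-up numbers (treating a p-value-like quantity as a stand-in for p(data|null), which is itself a simplification):

```python
def posterior_alternative(p_data_given_alt, p_data_given_null, p_alt):
    """p(alternative | data) in the simplified two-hypothesis universe above."""
    p_null = 1 - p_alt
    numerator = p_data_given_alt * p_alt
    return numerator / (numerator + p_data_given_null * p_null)

# A "significant" result (p-value-ish 0.03) against a sceptical prior of 0.05
# still leaves the alternative at only ~0.58...
print(posterior_alternative(p_data_given_alt=0.8, p_data_given_null=0.03, p_alt=0.05))

# ...while the same result on top of substantial prior support (0.5) gives ~0.96.
print(posterior_alternative(p_data_given_alt=0.8, p_data_given_null=0.03, p_alt=0.5))
```

Which is essentially the “you really want both” point from earlier, in numbers.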
My solution was focused on a scenario where we’re considering relatively obvious hypotheses and subject to lots of measurement noise, but you convinced me this is inadequate in general.
I’m unsatisfied with the discussion around “Alice didn’t think of all of them”. I know nothing about relativity, but I imagine a big part of Einstein’s contribution was his discovery of a relatively simple hypothesis which explained all the data available to him. (By “relatively simple”, I mean a hypothesis that didn’t have hundreds of free parameters.) Presumably, Einstein had access to the same data as other contemporary physicists, so it feels weird to explain his contribution in terms of having access to more evidence.
In other words, it feels like the task of searching hypothesis space should be factored out from the task of Bayesian updating. This seems closely related to puzzles around “realizability”—through your search of hypothesis space, you’re essentially “realizing” a particular hypothesis on the fly, which isn’t how Bayesian updating is formally supposed to work. (But it is how deep learning works, for example.)
But the story you’re telling doesn’t seem entirely compatible with your comment earlier in this thread.
The earlier comment was comparing experiments to “armchair reasoning”, while the later comment was comparing experiments to “all prior knowledge”. I think the typical case is:
Amount of evidence in “all prior knowledge” >> Amount of evidence in an experiment >> Amount of evidence from “armchair reasoning”.
If Bob is known to be an excellent researcher, can we trust HARKing if it comes from him?
I would pay a little more attention, but not that much more, and would want an experimental confirmation anyway. It seems to me that the world is complex enough and humans model it badly enough (for the sorts of things academia is looking at) that past evidence of good priors on one question doesn’t imply good priors on a different question.
(This is an empirical belief; I’m not confident in it.)
In your simplified model above, there’s no possibility of a result that is “just noise” and not explained by any particular hypothesis.
I expect that if you made a more complicated model where each hypothesis H had a likelihood p(D∣H), and p(D∣H) was high for N hypotheses and low for the rest, you’d get a similar conclusion, while accounting for results that are just noise.
I know nothing about relativity, but I imagine a big part of Einstein’s contribution was his discovery of a relatively simple hypothesis which explained all the data available to him.
I agree that relativity is an example that doesn’t fit my story, where most of the work was in coming up with the hypothesis. (Though I suspect you could argue that relativity shouldn’t have been believed before experimental confirmation.) I claim that it is the exception, not the rule.
Also, I do think it is often a valuable contribution to even think of a plausible hypothesis that fits the data, even if you should assign it a relatively low probability of being true. I’m just saying that if you want to reach the truth, this work must be supplemented by experiments / gathering good data.
In other words, it feels like the task of searching hypothesis space should be factored out from the task of Bayesian updating.
Bayesian updating does not work well when you don’t have the full hypothesis space. Given that you know that you don’t have the full hypothesis space, you should not be trying to approximate Bayesian updating over the hypothesis space you do have.
Bayesian updating does not work well when you don’t have the full hypothesis space.
Do you have any links related to this? Technically speaking, the right hypothesis is almost never in our hypothesis space (“All models are wrong, but some are useful”). But even if there’s no “useful” model in your hypothesis space, it seems Bayesian updating fails gracefully if you have a reasonably wide prior distribution for your noise parameters as well (then the model fitting process will conclude that the value of your noise parameter must be high).
No, I haven’t read much about Bayesian updating. But I can give an example.
Consider the following game. I choose a coin. Then, we play N rounds. In each round, you bet on whether the coin will come up Heads or Tails, at 1:2 odds which I must take (i.e. if you’re right I give you $2 and if I’m right you give me $1). Then I flip the coin and the bet resolves.
If your hypothesis space is “the coin has some bias b of coming up Heads or Tails”, then you will eagerly accept this game for large enough N—you will quickly learn the bias b from experiments, and then you can keep getting money in expectation.
However, if it turns out I am capable of making the coin come up Heads or Tails as I choose, then I will win every round. If you keep doing Bayesian updating on your misspecified hypothesis space, you’ll keep flip-flopping on whether the bias is towards Heads or Tails, and you will quickly converge to near-certainty that the bias is 50% (since the pattern will be HTHTHTHT...), and yet I will be taking a dollar from you every round. Even if you have the option of quitting, you will never exercise it because you keep thinking that the EV of the next round is positive.
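Here is a quick simulation of that game, under two assumptions the description leaves implicit: the adversary always makes the coin land opposite to the bet, and the bettor bets on whichever side a Beta(1, 1) posterior currently favours.

```python
n_rounds = 1000
heads = tails = 0      # observed counts; uniform Beta(1, 1) prior over the bias b
money = 0.0

for _ in range(n_rounds):
    p_heads = (heads + 1) / (heads + tails + 2)   # posterior mean of P(Heads)
    bet_heads = p_heads >= 0.5                    # bet on the currently-favoured side
    p_win = p_heads if bet_heads else 1 - p_heads
    subjective_ev = 3 * p_win - 1                 # +$2 with prob p_win, else -$1

    outcome_heads = not bet_heads                 # the adversary shows the other side
    money += 2 if outcome_heads == bet_heads else -1
    heads += int(outcome_heads)
    tails += int(not outcome_heads)

print(f"posterior P(Heads): {(heads + 1) / (heads + tails + 2):.3f}")  # ~0.500
print(f"subjective EV of the next bet: {subjective_ev:+.2f}")          # ~+0.50
print(f"actual winnings: {money:+.0f}")                                # -1000
# The bettor converges to "fair coin", keeps expecting to profit on the next
# round, and loses a dollar every single round.
```

(With these 1:2 payoffs, the subjective EV of betting on a side you assign probability p is 3p - 1, which is positive whenever p > 1/3, so the option to quit never looks attractive.)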
Noise parameters can help (though the bias b is kind of like a noise parameter here, and it didn’t help). I don’t know of a general way to use noise parameters to avoid issues like this.
Thanks for the example!