I broadly agree, but there are good reasons for more traditional research as well:
In many research areas, ideas are common, and it isn’t clear which ideas are most important. The most useful contributions come from someone taking an idea and demonstrating that it is viable and important, which often requires a lot of solitary work that can’t be done in the typical amount of time it takes to write a comment or post.
FP often leads to long, winding discussions that may end with two researchers agreeing, but the resulting transcript is not great for future readers. In contrast, traditional research produces more crisp distillations of an idea that are useful for communicating with an entire field. (I anticipate people saying that academic papers are incomprehensible. They are incomprehensible to outsiders, but often are relatively easy to read for people in the field. I typically find academic papers in my area to be significantly better at telling me what I want to know than blog posts, though the ideal combination is to first read the blog post and then follow it up with the paper.)
Nevertheless, I do think that FP is a better strategy for intellectual progress in AI alignment, which at least currently feels more ideas-bottlenecked than paper-bottlenecked.
In many research areas, ideas are common, and it isn’t clear which ideas are most important. The most useful contributions come from someone taking an idea and demonstrating that it is viable and important, which often requires a lot of solitary work that can’t be done in the typical amount of time it takes to write a comment or post.
Agreed. My recommendations aren’t meant to be universally applicable. (ETA: Also, one could marginally increase one’s forum participation in order to capture some of the benefits, and not necessarily go all the way to adopting it as one’s primary research strategy.)
FP often leads to long, winding discussions that may end with two researchers agreeing, but the resulting transcript is not great for future readers.
There’s nothing that explicitly prevents people from distilling such discussions into subsequent posts or papers. If people aren’t doing that, or are doing that less than they should, that could potentially be solved as a problem that’s separate from “should more people be doing FP or traditional research?”
Also, it’s not clear to me that traditional research produces more clear distillations of how disagreements get resolved. It seems like most such discussions don’t make for publishable papers and therefore most disagreements between “traditional researchers” just don’t get resolved in a way that leaves a public record (or at all).
There’s nothing that explicitly prevents people from distilling such discussions into subsequent posts or papers. If people aren’t doing that, or are doing that less than they should, that could potentially be solved as a problem that’s separate from “should more people be doing FP or traditional research?”
FYI this is something the LW team thinks about a bunch and I expect us to have made some serious effort towards incentivizing and simplifying this process in the coming year.
This is actually a major motivation for the wiki/tagging system we are building. Also, you might have noticed all the edited transcripts we’ve been publishing, and the debates we’ve started organizing, which are also part of this. I’ve experimented a lot over the last year with UI for directly distilling comment threads, but all of my attempts ended up too clunky and messy to ever make me excited about them. I still have some things I might want to give a shot, but overall I am currently thinking of tackling this problem in a slightly more indirect way.
Hm, I perceived Raemon to be referring more specifically to turning forum discussions into posts, or otherwise tidying them up. I think that’s importantly different from transcribing a talk (since a talk isn’t a discussion) or a debate (since you only have a short period of time to think about your response to the other person). I guess it’s possible that the tagging system helps with this, but it’s not obvious to me how it would. That being said, I do agree that more broadly LW has moved towards more synthesis and intertemporal discussions.
I’d add “The LessWrong 2018 Review” to the list of things that are “sort of exploring the same direction”. I agree my particular prediction about mechanical tools for distilling comments didn’t materialize, but we did definitely allocate tons of effort towards distillation as a whole.
Yeah, and I experimented a bunch with that (directly turning forum discussions into posts) and mostly felt like it didn’t really work that well. I mostly updated that there needs to be a larger synthesis step, though I still have some guesses for more direct things that could work. Ben spent some hours distilling the discussion and comments on a bunch of posts, which we should get around to posting (I just realized we never published them).
Re tagging: In general the tagging system that we are building has a lot in common with being a wiki (collaboratively editable descriptions, providing canonical definitions and references, and providing good summaries of existing content), and I expect it to grow into being more of a wiki over time (the tagging use-case was a specific narrow use-case that seemed easy to get traction on, but the mid-term goal is to do a lot more wiki-like stuff). And I think from that perspective it’s more clear how it helps with distillation.
There’s nothing that explicitly prevents people from distilling such discussions into subsequent posts or papers. If people aren’t doing that, or are doing that less than they should, that could potentially be solved as a problem that’s separate from “should more people be doing FP or traditional research?”
Doing these types of summaries feels like a good place to start if you are new to doing FP. It is a fairly straightforward task, but it provides a lot of value, and it helps you grow skills and reputation that will help you when you do more independent work later.
It might be useful for more experienced researchers/posters to explicitly point out when they are leaving this kind of value on the table. (“This was an interesting conversation, it contains a few valuable insights, and if I didn’t have more pressing things to work on, I would have liked to distill it to make it more clear. If someone feels like doing that, I will happily comment on the draft and signal boost the post.”)
There’s nothing that explicitly prevents people from distilling such discussions into subsequent posts or papers. If people aren’t doing that, or are doing that less than they should, that could potentially be solved as a problem that’s separate from “should more people be doing FP or traditional research?”
Agreed. I’m mostly saying that empirically people don’t do that, but yes there could be other solutions to the problem, it need not be inherent to FP.
Also, it’s not clear to me that traditional research produces more clear distillations of how disagreements get resolved.
I agree you don’t see how the disagreement gets resolved, but you usually can see the answer to the question that prompted the disagreement, because the resolution itself can be turned into a paper. This is assuming that the resolution came via new evidence. I agree that if a disagreement is resolved via simply talking through the arguments, then it doesn’t turn into a paper, but this seems pretty rare (at least in CS).
In many research areas, ideas are common, and it isn’t clear which ideas are most important. The most useful contributions come from someone taking an idea and demonstrating that it is viable and important, which often requires a lot of solitary work that can’t be done in the typical amount of time it takes to write a comment or post.
Interesting. Definitely not an expert here, but I could imagine FP being a good tool in this case… if the forum is an efficient “marketplace of ideas”, where perspectives compete and poke holes in each other and adapt to critics, and the strongest perspectives emerge victorious, then this seems like it could be a good way to figure out which ideas are the best? Some say AI alignment is like software security, and there’s that saying “with enough eyes all bugs are shallow”. If security flaws tend to be a result of software designers relying on faulty abstractions or otherwise falling prey to blind spots, then I would expect that withstanding a bunch of critics, each critic using their own set of abstractions, is a stronger indicator of quality than anything one person is able to do in solitude.
(It’s possible that you’re using “important” in a way that’s different than how I used it in the preceding paragraph.)
FP works well when it is easy to make progress on ideas / questions through armchair reasoning, which you can think of as using the information or evidence you have more efficiently. However, it is often the case that you could get a lot more high-quality evidence that basically settles the question, if you put in many hours of work. As an illustrative example, consider trying to estimate the population of a city via FP, vs. going out and doing a census. In ML, you could say “we should add such-and-such inductive bias to our models so that they learn faster”, and we can debate how much that would help, but if you actually build it in and train the model and see what happens, you just know the answer now.
HARKing does the right two steps in the wrong order—it first gets the data, and then makes the hypothesis. This is fine for generating hypotheses, but isn’t great for telling whether a hypothesis is true or not, because there are likely many hypotheses that explain the data and it’s not clear why the one you chose should be the right one. It’s much stronger evidence if you first have a hypothesis, and then design a test for it, because then there is only one result out of many possible results that confirms your hypothesis. (This is oversimplified, but captures the broad point.)
I wouldn’t say “data beats theory”, I think theory (in the sense of “some way of predicting which ideas will be good”, not necessarily math) is needed in order to figure out which ideas to bother testing in the first place. But if you are evaluating on “what gives me confidence that <hypothesis> is true”, it’s usually going to be data. Theorems could do it, but it seems pretty rare that there are theorems for actually interesting hypotheses.
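The point above about many hypotheses explaining the same data can be made concrete with a small simulation (an illustrative sketch, not anyone’s actual methodology): testing twenty post-hoc hypotheses on pure noise “finds” a significant-looking result far more often than a single preregistered test does.

```python
import random
import statistics

random.seed(0)

def found_significant(n_hypotheses, n_samples=30, threshold=2.0):
    """Test n_hypotheses on pure noise; True if any t-like score looks 'significant'."""
    for _ in range(n_hypotheses):
        data = [random.gauss(0, 1) for _ in range(n_samples)]
        t = statistics.mean(data) / (statistics.stdev(data) / n_samples ** 0.5)
        if abs(t) > threshold:
            return True
    return False

trials = 2000
harked = sum(found_significant(20) for _ in range(trials)) / trials  # best of 20 post hoc
prereg = sum(found_significant(1) for _ in range(trials)) / trials   # one preregistered test
print(f"false positives, 20 post-hoc hypotheses: {harked:.2f}")  # roughly 1 - (1-p)^20
print(f"false positives, 1 preregistered test:   {prereg:.2f}")  # roughly p alone
```

Since the data here is pure noise, every “significant” result is a false positive; the HARKed procedure finds one most of the time, while the single preregistered test rarely does.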
I actually think there is an interesting philosophical puzzle around this that has not fully been solved...
If I show you the code I’m going to use to run my experiment, can you be confident in guessing which hypothesis I aim to test?
If yes, then HARKing should be easily detectable. By looking at my code, it should be clear that the hypothesis I am actually testing is not the one that I published.
If no, then the resulting data could be used to prove multiple different hypotheses, and thus doesn’t necessarily constitute stronger evidence for any one of the particular hypotheses it could be used to prove (e.g. the hypothesis I preregistered).
To put it another way, in your first paragraph you say “there are likely many hypotheses that explain the data”, but in the second paragraph, you talk as though there’s a particular set of data such that if we get that data, we know there’s only one hypothesis which it can be used to support! What gives?
My solution to the puzzle: Pre-registration works because it forces researchers to be honest about their prior knowledge. Basically, prior knowledge unencumbered by hindsight bias (“armchair reasoning”) is underrated. Any hypothesis which has only the support of armchair reasoning or data from a single experiment is suspect. You really want both.
In Bayesian terms, you have to look at both the prior and the likelihood. Order shouldn’t matter (multiplication is commutative), but as I said—hindsight bias.
Curious to hear your thoughts.
[There are also cases where given the data, there’s only one plausible hypothesis which could possibly explain it. A well-designed experiment will hopefully produce data like this, but I think it’s a bit orthogonal to the HARKing issue, because we can imagine scenarios where post hoc data analysis suggests there is only one plausible hypothesis for what’s going on… although we should still be suspicious in that case because (presumably) we didn’t have prior beliefs indicating this hypothesis was likely to be true. Note that in both cases we are bottlenecked on the creativity of the experiment designer/data analyst in thinking up alternative hypotheses.]
[BTW, I think “armchair reasoning” might have the same referent as phrases with a more positive connotation: “deconfusion work” or “research distillation”.]
My solution to the puzzle is a bit different (but maybe the same?). Let’s suppose that there’s an experiment we could run that would come out with some result D. Each potential value of D is consistent with N hypotheses, and there are 2N potential hypotheses in total.
Suppose Alice runs the experiment, observes D, and then chooses a hypothesis to explain it. This is consistent with Alice having a uniform prior, in which case she has a 1/N chance of having settled on the true hypothesis. (Why not just list all N hypotheses? Because Alice didn’t think of all of them—it’s hard to search the entire space of 2N hypotheses.)
On the other hand, if Bob chose a hypothesis to test via his priors, ran the experiment, and then D was consistent with that hypothesis… you should infer that Bob’s priors were really good (i.e. not uniform) and the hypothesis is correct. After all, if Bob’s hypothesis was chosen at random, he only had an N/2N = 1/2 chance of getting a D that was consistent with it.
Put another way: When I see the first scenario, I expect that the evidence gathered from the experiment is primarily serving to locate the hypothesis at all. When I see the second scenario, I expect that Bob has already successfully located the hypothesis before the experiment, and the experiment provides the last little bit of evidence needed to confirm it.
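The toy model can be run with numbers (the prior that Bob is informed, and how often an informed Bob picks the true hypothesis, are made-up parameters for illustration):

```python
N = 1000  # each value of D is consistent with N of the 2N hypotheses

# Alice: runs the experiment first, then picks one of the N hypotheses
# consistent with the observed D. Under a uniform prior her pick is true
# with probability 1/N.
alice_posterior = 1 / N

# Bob: either "informed" (picks the true hypothesis with probability q)
# or "guessing" (picks uniformly over all 2N hypotheses). A guesser's pick
# is consistent with the observed D with probability N/2N = 1/2.
p_informed = 0.1  # prior that Bob has good priors (assumed)
q = 0.9           # chance an informed Bob picks the true hypothesis (assumed)

p_consistent_if_informed = q + (1 - q) * 0.5  # the true hypothesis is always consistent
p_consistent_if_guessing = 0.5                # ignoring the ~1/2N chance of guessing the truth

evidence = p_informed * p_consistent_if_informed + (1 - p_informed) * p_consistent_if_guessing
bob_posterior = p_informed * q / evidence  # P(Bob's hypothesis is true | it survived the test)

print(f"Alice: {alice_posterior:.4f}   Bob: {bob_posterior:.4f}")
```

Even with these modest assumed numbers, Bob’s surviving prediction ends up over a hundred times more likely to be true than Alice’s post-hoc pick, matching the intuition that the experiment mostly confirms a hypothesis Bob had already located.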
If I show you the code I’m going to use to run my experiment, can you be confident in guessing which hypothesis I aim to test?
Under this model, I can’t be confident in guessing which hypothesis you are trying to test.
My solution to the puzzle: Pre-registration works because it forces researchers to be honest about their prior knowledge. Basically, prior knowledge unencumbered by hindsight bias (“armchair reasoning”) is underrated.
It’s possible that Alice herself believes that the hypothesis she settled on was correct, rather than assigning it a 1/N probability. If that were the case, I would say it was due to hindsight bias.
Any hypothesis which has only the support of armchair reasoning or data from a single experiment is suspect.
Yeah, I broadly agree with this.
[BTW, I think “armchair reasoning” might have the same referent as phrases with a more positive connotation: “deconfusion work” or “research distillation”.]
I definitely do not mean research distillation. Deconfusion work feels like a separate thing, which is usually a particular example of armchair reasoning. By armchair reasoning, I mean any sort of reasoning that can be done by just thinking without gathering more data. So for example, solving a thorny algorithms question would involve armchair reasoning.
I don’t mean to include the negative connotations of “armchair reasoning”, but I don’t know another short phrase that means the same thing.
Interesting. I think you’re probably right that our model should have a parameter for “researcher quality”, and if a researcher is able to correctly predict the outcome of an experiment, that should cause an update in the direction of that researcher being more knowledgeable (and their prior judgements should therefore carry more weight—including for this particular experiment!).
But the story you’re telling doesn’t seem entirely compatible with your comment earlier in this thread. Earlier you wrote: “However, it is often the case that you could get a lot more high-quality evidence that basically settles the question, if you put in many hours of work.” But in this recent comment you wrote: “the experiment provides the last little bit of evidence needed to confirm [the hypothesis]”. In the earlier comment, it sounds like you’re talking about a scenario where most of the evidence comes in the form of data; in the later comment, it sounds like you’re talking about a scenario where most of the evidence was necessary “just to think of the correct answer—to promote it to your attention” and the experiment only provides “the last little bit” of evidence.
So I think the philosophical puzzle is still unsolved. A few more things to ponder if someone wants to work on solving it:
If Bob is known to be an excellent researcher, can we trust HARKing if it comes from him? Does the mechanism by which hindsight bias works matter? (Here is one possible mechanism.)
In your simplified model above, there’s no possibility of a result that is “just noise” and not explained by any particular hypothesis. But noise appears to be a pretty big problem (see: the replication crisis). In current scientific practice, the probability that a result could have been obtained through noise is a number of great interest that’s almost always calculated (the p-value). How should this number be factored in, if at all?
Note that p-values can be used in Bayesian calculations. For example, in a simplified universe where either the null is true or the alternative is true, p(alternative|data) = p(data|alternative)p(alternative) / (p(data|alternative)p(alternative) + p(data|null)p(null))
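For instance, plugging illustrative numbers into that formula (a 50-50 prior, data that would arise 4% of the time under the null and 60% of the time under the alternative):

```python
# Illustrative numbers for the two-hypothesis Bayes formula above. Caveat: a
# p-value is really P(data at least this extreme | null), which is only a
# stand-in for p(data|null) here.
p_alt = 0.5                # prior on the alternative (assumed)
p_null = 1 - p_alt
p_data_given_null = 0.04   # p-value-like quantity (assumed)
p_data_given_alt = 0.60    # power-like quantity (assumed)

posterior_alt = (p_data_given_alt * p_alt) / (
    p_data_given_alt * p_alt + p_data_given_null * p_null
)
print(f"p(alternative | data) = {posterior_alt:.3f}")  # 0.938
```

So under these assumptions a small p-value translates into a fairly confident posterior, but only because the prior and the likelihood under the alternative were also favorable.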
My solution was focused on a scenario where we’re considering relatively obvious hypotheses and subject to lots of measurement noise, but you convinced me this is inadequate in general.
I’m unsatisfied with the discussion around “Alice didn’t think of all of them”. I know nothing about relativity, but I imagine a big part of Einstein’s contribution was his discovery of a relatively simple hypothesis which explained all the data available to him. (By “relatively simple”, I mean a hypothesis that didn’t have hundreds of free parameters.) Presumably, Einstein had access to the same data as other contemporary physicists, so it feels weird to explain his contribution in terms of having access to more evidence.
In other words, it feels like the task of searching hypothesis space should be factored out from the task of Bayesian updating. This seems closely related to puzzles around “realizability”—through your search of hypothesis space, you’re essentially “realizing” a particular hypothesis on the fly, which isn’t how Bayesian updating is formally supposed to work. (But it is how deep learning works, for example.)
But the story you’re telling doesn’t seem entirely compatible with your comment earlier in this thread.
The earlier comment was comparing experiments to “armchair reasoning”, while the later comment was comparing experiments to “all prior knowledge”. I think the typical case is:
Amount of evidence in “all prior knowledge” >> Amount of evidence in an experiment >> Amount of evidence from “armchair reasoning”.
If Bob is known to be an excellent researcher, can we trust HARKing if it comes from him?
I would pay a little more attention, but not that much more, and would want an experimental confirmation anyway. It seems to me that the world is complex enough, and humans model it badly enough (for the sorts of things academia is looking at), that past evidence of good priors on one question doesn’t imply good priors on a different question.
(This is an empirical belief; I’m not confident in it.)
In your simplified model above, there’s no possibility of a result that is “just noise” and not explained by any particular hypothesis.
I expect that if you made a more complicated model where each hypothesis H had a likelihood p(D∣H), and p(D∣H) was high for N hypotheses and low for the rest, you’d get a similar conclusion, while accounting for results that are just noise.
I know nothing about relativity, but I imagine a big part of Einstein’s contribution was his discovery of a relatively simple hypothesis which explained all the data available to him.
I agree that relativity is an example that doesn’t fit my story, where most of the work was in coming up with the hypothesis. (Though I suspect you could argue that relativity shouldn’t have been believed before experimental confirmation.) I claim that it is the exception, not the rule.
Also, I do think it is often a valuable contribution to even think of a plausible hypothesis that fits the data, even if you should assign it a relatively low probability of being true. I’m just saying that if you want to reach the truth, this work must be supplemented by experiments / gathering good data.
In other words, it feels like the task of searching hypothesis space should be factored out from the task of Bayesian updating.
Bayesian updating does not work well when you don’t have the full hypothesis space. Given that you know that you don’t have the full hypothesis space, you should not be trying to approximate Bayesian updating over the hypothesis space you do have.
Bayesian updating does not work well when you don’t have the full hypothesis space.
Do you have any links related to this? Technically speaking, the right hypothesis is almost never in our hypothesis space (“All models are wrong, but some are useful”). But even if there’s no “useful” model in your hypothesis space, it seems Bayesian updating fails gracefully if you have a reasonably wide prior distribution for your noise parameters as well (then the model fitting process will conclude that the value of your noise parameter must be high).
No, I haven’t read much about Bayesian updating. But I can give an example.
Consider the following game. I choose a coin. Then we play N rounds. In each round, you make a bet on whether the coin will come up Heads or Tails, at 1:2 odds, which I must take (i.e. if you’re right I give you $2, and if I’m right you give me $1). Then I flip the coin and the bet resolves.
If your hypothesis space is “the coin has some bias b of coming up Heads or Tails”, then you will eagerly accept this game for large enough N—you will quickly learn the bias b from experiments, and then you can keep getting money in expectation.
However, if it turns out I am capable of making the coin come up Heads or Tails as I choose, then I will win every round. If you keep doing Bayesian updating on your misspecified hypothesis space, you’ll keep flip-flopping on whether the bias is towards Heads or Tails, and you will quickly converge to near-certainty that the bias is 50% (since the pattern will be HTHTHTHT...), and yet I will be taking a dollar from you every round. Even if you have the option of quitting, you will never exercise it because you keep thinking that the EV of the next round is positive.
Noise parameters can help (though the bias b is kind of like a noise parameter here, and it didn’t help). I don’t know of a general way to use noise parameters to avoid issues like this.
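Here is a sketch of that game in code (my rendering of the setup, not anything canonical): the learner does Beta-Bernoulli updating over the bias b, and the adversary simply makes each flip come up against the bet.

```python
heads, tails = 1, 1  # Beta(1, 1) prior over the bias b
money = 0.0
for _ in range(1000):
    p_heads = heads / (heads + tails)  # posterior mean of b
    bet_heads = p_heads >= 0.5         # bet the likelier side; at 1:2 odds the
                                       # learner always thinks this bet is +EV
    flip_is_heads = not bet_heads      # adversary picks the outcome that loses the bet
    money += 2 if bet_heads == flip_is_heads else -1
    if flip_is_heads:
        heads += 1
    else:
        tails += 1

print(f"posterior mean of b: {heads / (heads + tails):.3f}")  # 0.500
print(f"learner's winnings:  {money}")                        # -1000.0
```

The flips come out alternating, so the posterior over b concentrates near 1/2, and the learner loses a dollar every round while always expecting the next round to be profitable.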
FYI this is something the LW team thinks about a bunch and I expect us to have made some serious effort towards incentivizing and simplifying this process in the coming year.
Just got a calendar reminder to check if this happened—my impression is that any such efforts haven’t really materialised on the site.
I went and published such a distillation (which attempts to summarise the post What Failure Looks Like and distill its comments).
FP works well when it is easy to make progress on ideas / questions through armchair reasoning, which you can think of as using the information or evidence you have more efficiently. However, it is often the case that you could get a lot more high-quality evidence that basically settles the question, if you put in many hours of work. As an illustrative example, consider trying to estimate the population of a city via FP, vs. going out and doing a census. In ML, you could say “we should add such-and-such inductive bias to our models so that they learn faster”, and we can debate how much that would help, but if you actually build it in and train the model and see what happens, you just know the answer now.
Hm, you think data soundly beats theory in ML? Why is HARKing a problem then?
[BTW, I think “armchair reasoning” might have the same referent as phrases with a more positive connotation: “deconfusion work” or “research distillation”.]
My solution to the puzzle is a bit different (but maybe the same?). Let’s suppose that there’s an experiment we could run that would come out with some result D. Each potential value of D is consistent with N hypotheses, and there are 2N potential hypotheses in total.
Suppose Alice runs the experiment, sees D, and then chooses a hypothesis to explain it. This is consistent with Alice having a uniform prior, in which case she has a 1/N chance of having settled on the true hypothesis. (Why not just list all N hypotheses? Because Alice didn’t think of all of them—it’s hard to search the entire space of 2N hypotheses.)
On the other hand, if Bob chose a hypothesis to test via his priors, ran the experiment, and then D was consistent with that hypothesis… you should infer that Bob’s priors were really good (i.e. not uniform) and the hypothesis is correct. After all, if Bob’s hypothesis was chosen at random, he only had an N/2N = 1/2 chance of getting a D that was consistent with it.
Put another way: When I see the first scenario, I expect that the evidence gathered from the experiment is primarily serving to locate the hypothesis at all. When I see the second scenario, I expect that Bob has already successfully located the hypothesis before the experiment, and the experiment provides the last little bit of evidence needed to confirm it.
Related: Privileging the hypothesis
Under this model, I can’t be confident in guessing which hypothesis you are trying to test.
It’s possible that Alice herself believes that the hypothesis she settled on was correct, rather than assigning it a 1N probability. If that were the case, I would say it was due to hindsight bias.
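A quick Monte Carlo sketch of this model (all parameters are made up for illustration: 2N = 20 hypotheses, N = 10 consistent with each value of D, and a knob q for how often Bob’s prior picks out the truth):

```python
import random

random.seed(0)

TWO_N = 20          # total number of hypotheses
N = TWO_N // 2      # hypotheses consistent with each value of D
Q = 0.3             # assumed prob. that Bob's prior picks out the truth
TRIALS = 100_000

# Hypotheses 0..N-1 predict D=0; hypotheses N..2N-1 predict D=1.
def outcome(h):
    return 0 if h < N else 1

alice_correct = 0
bob_consistent = 0
bob_correct = 0

for _ in range(TRIALS):
    truth = random.randrange(TWO_N)
    d = outcome(truth)

    # Alice: sees D first, then picks some hypothesis consistent with it.
    consistent = range(N) if d == 0 else range(N, TWO_N)
    alice_pick = random.choice(list(consistent))
    alice_correct += (alice_pick == truth)

    # Bob: picks a hypothesis first (good prior with prob. Q), then runs
    # the experiment and checks whether D confirms it.
    bob_pick = truth if random.random() < Q else random.randrange(TWO_N)
    if outcome(bob_pick) == d:            # Bob's hypothesis "confirmed"
        bob_consistent += 1
        bob_correct += (bob_pick == truth)

print(f"Alice correct: {alice_correct / TRIALS:.3f} (theory: 1/N = {1 / N})")
print(f"Bob correct, given confirmation: {bob_correct / bob_consistent:.3f}")
```

Conditioning on Bob’s hypothesis surviving the test updates us toward “Bob’s prior is good”, so his confirmed hypotheses are right far more often than Alice’s post-hoc ones, even though both “explain the data” equally well.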
Yeah, I broadly agree with this.
I definitely do not mean research distillation. Deconfusion work feels like a separate thing, which is usually a particular example of armchair reasoning. By armchair reasoning, I mean any sort of reasoning that can be done by just thinking without gathering more data. So for example, solving a thorny algorithms question would involve armchair reasoning.
I don’t mean to include the negative connotations of “armchair reasoning”, but I don’t know another short phrase that means the same thing.
Interesting. I think you’re probably right that our model should have a parameter for “researcher quality”, and if a researcher is able to correctly predict the outcome of an experiment, that should cause an update in the direction of that researcher being more knowledgeable (and their prior judgements should therefore carry more weight—including for this particular experiment!).
But the story you’re telling doesn’t seem entirely compatible with your comment earlier in this thread. Earlier you wrote: “However, it is often the case that you could get a lot more high-quality evidence that basically settles the question, if you put in many hours of work.” But in this recent comment you wrote: “the experiment provides the last little bit of evidence needed to confirm [the hypothesis]”. In the earlier comment, it sounds like you’re talking about a scenario where most of the evidence comes in the form of data; in the later comment, it sounds like you’re talking about a scenario where most of the evidence was necessary “just to think of the correct answer—to promote it to your attention” and the experiment only provides “the last little bit” of evidence.
So I think the philosophical puzzle is still unsolved. A few more things to ponder if someone wants to work on solving it:
If Bob is known to be an excellent researcher, can we trust HARKing if it comes from him? Does the mechanism by which hindsight bias works matter? (Here is one possible mechanism.)
In your simplified model above, there’s no possibility of a result that is “just noise” and not explained by any particular hypothesis. But noise appears to be a pretty big problem (see: the replication crisis?). In current scientific practice, the probability that a result could have been obtained through noise is a number of great interest that’s almost always calculated (the p-value). How should this number be factored in, if at all?
Note that p-values can be used in Bayesian calculations. For example, in a simplified universe where either the null is true or the alternative is true,
p(alternative | data) = p(data | alternative) p(alternative) / [p(data | alternative) p(alternative) + p(data | null) p(null)]
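Plugging in some made-up numbers (a test with power 0.8, a result landing in the p < 0.05 rejection region, and a 10% prior on the alternative):

```python
# All numbers here are illustrative assumptions, not from the discussion.
p_data_given_alt = 0.80   # power: prob. of a "significant" result if alt is true
p_data_given_null = 0.05  # significance level: same prob. if the null is true
p_alt = 0.10              # prior probability of the alternative
p_null = 1 - p_alt

posterior = (p_data_given_alt * p_alt) / (
    p_data_given_alt * p_alt + p_data_given_null * p_null
)
print(posterior)  # ≈ 0.64: "significant", but far from certainty
```

So even a result that clears p < 0.05 can leave substantial probability on the null when the prior on the alternative is low, which is one way to cash out why the prior matters as much as the likelihood.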
My solution was focused on a scenario where we’re considering relatively obvious hypotheses and subject to lots of measurement noise, but you convinced me this is inadequate in general.
I’m unsatisfied with the discussion around “Alice didn’t think of all of them”. I know nothing about relativity, but I imagine a big part of Einstein’s contribution was his discovery of a relatively simple hypothesis which explained all the data available to him. (By “relatively simple”, I mean a hypothesis that didn’t have hundreds of free parameters.) Presumably, Einstein had access to the same data as other contemporary physicists, so it feels weird to explain his contribution in terms of having access to more evidence.
In other words, it feels like the task of searching hypothesis space should be factored out from the task of Bayesian updating. This seems closely related to puzzles around “realizability”—through your search of hypothesis space, you’re essentially “realizing” a particular hypothesis on the fly, which isn’t how Bayesian updating is formally supposed to work. (But it is how deep learning works, for example.)
The earlier comment was comparing experiments to “armchair reasoning”, while the later comment was comparing experiments to “all prior knowledge”. I think the typical case is:
Amount of evidence in “all prior knowledge” >> Amount of evidence in an experiment >> Amount of evidence from “armchair reasoning”.
I would pay a little more attention, but not that much more, and would want an experimental confirmation anyway. It seems to me that the world is complex enough and humans model it badly enough (for the sorts of things academia is looking at) that past evidence of good priors on one question doesn’t imply good priors on a different question.
(This is an empirical belief; I’m not confident in it.)
I expect that if you made a more complicated model where each hypothesis H had a likelihood p(D|H), and p(D|H) was high for N hypotheses and low for the rest, you’d get a similar conclusion, while accounting for results that are just noise.
I agree that relativity is an example that doesn’t fit my story, where most of the work was in coming up with the hypothesis. (Though I suspect you could argue that relativity shouldn’t have been believed before experimental confirmation.) I claim that it is the exception, not the rule.
Also, I do think it is often a valuable contribution to even think of a plausible hypothesis that fits the data, even if you should assign it a relatively low probability of being true. I’m just saying that if you want to reach the truth, this work must be supplemented by experiments / gathering good data.
Bayesian updating does not work well when you don’t have the full hypothesis space. Given that you know that you don’t have the full hypothesis space, you should not be trying to approximate Bayesian updating over the hypothesis space you do have.
Do you have any links related to this? Technically speaking, the right hypothesis is almost never in our hypothesis space (“All models are wrong, but some are useful”). But even if there’s no “useful” model in your hypothesis space, it seems Bayesian updating fails gracefully if you have a reasonably wide prior distribution for your noise parameters as well (then the model fitting process will conclude that the value of your noise parameter must be high).
No, I haven’t read much about Bayesian updating. But I can give an example.
Consider the following game. I choose a coin. Then, we play N rounds. In each round, you bet on whether the coin will come up Heads or Tails, at 1:2 odds which I must take (i.e. if you’re right I give you $2 and if I’m right you give me $1). Then I flip the coin and the bet resolves.
If your hypothesis space is “the coin has some bias b of coming up Heads or Tails”, then you will eagerly accept this game for large enough N—you will quickly learn the bias b from experiments, and then you can keep getting money in expectation.
However, if it turns out I am capable of making the coin come up Heads or Tails as I choose, then I will win every round. If you keep doing Bayesian updating on your misspecified hypothesis space, you’ll keep flip-flopping on whether the bias is towards Heads or Tails, and you will quickly converge to near-certainty that the bias is 50% (since the pattern will be HTHTHTHT...), and yet I will be taking a dollar from you every round. Even if you have the option of quitting, you will never exercise it because you keep thinking that the EV of the next round is positive.
Noise parameters can help (though the bias b is kind of like a noise parameter here, and it didn’t help). I don’t know of a general way to use noise parameters to avoid issues like this.
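Here’s a sketch of this game in code, under a couple of assumptions not spelled out above: the bettor does Beta-Bernoulli updating over “the coin has bias b”, always bets the side their posterior favors, and the adversary simply makes the coin land the other way:

```python
# The bettor's hypothesis space is "the coin has some bias b" (Beta(1,1)
# prior). The true hypothesis -- "the flipper controls each outcome" -- is
# outside that space.
heads = tails = 0
money = 0

for _ in range(100):
    p_heads = (heads + 1) / (heads + tails + 2)  # posterior predictive
    bet = "H" if p_heads >= 0.5 else "T"

    # EV check at 1:2 odds: positive whenever P(win) > 1/3, so the bettor
    # (who always bets the favored side, P(win) >= 0.5) never wants to quit.
    p_win = p_heads if bet == "H" else 1 - p_heads
    assert 2 * p_win - (1 - p_win) > 0

    flip = "T" if bet == "H" else "H"            # adversary wins every round
    money += 2 if flip == bet else -1
    heads += flip == "H"
    tails += flip == "T"

print(money)                              # -100: lost every single round
print((heads + 1) / (heads + tails + 2))  # ~0.5: "the coin is fair"
```

The posterior happily converges to b ≈ 0.5 while the bettor bleeds a dollar a round; no amount of updating within the misspecified space ever flags the problem.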
Thanks for the example!