Obviously hypotheses do not just come out of an “unbiased random sampling” process; there are intuitions driving them that incorporate tons of evidence the scientist already has.
I thought you were saying something along the lines of: “some people seem particularly good at this; instead of producing hypotheses that have a 1/1000 chance of being correct, they produce hypotheses with a 1/2 chance of being correct. Let’s look at these people in particular and figure out how to replicate their reasoning”.
I’m saying in response to that (which may not be what you meant): “In the specific case of Carnot’s theorem, my default hypothesis is that ~1000 people tried hypotheses with probability ~1/1000 and one happened to be correct; you can study any of those 1000 people / ideas instead of studying Carnot in particular. (Studying the wrong ones is probably better; the wrong parts could tell you what people can’t do when creating hypotheses in advance.)”
I believe that your default hypothesis is wrong because it is assuming an incredible amount of structure and is in contradiction with the history of science (and invention in general).
I wasn’t trying to give a grand theory of science and invention. I’m trying to explain the specific question I quoted, about why a seemingly “bad” analogy still worked out well in this case.
I also don’t know what you think the hypothesis is in contradiction with.
If you have an incredibly large space of possibilities, no amount of unbiased random sampling will yield anything, certainly not the bounty of results we get from science.
I totally agree it was biased in the sense that “dissipative theory” is a lot simpler than “on Sundays, my experiments do whatever Abraham Lincoln would have predicted would happen; on other days it’s whatever George Washington would have predicted”, and so people investigated theories like the former much more than theories like the latter.
If you want some evidence that this example was not just a random sample that happened to work but an actually strongly biased move: there’s the fact that Sadi’s work, after being neglected, is what the founders of modern thermodynamics used 25 years later for their formalization, despite its age. Also, most of his results, despite staying in obscurity for at least 10 years, hadn’t been rediscovered AFAIK (otherwise I would expect things like Carnot’s theorem to bear the names of multiple inventors).
I expect to see this result in a random sampling world; why don’t you? It seems like you just have to wait for the same random sample to be drawn again; not drawing that sample in 25 years seems totally normal.
Thanks for the detailed answer!
I thought you were saying something along the lines of: “some people seem particularly good at this; instead of producing hypotheses that have a 1/1000 chance of being correct, they produce hypotheses with a 1/2 chance of being correct. Let’s look at these people in particular and figure out how to replicate their reasoning”.
I’m saying in response to that (which may not be what you meant): “In the specific case of Carnot’s theorem, my default hypothesis is that ~1000 people tried hypotheses with probability ~1/1000 and one happened to be correct; you can study any of those 1000 people / ideas instead of studying Carnot in particular. (Studying the wrong ones is probably better; the wrong parts could tell you what people can’t do when creating hypotheses in advance.)”
I feel like you’re getting my point, but I’ll still add the subtlety that I’m saying “anyone who isn’t biased somehow has a chance of 10^-60, and so always fails”. I’m still confused about why you think that your proposal is more realistic. Could you give me your intuition here for the uniform sampling case? Or is it just that you go for this model by default?
I wasn’t trying to give a grand theory of science and invention. I’m trying to explain the specific question I quoted, about why a seemingly “bad” analogy still worked out well in this case.
I also don’t know what you think the hypothesis is in contradiction with.
Contradiction with the fact that many discoveries and inventions seem to emerge in cases where the possibility space was far too large for a uniform sampling to have a chance.
I totally agree it was biased in the sense that “dissipative theory” is a lot simpler than “on Sundays, my experiments do whatever Abraham Lincoln would have predicted would happen; on other days it’s whatever George Washington would have predicted”, and so people investigated theories like the former much more than theories like the latter.
I agree with that, but I meant more that dissipative theory was biased towards the truth compared to theories that would have been considered at the same level.
I expect to see this result in a random sampling world; why don’t you? It seems like you just have to wait for the same random sample to be drawn again; not drawing that sample in 25 years seems totally normal.
When I look at my confusion here, it’s because the point I was making is that in those 25 years people rediscovered and recreated the same stuff about steam engines a lot (I haven’t checked deeply, but I would be willing to bet on it), whereas they hadn’t found Sadi’s result again. Which to me is clear evidence that the sampling, if random, was not uniform at all. Does that answer your question, or am I missing your point completely?
Could you give me your intuition here for the uniform sampling case?
A bad analogy led to a good theory. This seems more probable under theories that involve luck than theories that involve skill. Hence, 1000 people with 1/1000 probability theories, rather than 2 people with 1/2 probability theories. Again, this is for this specific case, not for science as a whole.
I don’t think the literal uniform theory is actually correct; there are still going to be differences in people’s ability, so that it’s more like 10,000 people with ~0 probability theories, 1000 people with 1/2000 probability theories, and 100 people with 1/200 probability theories. But the fundamental point is that I don’t expect to gain much more by studying the people who got it right than by studying the people who got it wrong in a plausible way (and if anything I expect you to learn more from the latter category).
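As a quick sanity check (a sketch only, treating each theory as an independent draw with the quoted probability), this tiered version still yields about one expected correct theory, just like the flat 1000 × 1/1000 story:

```python
# Sanity check of the tiered model above, treating each theory as an
# independent draw. Tier sizes and probabilities are the ones quoted.
tiers = [
    (10_000, 0.0),       # people with ~0-probability theories
    (1_000, 1 / 2_000),  # people with 1/2000-probability theories
    (100, 1 / 200),      # people with 1/200-probability theories
]

expected_successes = sum(n * p for n, p in tiers)

p_none = 1.0
for n, p in tiers:
    p_none *= (1 - p) ** n

print(f"expected correct theories: {expected_successes:.2f}")  # ~1.00
print(f"P(at least one correct):   {1 - p_none:.2f}")          # ~0.63
```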
Contradiction with the fact that many discoveries and inventions seem to emerge in cases where the possibility space was far too large for a uniform sampling to have a chance.
Do you agree there’s no contradiction now that I’ve specified that it’s sampling from a biased distribution of ideas that have ~1/1000 probability?
I meant more that dissipative theory was biased towards the truth compared to theories that would have been considered at the same level.
Yeah I think it’s unclear why that should be true. (Assuming that by “at the same level” you mean theories that were posed by other scientists of comparable stature seeking to explain similar phenomena.)
When I look at my confusion here, it’s because the point I was making is that in those 25 years people rediscovered and recreated the same stuff about steam engines a lot (I haven’t checked deeply, but I would be willing to bet on it), whereas they hadn’t found Sadi’s result again. Which to me is clear evidence that the sampling, if random, was not uniform at all.
How is it clear evidence? Imagine a “uniform random sampling” story in which we produce 10 theories of probability 1/1000 per year. Then in expectation it takes 100 years to produce the right theory, and it is entirely unsurprising that in 25 years people don’t rediscover the right theory. So how are you using the observation “not rediscovered in 25 years” to update against “uniform random sampling”?
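To make that arithmetic explicit (a minimal sketch using exactly those toy numbers, and assuming the draws are independent):

```python
# Toy numbers from above: 10 independent theories per year, each with
# a 1/1000 chance of being the right one.
p, per_year, years = 1 / 1000, 10, 25

expected_wait_years = 1 / (p * per_year)    # ~100 years until a hit, in expectation
p_no_hit = (1 - p) ** (per_year * years)    # chance of zero hits in 25 years

print(f"expected wait: {expected_wait_years:.0f} years")   # 100
print(f"P(no rediscovery in 25 years): {p_no_hit:.2f}")    # ~0.78
```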
My take: if you are somehow going from the “real” prior probability (i.e. the figure for a true random draw from the uniform distribution on the hypothesis space, which Adam estimated in his comment as 10^-60, although I expect it could be even lower depending on exactly what hypothesis space we’re talking about) all the way to 10^-3 (the 1/1000 figure you give), you are already jumping a large number of orders of magnitude, and it seems to me unjustified to assert you can only jump this many orders of magnitude, but no further. Indeed, if you can jump from 10^-60 to 10^-3, why can you not in principle jump slightly farther, and arrive at probability estimates that are non-negligible even from an everyday perspective, such as 10^-2 or even 10^-1?
And it seems to me that you must be implicitly asserting something like this, if you give the probability of a random proposed theory being successful as 1 in 1000 rather than 1 in 10^60. Where did that 1/1000 number come from? It certainly doesn’t look to me like it came out of any principled estimate for how much justified Bayesian update can be wrung out of the evidence historically available, where that estimate just happened to arrive at ~570 decibels but no more; in fact it seems like that 1000 number basically was chosen to roughly match the number of hypotheses you think were plausibly put forth before the correct one showed up. If so, then this is… pretty obviously not proper procedure, in my view.
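For reference, the ~570 figure is just the size of that jump expressed in decibels of evidence, i.e. 10·log10 of the odds ratio (a quick check, treating probabilities this small as interchangeable with odds):

```python
import math

# Size of the jump from a ~10^-60 "raw" prior to ~10^-3, in decibels
# of evidence (10 * log10 of the odds ratio).
prior, posterior = 1e-60, 1e-3
print(f"{10 * math.log10(posterior / prior):.0f} dB")  # 570
```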
For myself, I basically find Eliezer’s argument in Einstein’s Speed as convincing as I did when I first read it, and for basically all the same reasons: finding the right theory and promoting it to the range where it first deserves attention but before it becomes an obvious candidate for most of the probability mass requires hitting a narrow target in update-space, and humans are not in general known for their precision. With far greater likelihood, if somebody identified the correct-in-retrospect theory, the evidence available to them at the time was sufficient from a Bayesian perspective to massively overdetermine that theory’s correctness, and it was only their non-superintelligence that caused them to update so little and so late. Hitting a narrow range is implausible; overshooting that range, on the other hand, significantly less so.
At this point you may protest that the 1/1000 probability you give is not meant as an estimate for the actual probability a Bayes-optimal predictor would assign after updating on the evidence; instead it’s whatever probability is justified for a human to assign, knowing that they are likely missing much of the picture, and that this probability is bounded from above at 10^-3 or thereabouts, at least for the kind of hard scientific problems the OP is discussing.
To be blunt: I find this completely unpersuasive. Even ignoring the obvious question from before (why 10^-3?), I can see no a priori reason why someone could not find themselves in an epistemic state where (from the inside at least) the evidence they have implies a much higher probability of correctness. From this epistemic state they might then find themselves producing statements like
I believe myself to be writing a book on economic theory which will largely revolutionize—not I suppose, at once but in the course of the next ten years—the way the world thinks about its economic problems. I can’t expect you, or anyone else, to believe this at the present stage. But for myself I don’t merely hope what I say—in my own mind, I’m quite sure.
—John Maynard Keynes
statements which, if you insist on maintaining that 10^-3 upper bound (and why so, at this point?), certainly become much harder to explain without resorting to some featureless “overconfidence” thingy; and that has been discussed in detail.
Again, I’m not claiming that this is true in general. I think it is plausible to reach, idk, 90%, maybe higher, that a specific idea will revolutionize the world, even before getting any feedback from anyone else or running experiments in the world. (So I feel totally fine with the statement from Keynes that you quoted.)
I would feel very differently about this specific case if there was an actual statement from Sadi of the form “I believe that this particular theorem is going to revolutionize thermodynamics” (and he didn’t make similar statements about other things that were not revolutionary).
it seems like that 1000 number basically was chosen to roughly match the number of hypotheses you think were plausibly put forth before the correct one showed up. If so, then this is… pretty obviously not proper procedure, in my view.
I totally agree that’s what I did, but it seems like a perfectly fine procedure. Idk where the disconnect is, but maybe you’re thinking of “1000” as coming from a weirdly opinionated prior, rather than from my posterior.
From my perspective, I start out having basically no idea what the “justifiable prior” on that hypothesis is. (If you want, you could imagine that my prior on the “justifiable prior” was uniform over log-10 odds of −60 to 10; my prior is more opinionated than that but the extra opinions don’t matter much.) Then, I observe that the hypothesis we got seems to be kinda ad hoc with no great story even in hindsight for why it worked while other hypotheses didn’t. My guess is then that it was about as probable (in foresight) as the other hypotheses around at the time, and combined with the number of hypotheses (~1000) and the observation that one of them worked, you get the probability of 1/1000.
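Here is a minimal sketch of that update, under the simplifying assumptions that the ~1000 contemporaneous hypotheses are exchangeable draws with a common unknown success probability, that the prior over that probability is the roughly-uniform-over-log-odds one described above, and that we observe exactly one of them working:

```python
import math

# Rough numerical version of the update described above (assumptions as
# stated in the lead-in; none of this is meant as a precise model).
grid = [x / 10 for x in range(-600, 101)]  # log10-odds grid from -60 to 10


def prob(log10_odds):
    odds = 10 ** log10_odds
    return odds / (1 + odds)


def likelihood(log10_odds, n=1000, k=1):
    # P(exactly k of the n hypotheses work | per-hypothesis probability)
    p = prob(log10_odds)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


weights = [likelihood(x) for x in grid]  # uniform prior, so weights are just likelihoods
posterior_mean = sum(w * prob(x) for w, x in zip(weights, grid)) / sum(weights)
print(f"posterior mean per-hypothesis probability: {posterior_mean:.4f}")  # ~0.001
```

The point is only that “about 1 success out of ~1000 comparable attempts” pushes the estimate to roughly 1/1000 almost regardless of where in the −60 to 10 range the prior mass started.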
(I guess a priori you could have imagined that hypotheses should either have probability approximately 10^-60 or approximately 1, since you already have all the bits you need to deduce the answer, but it seems like in practice even the most competent people frequently try hypotheses that end up being wrong / unimportant, so that can’t be correct.)
As a different example, consider machine learning. Suppose you tell me that <influential researcher> has a new idea for RL sample efficiency they haven’t tested, and you want me to tell you the probability it would lead to a 5x improvement in sample efficiency on Atari. It seems like the obvious approach to estimate this probability is to draw the graph of how much sample efficiency improved from previous ideas from that researcher (and other similar researchers, to increase sample size), and use that to estimate P(effect size > 5x | published), and then apply an ad hoc correction for publication bias. I claim that my reasoning above is basically analogous to this reasoning.
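To spell out that procedure as a sketch (the improvement numbers and the publication-rate correction below are made-up placeholders, not real data; you would substitute the researcher’s actual track record):

```python
# Sketch of the estimation procedure described above. All numbers are
# hypothetical placeholders.
past_improvements = [1.2, 1.5, 1.8, 2.0, 2.5, 3.0, 4.0, 6.0]  # hypothetical sample-efficiency multipliers

threshold = 5.0
p_big_given_published = sum(x > threshold for x in past_improvements) / len(past_improvements)

# Ad hoc publication-bias correction: assume (for the sketch) that only some
# fraction of attempted ideas get published, and that any >5x result would
# certainly be published, so that
#   P(>5x | attempted) = P(>5x | published) * P(published | attempted).
publication_rate = 0.5  # assumed, not measured
p_big_given_attempted = p_big_given_published * publication_rate

print(f"P(>5x improvement | published) ≈ {p_big_given_published:.2f}")
print(f"P(>5x improvement | attempted) ≈ {p_big_given_attempted:.2f}")
```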