Your linked comment makes the same point I am making, only much more tersely. One reason I'm so much less terse is that I'm not very confident in your off-hand remarks—I think many or most of them are interesting ideas which are worth bringing up, but the implicit claim that these issues are well understood is misleading, and the actual arguments often don't work.
I don’t quite know what idealization you are talking about. E.g.,
There are all sorts of important things, such as correct regression towards the mean, which apply only to estimates but which you won't see in the ideal case.
If I have a noisy estimate and a prior, I should regress towards the mean. By the “ideal case” do you mean the case in which my estimates have no noise? That is a strange idealization, which people might implicitly use but probably wouldn’t advocate.
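(To make the regression concrete, here is the standard textbook case rather than anything specific to this thread: assume a Gaussian prior $Y \sim \mathcal{N}(\mu, \tau^2)$ and a noisy estimate $X = Y + \varepsilon$ with independent noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Then the regressed estimate is
$$\mathbb{E}[Y \mid X] = \mu + \frac{\tau^2}{\tau^2 + \sigma^2}\,(X - \mu),$$
so the noisier the estimate relative to the prior, the further it is pulled back toward the prior mean.)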
With respect to the other points, I agree that estimation is hard, but the difficulties you cite seem to fit pretty squarely into the simple theoretical framework of computing a well-calibrated estimate of expected value. So to the extent there are gaps between that simple framework and reality, these difficulties don’t point to them.
For example, to make this point in the case of sums with biased terms, you would need to say how you could predictably do better by throwing out terms of an estimate, even when you don't expect their inclusion to be correlated with their contribution to the estimate. Everyone agrees that if you know X is biased you should respond appropriately. If you don't know that X is biased, then how do you know to throw it out? One thing you could do is just to be skeptical in general and use simple estimates when there is a significant opportunity for bias. But that, again, fits into the framework I'm talking about, and you can easily argue for it on those grounds.
An alternative approach would be to criticize folks’ actual epistemology for not living up to the theoretical standards they set. It seems like that criticism is obviously valid, both around LW and elsewhere. If that is the point you want to make I am happy to accept it.
Which it is not (doesn’t sum to 1 over exclusive alternatives, doesn’t reflect symmetries in knowledge).
I agree that if I assume that my beliefs satisfy the axioms of probability, I will get into trouble (a general pattern with assuming false things). But I don’t see why either of these properties—reflecting symmetries, summing to one over exclusive alternatives—are necessary for good outcomes. Suppose that I am trying to estimate the relative goodness of two options in order to pick the best. Why should it matter whether my beliefs have these particular consistency properties, as long as they are my best available guess? In fact, it seems to me like my beliefs probably shouldn’t satisfy all of the obvious consistency properties, but should still be used for making decisions. I don’t think that’s a controversial position.
It seems to me that quantitative optimism is not common among people with very good knowledge of what would be involved in a good quantitative approach—people who wrote important papers on the approximation of things. I can see, though, how quantitative optimism could arise in people who primarily know theory and its application to simple problems where nothing has to be approximated.
I am generally skeptical of the appeal to unspecified beliefs of unspecified experts. Yes, experts in numerical methods will be quick to say that approximating things well is hard, and indeed approximating things well is hard. That is a different issue than whether this particular theoretical framework for reasoning about approximations is sound, which is (1) not an issue on which experts in e.g. numerical methods are particularly well-informed, and (2) not a question for which you actually know the expert consensus.
For example, as a group physicists have quite a lot of experience estimating things and dealing with the world, and they seem to be very optimistic about quantitative methods, in the sense that I mean.
I think I can probably predict how a discussion with experts would go, if you tried to actually raise this question. It would begin with many claims of “things aren’t that simple” and attempts to distance from people with stupid naive views, and end with “yes, that formalism is obvious at that level of generality, but I assumed you were making some more non-trivial claims.”
This would be a fine response if I were trying to cast myself as better than experts because I have such an excellent clean theory (and I have little patience with Eliezer for doing this). But in fact I am just trying to say relatively simple things in the interest of building up an understanding.
For example, to make this point in the case of sums with biased terms, you would need to say how you could predictably do better by throwing out terms of an estimate, even when you don't expect their inclusion to be correlated with their contribution to the estimate.
I agree with pretty much everything else you wrote here (and in the OP), but I’m a bit confused by this line. It seems like if the terms have a mean that is close to zero, but high variance, then you will usually do better by getting rid of them.
I’m not convinced of this. If you know that a summand has a mean that is close to zero and a high variance, then your prior will be sharply concentrated and you will regress far to the mean. Including the regressed estimate in a sum will still increase your accuracy. (Though of course if the noise is expected to be 1000x greater than the signal, you will be dividing by a factor of 1000 which is more or less the same as throwing it out. But the naive Bayesian EV maximizer will still get this one right.)
Are we using summand to mean the same thing here? To me, if we have an expression X1 + X2 + X3, then the summands are X1, X2, and X3. If we want to estimate Y, and E[X1+X2+X3] = Y, but E[X2] is close to 0 while Var[X2] is large, then X1+X3 is a better estimate for Y than X1+X2+X3 is.
Assume you have noisy measurements X1, X2, X3 of physical quantities Y1, Y2, Y3 respectively; variables 1, 2, and 3 are independent; X2 is much noisier than the others; and you want a point-estimate of Y = Y1+Y2+Y3. Then you shouldn’t use either X1+X2+X3 or X1+X3. You should use E[Y1|X1] + E[Y2|X2] + E[Y3|X3]. Regression to the mean is involved in computing each of the conditional expectations. Lots of noise (relative to the width of your prior) in X2 means that E[Y2|X2] will tend to be close to the prior E[Y2] even for extreme values of X2, but E[Y2|X2] is still a better estimate of that portion of the sum than E[Y2] is.
But that's not mysterious, that's just regression to the mean.
I don't understand—in what way is it regression to the mean?
Also, what does that have to do with my original comment, which is that you will do better by dropping high-variance terms?
You said you should drop X if you know that your estimate is high variance but that the actual values don't vary much. Knowing that the actual value doesn't vary much means your prior has low variance, while knowing that your estimate is noisy means that your prior for the error term has high variance.
So when you observe an estimate, you should attribute most of the variance to error, and regress your estimate substantially towards your prior mean. After doing that regression, you are better off including X than dropping it, as far as I can see. (Of course, if the regressed estimate is sufficiently small then it wasn't even worth computing the estimate, but that's a normal issue with allocating bounded computational resources and doesn't depend on the variance of your estimate of X, just how large you expect the real value to be.)
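(A minimal simulation sketch of this point, under assumed independent Gaussian priors and Gaussian measurement noise; the particular numbers are made up for illustration and aren't taken from anything above. It compares the raw sum X1+X2+X3, the truncated sum X1+X3, and the sum of the regressed per-term estimates E[Yi|Xi].)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Priors: Y_i ~ N(mu_i, tau_i^2); noisy measurements: X_i = Y_i + N(0, sigma_i^2).
mu = np.array([1.0, 0.0, -0.5])     # prior means
tau = np.array([1.0, 1.0, 1.0])     # prior standard deviations
sigma = np.array([0.5, 10.0, 0.5])  # measurement noise; X2 is much noisier

Y = mu + tau * rng.standard_normal((n, 3))
X = Y + sigma * rng.standard_normal((n, 3))

# Regression toward the mean under these Gaussian assumptions:
# E[Y_i | X_i] = mu_i + tau_i^2 / (tau_i^2 + sigma_i^2) * (X_i - mu_i)
shrink = tau**2 / (tau**2 + sigma**2)
regressed = mu + shrink * (X - mu)

target = Y.sum(axis=1)  # the quantity Y1 + Y2 + Y3 we are trying to estimate
estimators = {
    "raw sum X1+X2+X3": X.sum(axis=1),
    # Prior mean of Y2 is 0 here, so dropping X2 amounts to using the prior mean for that term.
    "drop the noisy term: X1+X3": X[:, [0, 2]].sum(axis=1),
    "sum of regressed estimates": regressed.sum(axis=1),
}
for name, est in estimators.items():
    mse = np.mean((est - target) ** 2)
    print(f"{name:30s} mean squared error = {mse:.3f}")
```

With these numbers the raw sum does terribly, the truncated sum does well, and the sum of regressed estimates does slightly better still, which is consistent with the earlier parenthetical that heavy regression is "more or less the same as throwing it out," but not quite.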
Of course, any time you toss something out, that corresponds to giving it negligible weight. And of course, accuracy-wise, under limited computing power, you're better off actually tossing it out and using the computing time elsewhere to increase accuracy more.
If I have a noisy estimate and a prior, I should regress towards the mean. By the “ideal case” do you mean the case in which my estimates have no noise? That is a strange idealization, which people might implicitly use but probably wouldn’t advocate.
I was primarily referring to the wide-eyed optimism prevalent on these boards: attend some workshops, become more rational, and win. It's not that people advocate not regressing to the mean; it's that they don't even know this is an issue (and a difficult issue when the probability distribution and its mean are themselves something you need to find out). In the ideal case, you have a sum over all terms—it is not an estimate at all—you don't discard any terms; discarding terms or applying any extra scaling only makes it less ideal, and so on. And so you have people seeing it as biases and imagining enormous gains to be had from doing something formal-inspired instead. I have a cat test: can you explicitly determine whether something is a picture of a cat based on a list of numbers representing pixel luminosities? That is the size of the gap between implicit processing of the evidence and explicit processing of the evidence.
But I don’t see why either of these properties—reflecting symmetries, summing to one over exclusive alternatives—are necessary for good outcomes. Suppose that I am trying to estimate the relative goodness of two options in order to pick the best. Why should it matter whether my beliefs have these particular consistency properties, as long as they are my best available guess?
This needs a specific example. Some people were worrying over a very, very far-fetched scenario, being unable to assign it a low enough probability. The property of summing to 1 over the enormous number of similarly far-fetched, mutually exclusive scenarios would definitely have helped, compared to the state of—I suspect—summing to a very, very large number. Then they were taught a little bit of rationality, and now they know probability is subjective, which makes them inclined to treat their numerical assessment of a feeling (which may well already incorporate the alleged impact) as a probability, and to multiply it by something. Other bad patterns include inversion of probability—why are you so extremely certain of the negation of an event? People expect that probabilities close to 1 require evidence, and without any they are reluctant to assign something close to 1, even though in that case it represents a sum over almost the entire hypothesis space.
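(A toy version of the coherence constraint being pointed to here, under the stated assumption that the scenarios are mutually exclusive: if $S_1, \dots, S_N$ are mutually exclusive, then
$$\sum_{i=1}^{N} P(S_i) \le 1,$$
so the average probability can be at most $1/N$. Assigning each of, say, thousands of such scenarios a probability on the order of a percent makes the total blow up, which is the failure described above.)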
With respect to the other points, I agree that estimation is hard, but the difficulties you cite seem to fit pretty squarely into the simple theoretical framework of computing a well-calibrated estimate of expected value. So to the extent there are gaps between that simple framework and reality, these difficulties don’t point to them.
not a question for which you actually know the expert consensus.
I do not see the people most educated in these matters (or, indeed, in the theory) running "rationality workshops" that advocate explicit theory-based reasoning; that's what I mean. And the people I do see, I would not even suspect of expertise if they hadn't themselves claimed it.
This would be a fine response if I were trying to cast myself as better than experts because I have such an excellent clean theory (and I have little patience with Eliezer for doing this). But in fact I am just trying to say relatively simple things in the interest of building up an understanding.
Yes, I certainly agree here—first take simple steps in the right direction.
I think mostly you are arguing against LW in general, which seems fine but not particularly helpful here or relevant to my point.
Some people were worrying over a very, very far-fetched scenario, being unable to assign it a low enough probability. The property of summing to 1 over the enormous number of similarly far-fetched, mutually exclusive scenarios would definitely have helped, compared to the state of—I suspect—summing to a very, very large number.
What is the "very, very far-fetched scenario"? If you mean the intelligence explosion scenario, I do think this is reasonably unlikely, but:
Eliezer thinks this scenario is very likely, and many people around here agree. This is hardly a problem of being unwilling to assign a probability too close to 0.
In what sense is fast takeoff one hypothesis out of a very large number of equally plausible hypotheses? It seems like a fast takeoff is a priori reasonably likely, and the main reasons you think it unlikely are that experts don't take it seriously and that it seems incongruous with other tech progress. This seems unrelated to your critique.
Everyone agrees that if you know X is biased you should respond appropriately.
Well, suppose a person A is a beneficiary of the argument X which he or she brought to your attention. The argument X is one out of potentially very many arguments, some favourable to A and some unfavourable to A. I don't think it is common understanding here, at all, that an expected-utility estimate which does not include X at all may correspond to the true expected utility more accurately than an estimate which includes X, even if there is nothing fallacious about the argument X itself (merely about the manner of its selection).
Really?
Again, I agree that people posting on LW may get things wrong. This is a public forum on the internet. But it seems like you are stretching here.
I don’t think this stresses enough that the arguments which have to be discarded from the sum may not even be invalid (the arguments it talks about are complete non-sequiturs). Also, people usually don’t just write the bottom line first. They end up in circumstances where particular bottom line fits their needs (both in form of ego gratification and money), and then produce necessary keystrokes, vibrations of the air, and the like, the same general mechanism which makes you navigate to food when hungry.
It also been stressed by exact same author on multiple occasions that anyone rational should become convinced of his beliefs after reading brief introduction to the topic and his collection of arguments in their favour (e.g. MWI).
the arguments it talks about are complete non-sequiturs
This is simply not true. It is, in fact, the exact opposite of the truth. The point being made by that posting is precisely that even valid arguments towards a given conclusion may be of approximately zero evidential value, if there would be some such arguments even if the conclusion were false and the cause of the arguments’ having been made is something other than the truth of the conclusion.
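(A sketch of that point in Bayesian terms: let $R$ stand for "the clever arguer presents some true argument for conclusion $C$." If the arguer would find such an argument whether or not $C$ holds, then $P(R \mid C) \approx P(R \mid \neg C)$, so the likelihood ratio is close to 1 and
$$P(C \mid R) \approx P(C).$$
The individual arguments can be perfectly valid while the fact that they were presented carries almost no evidence.)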
Also, people usually don’t just write the bottom line first.
Of course not. I thought that aspect of the thought experiment was just to make it clearer and more vivid. The same argument proceeds in much the same way (though sometimes with lesser strength) when the bottom line doesn’t get explicitly written down until later.
[EDITED to fix a formatting screwup. Incidentally, if whoever downvoted this would like to explain why then I’ll try to improve any deficiencies in my thinking or writing that get exposed. But since it looks like what’s actually happened is that someone downvoted almost all my ~20-30 most recent comments indiscriminately, I’m not terribly optimistic about that.]
Ghmm. Are those valid arguments:
?
Look two paragraphs further up to where he's setting the scene for this thought experiment:
There are all manner of signs and portents indicating whether a box contains a diamond; but I have no sign which I know to be perfectly reliable. There is a blue stamp on one box, for example, and I know that boxes which contain diamonds are more likely than empty boxes to show a blue stamp. Or one box has a shiny surface, and I have a suspicion—I am not sure—that no diamond-containing box is ever shiny.
And it’s in that context that he postulates a “clever arguer” who tries to persuade him by listing (true) facts like “box B shows a blue stamp”.