The Optimizer’s Curse and How to Beat It
The best laid schemes of mice and men
Go often askew,
And leave us nothing but grief and pain,
For promised joy!
- Robert Burns (translated)
Consider the following question:
A team of decision analysts has just presented the results of a complex analysis to the executive responsible for making the decision. The analysts recommend making an innovative investment and claim that, although the investment is not without risks, it has a large positive expected net present value… While the analysis seems fair and unbiased, she can’t help but feel a bit skeptical. Is her skepticism justified?1
Or, suppose Holden Karnofsky of charity-evaluator GiveWell has been presented with a complex analysis of why an intervention that reduces existential risks from artificial intelligence has astronomical expected value and is therefore the type of intervention that should receive marginal philanthropic dollars. Holden feels skeptical about this ‘explicit estimated expected value’ approach; is his skepticism justified?
Suppose you’re a business executive considering n alternatives whose ‘true’ expected values are μ1, …, μn. By ‘true’ expected value I mean the expected value you would calculate if you could devote unlimited time, money, and computational resources to making the expected value calculation.2 But you only have three months and $50,000 with which to produce the estimate, and this limited study produces estimated expected values for the alternatives V1, …, Vn.
Of course, you choose the alternative i* that has the highest estimated expected value Vi*. You implement the chosen alternative, and get the realized value xi*.
Let’s call the difference xi* - Vi* the ‘postdecision surprise’.3 A positive surprise means your option brought about more value than your analysis predicted; a negative surprise means you were disappointed.
Assume, too kindly, that your estimates are unbiased. And suppose you use this decision procedure many times, for many different decisions. It seems reasonable to expect that on average you will receive the estimated expected value of each decision you make in this way. Sometimes you’ll be positively surprised, sometimes negatively surprised, but on average you should get the estimated expected value for each decision.
Alas, this is not so; your outcome will usually be worse than what you predicted, even if your estimate was unbiased!
Why?
...consider a decision problem in which there are k choices, each of which has true estimated [expected value] of 0. Suppose that the error in each [expected value] estimate has zero mean and standard deviation of 1, shown as the bold curve [below]. Now, as we actually start to generate the estimates, some of the errors will be negative (pessimistic) and some will be positive (optimistic). Because we select the action with the highest [expected value] estimate, we are obviously favoring overly optimistic estimates, and that is the source of the bias… The curve in [the figure below] for k = 3 has a mean around 0.85, so the average disappointment will be about 85% of the standard deviation in [expected value] estimates. With more choices, extremely optimistic estimates are more likely to arise: for k = 30, the disappointment will be around twice the standard deviation in the estimates.4
This is “the optimizer’s curse.” See Smith & Winkler (2006) for the proof.
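To see the size of the effect without working through the proof, here is a minimal simulation sketch (my illustration, not code from Smith & Winkler), assuming every alternative’s true expected value is 0 and every estimate carries independent N(0, 1) error:

    import numpy as np

    rng = np.random.default_rng(0)

    def average_disappointment(k, trials=100_000):
        # Each of the k alternatives is truly worth 0; estimates add N(0, 1) noise.
        estimates = rng.normal(loc=0.0, scale=1.0, size=(trials, k))
        chosen_estimate = estimates.max(axis=1)  # we always pick the highest estimate
        realized_value = 0.0                     # but every option is really worth 0
        return float(np.mean(chosen_estimate - realized_value))

    for k in (1, 3, 10, 30):
        print(k, round(average_disappointment(k), 2))
    # Roughly: 1 -> 0.0, 3 -> 0.85, 10 -> 1.54, 30 -> 2.04

The k = 3 and k = 30 results match the 0.85 and roughly-two-standard-deviations figures quoted above.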
The Solution
The solution to the optimizer’s curse is rather straightforward.
...[we] model the uncertainty in the value estimates explicitly and use Bayesian methods to interpret these value estimates. Specifically, we assign a prior distribution on the vector of true values μ = (μ1, …, μn) and describe the accuracy of the value estimates V = (V1, …, Vn) by a conditional distribution V|μ. Then, rather than ranking alternatives based on the value estimates, after we have done the decision analysis and observed the value estimates V, we use Bayes’ rule to determine the posterior distribution for μ|V and rank and choose among alternatives based on the posterior means...
The key to overcoming the optimizer’s curse is conceptually very simple: treat the results of the analysis as uncertain and combine these results with prior estimates of value using Bayes’ rule before choosing an alternative. This process formally recognizes the uncertainty in value estimates and corrects for the bias that is built into the optimization process by adjusting high estimated values downward. To adjust values properly, we need to understand the degree of uncertainty in these estimates and in the true values…5
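As a concrete, minimal sketch of that procedure (my illustration, not the paper’s code), assume a common normal prior on the true values and independent normal error in each estimate; the posterior mean then simply shrinks each estimate toward the prior mean:

    import numpy as np

    def posterior_means(estimates, prior_mean, prior_sd, noise_sd):
        """Normal prior N(prior_mean, prior_sd^2); estimates = truth + N(0, noise_sd^2)."""
        estimates = np.asarray(estimates, dtype=float)
        shrinkage = prior_sd**2 / (prior_sd**2 + noise_sd**2)  # weight given to the data
        return prior_mean + shrinkage * (estimates - prior_mean)

    raw = [1.2, 2.5, 0.3]
    print(posterior_means(raw, prior_mean=0.0, prior_sd=1.0, noise_sd=1.0))
    # about [0.6, 1.25, 0.15]: the most optimistic value is pulled down the hardest

When every estimate carries the same uncertainty this leaves the ranking unchanged and only corrects the magnitudes (and hence the expected postdecision surprise); the ranking itself can change only when different alternatives carry different amounts of uncertainty.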
To return to our original question: Yes, some skepticism is justified when considering the option before you with the highest expected value. To minimize your prediction error, treat the results of your decision analysis as uncertain and use Bayes’ Theorem to combine its results with an appropriate prior.
Notes
1 Smith & Winkler (2006).
2 Lindley et al. (1979) and Lindley (1986) talk about ‘true’ expected values in this way.
3 Following Harrison & March (1984).
4 Quote and (adapted) image from Russell & Norvig (2009), pp. 618-619.
5 Smith & Winkler (2006).
References
Harrison & March (1984). Decision making and postdecision surprises. Administrative Science Quarterly, 29: 26–42.
Lindley, Tversky, & Brown (1979). On the reconciliation of probability assessments. Journal of the Royal Statistical Society, Series A, 142: 146–180.
Lindley (1986). The reconciliation of decision analyses. Operations Research, 34: 289–295.
Russell & Norvig (2009). Artificial Intelligence: A Modern Approach, Third Edition. Prentice Hall.
Smith & Winkler (2006). The optimizer’s curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52: 311–322.
But all you’ve done after “adjusting” the expected value estimates is produce a new batch of expected value estimates, which just shows that the original expected value estimates were not done very carefully (if there was an improvement), or that you face the same problem all over again...
Am I missing something?
I’m thinking of this as “updating on whether I actually occupy the epistemic state that I think I occupy”, which one hopes would be less of a problem for a superintelligence than for a human.
It reminds me of Yvain’s Confidence Levels Inside and Outside an Argument.
I expect it to be a problem—probably as serious—for superintelligence. The universe will always be bigger and more complex than any model of it, and I’m pretty sure a mind can’t fully model itself.
Superintelligences will presumably have epistemic problems we can’t understand, and probably better tools for working on them, but unless I’m missing something, there’s no way to make the problem go away.
Yeah, but at least it shouldn’t have all the subconscious signaling problems that compromise conscious reasoning in humans- at least I hope nobody would be dumb enough to build a superintelligence that deceives itself on account of social adaptations that don’t update when the context changes...
I must admit that I did not understand everything in the paper, but I think this excerpt summarizes a crucial point:
“The key issue here is proper conditioning. The unbiasedness of the value estimates V_i discussed in §1 is unbiasedness conditional on mu. In contrast, we might think of the revised estimates ^v_i as being unbiased conditional on V. At the time we optimize and make the decision, we know V but we do not know mu, so proper conditioning dictates that we work with distributions and estimates conditional on V.”
The proposed “solution” converts n independent evaluations into n evaluations (estimates) that respect the selection process, but, as far as I can tell, they still rest on prior value estimates and prior knowledge about the uncertainty of those estimates… Which means the “solution” at best limits introduction of optimizer bias, and at worst… masks old mistakes?
Well in some circumstances, this kind of reasoning would actually change the decision you make. For example, you might have one option with a high estimate and very high confidence, and another option with an even higher estimate, but lower confidence. After applying the approach described in the article, those two options might end up switching position in the rankings.
BUT: Most of the time, I don’t think this approach will make you choose a different option. If all other factors are equal, then you’ll probably still pick the option that has the highest expected value. I think that what we learn from this article is more about something else: It’s about understanding that the final result will probably be lower than your supposedly “unbiased” estimate. And when you understand that, you can budget accordingly.
The big problem arises when the number of choices is huge and sparsely explored, such as when optimizing a neural network.
But restricting ourselves to n superficially evaluated choices with known estimate variance in each evaluation and with independent errors/noise, then if – as in realistic cases like Monte Carlo Tree Search – we are allowed to perform some additional “measurements” to narrow down the uncertainty, it will be wise to scrutinize the high-expectance choices most – in a way trying to “falsify” their greatness, while increasing the certainty of their greatness if the falsification “fails”. This is the effect of using heuristics like the Upper Confidence Bound for experiment/branch selection.
UCB is also described as “optimism in the face of uncertainty”, which kind of defeats the point I am making if it is deployed as decision policy. What I mean is that in research, preparations and planning (with tree search in perfect information games as a formal example where UCB can be applied), one should put a lot of effort into finding out whether the seemingly best choice (of path, policy, etc.) really is that good, and then make a final choice that penalizes remaining uncertainty.
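For readers who haven’t seen it, here is a rough sketch of the UCB1 rule being referred to (the standard textbook formula; the code is only illustrative):

    import math

    def ucb1_choice(means, counts, exploration=math.sqrt(2)):
        """means[i]: average observed value of option i; counts[i]: times it was sampled."""
        total = sum(counts)
        def score(i):
            if counts[i] == 0:
                return float("inf")  # sample every option at least once
            return means[i] + exploration * math.sqrt(math.log(total) / counts[i])
        return max(range(len(means)), key=score)

    # The uncertainty bonus routes extra "measurements" to poorly explored options
    # during search; a final one-shot decision would instead penalize uncertainty.
    print(ucb1_choice(means=[0.5, 0.8], counts=[100, 5]))  # -> 1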
I would like to throw in a Wikipedia article on a relevant topic, which I came across while reading about the related “Winner’s curse”: https://en.wikipedia.org/wiki/Order_statistic
The math for order statistics is quite neat as long as the variables are independently sampled from the same distribution. In real life, “sadly”, choice evaluations may not always be from the same distribution… Rather, they are by definition conditional upon the choices. (https://en.wikipedia.org/wiki/Bapat%E2%80%93Beg_theorem provides a kind of solution in the form of an intractable colossus of a calculation.) That is not to say that there can be found no valuable/informative approximations.
In statistics the solution you describe is called Hierarchical or Multilevel Modeling. You assume that your data is drawn from a set of distributions which have their parameters drawn from another distribution. This automatically shrinks your estimates of the distributions towards the mean. I think it’s a pretty useful trick to know, and it would be good to do a writeup, but you might need a decent grasp of Bayesian statistics first.
Here’s an example, with code, for anyone interested (it’s not by me, I add): http://sl8r000.github.io/ab_testing_statistics/use_a_hierarchical_model/
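In case the link ever dies, here is a bare-bones empirical-Bayes sketch of the same idea (illustrative only, with made-up numbers): the spread of the true values is estimated from the pool of noisy estimates, and each estimate is then pulled toward the pool mean accordingly.

    import numpy as np

    def shrink_toward_pool(estimates, noise_sd):
        estimates = np.asarray(estimates, dtype=float)
        pool_mean = estimates.mean()
        # Method-of-moments guess at how much of the observed spread is real signal:
        signal_var = max(estimates.var(ddof=1) - noise_sd**2, 0.0)
        weight = signal_var / (signal_var + noise_sd**2)  # 0 = all noise, 1 = all signal
        return pool_mean + weight * (estimates - pool_mean)

    print(shrink_toward_pool([3.1, 0.4, 5.2, 1.8], noise_sd=2.0))
    # The observed spread is barely more than the noise alone would produce,
    # so nearly all of the differences get attributed to noise.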
The central point of the optimizer’s curse is not one I have seen before, and it is a very interesting point.
The solution however leaves me feeling slightly unhappy. It isn’t obvious to me what prior one should use in this sort of context. I suspect that a rough estimate by simply using the rule of thumb that the more complicated a logical chain the more likely there is a problem in it might do similar work at a weaker level.
Have you tried to apply this sort of reasoning explicitly to various existential risk considerations? If so, what did you get?
Reminds me of the winner’s curse in auctions—the selected bid is the one that is the highest and so most likely to be due to overconfidence/bias.
Yes, I recognized that similarity as well. As an aside, Fantasy Football (especially with an auction draft) is a great example to use when explaining these overestimation effects to laypeople.
--2001: A Space Odyssey (Homer, translated from ancient Latin)
Interesting sourcing on that quote. I’m not sure what you meant to say with it, so I’ll elaborate.
In fantasy sports, you begin by calculating an expected value for each player over the upcoming season. These values are used to construct your team in a draft, which is either turn-based (A picks a player, then B, then C) or auction-based (A, B, and C bid on players from a fixed initial pool of money). As the season goes on, you update your expected values with evidence from the past week’s games in order to decide which players will be active and accrue points for your fantasy team.
The analogy should be obvious for most folks here. You’re combining evidence to form a probability (how good was he last season? Is the new coach’s game plan going to help or hurt his stats? Is he a particularly high injury risk?) and multiplying by utility to form a preference ranking. In an auction draft, the pricing mechanism even requires you to explicitly compute the expected utility values. When games are played, you update on evidence and revise your rankings.
Most people have a hard time relating to decision theory because it doesn’t “feel like” what goes on in their head when they make decisions. Fantasy sports is a useful example because it makes the process explicit. I didn’t fully realize how good a fit it is before this conversation—maybe I should write up an introductory rationality piece on this foundation.
The quote is from Orwell’s 1984. The proles are generally ignorant, but good at tracking lottery numbers because it is a game. That’s right, I just generalized from fictional evidence!
I figured if people are going to complain about the Burns quote, I’d give them something to really complain about. Wrong book with a date as a title, wrong author of an Odyssey, wrong language.
Fantasy sports is a great example of where this would be useful, and I can’t think of a better analogy.
Am I missing something, or does the post just say that we shouldn’t use frequentist “unbiased estimators” as if they were Bayesian posterior expected values?
Not quite. If you were to do individual bayesian estimates you would have the same problem because there is shared prior information that would remain unmodeled.
Are you pointing out that each individual Bayesian estimate must be conditioned on all the information available, or is it more subtle than that?
Nope, that’s it.
Lukeprog, if I’ve understood you correctly, then this is no good; this is a corner case. The question to be answered here is whether we should expect a “common sense” executive who favors plans with a high prior estimate to do better than a “technical” analyst who favors plans that perform well according to the formal estimation criteria. By assuming that all prior estimates are identical except for bias, this assumption ensures that the technical analyst will win. This, however, begs the question. One could just as easily assume that there is large variation in the true expected values, and that the formal criteria will always produce an estimate of 0, in which case the common sense executive will always win.
Am I missing something? I like the topic; I would enjoy reading about which approach we should expect to perform better in a typical situation.
I think the case where all the choices have a “true expected value” of 0 is picked out merely to illustrate the problem.
Yes.
That’s fine; you’re more than welcome to illustrate the problem, and your analysis does in fact do that. It does it very well; your writing, as always, is very lucid.
However, you finish the article by claiming that Bayesian analysis can correct for the problem, and this is something that (I don’t think) you even begin to show. Bayesian analysis solves the corner case, but does it bring any traction at all on a typical case?
I think it’s worse than that: Karnofsky’s problem is that he has to compare moderate-mean low-variance estimates to large-mean large-variance estimates, but lukeprog’s solution is for comparing the estimate to the result in cases where the variance is equal across the board.
Put another way, the higher the variance in the true payoffs, the less relevant the curse. This is the flipside of: the more accurate the estimates, the less relevant the curse.
Is there an example where applying this correction to the expected values changes the decision?
In any group there’s going to be random noise, and if you choose an extreme value, chances are that value was inflated by noise. In Bayesian terms: given that something has the highest value, it probably had positive noise, not just positive signal. So the correction is to correct out the expected positive noise you get from explicitly choosing the highest value. Naturally, this correction is greater when the noise is bigger.
So imagine choosing between black boxes. Each black box has some number of gold coins in it, and also two numbers written on it. The first number, A, on the box is like the estimated expected value, and the second number, B, is like the variance. What happened is that someone rolled two distinct dice with B sides, subtracted die 1 from die 2, and added that to the number of gold coins in the box.
So if you see a box with 40, 3 written on it, you know that it has an expected value of 40 gold coins, but might have as few as 37 or as many as 43.
Now comes the problem: I put 10 boxes in front of you, and tell you to choose the one with the most gold coins. The first box is 50, 1 - a very low-variance box. But the last 9 boxes are all high-uncertainty, all with B=20. The expected values printed on them are as follows [I generated the boxes honestly] : 53, 52, 37, 60, 44, 36, 56, 45, 54. Ooh, one of those boxes has a 60 on it! Pick that one!
Okay, don’t pick that one. Think about it—there are 9 boxes with high variance, and the one you picked probably has unusually large noise. To be special among 9 proposals with high variance, it probably has noise at the 80th+ percentile. What’s the 80th percentile of noise for 1d20 − 1d20? I bet it’s larger than 10. You’re better off just going with the 50, 1 box.
And it’s a good thing you applied that correction, because I generated the boxes by typing “RandomInteger[20,9] - RandomInteger[20,9] + 45” into Wolfram Alpha—they each contain 45 coins.
So this illustrates that what beating the optimizer’s curse really amounts to is a sort of “correction for multiple comparisons.” If you have a lot of noisy boxes, some of them will look large even when they’re not, even larger than non-noisy boxes.
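As a sanity check, here is a quick simulation of that game (my sketch, using the stated generation rule: nine noisy boxes each hold 45 coins and their labels add 1d20 - 1d20 noise, while the safe box holds exactly 50):

    import numpy as np

    rng = np.random.default_rng(1)
    trials = 100_000
    coins_received = []
    for _ in range(trials):
        noise = rng.integers(1, 21, size=9) - rng.integers(1, 21, size=9)
        labels = 45 + noise              # what is printed on the nine noisy boxes
        if labels.max() > 50:
            coins_received.append(45)    # "pick the top label" grabs a noisy box
        else:
            coins_received.append(50)    # otherwise the safe 50,1 box has the top label
    print(np.mean(coins_received))       # about 45.3, versus a guaranteed 50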
That is a good example of how the optimizer’s curse causes an overestimate of the maximum expected value, and even reliably causes a wrong choice to be associated with the maximum expected value. But how do I apply the correction mathematically, so I can know for which printed expected values on the high-uncertainty boxes I should expect the best of them to be better or worse than the low-uncertainty box? Even better, how can I deal with situations where the uncertainties of the expected values are not so conveniently categorized (and whose actual values aren’t conveniently uniform)?
Oh—I learned how, by the way. You start with some prior over how you expect the actual coins to be distributed, and then you convolve in the noise distribution of each box to get the combined distribution for each box. Then, given where the number on the outside of each box falls on the combined distribution, you can assign how much of that you expect to be signal and how much you expect to be noise by distributing improbability equally between signal and noise. Then you subtract out the expected noise.
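Here is one way to carry that calculation out numerically (my sketch; the prior below is an assumed normal guess over coin counts, not something specified in the thread): lay out a grid of possible contents, multiply the prior by the likelihood of the observed label under 1d20 - 1d20 noise, and take the posterior mean.

    import numpy as np

    def noise_pmf(d):
        """Probability mass function of (1d20 - 1d20)."""
        d = np.abs(d)
        return np.where(d <= 19, (20 - d) / 400.0, 0.0)

    def corrected_estimate(label, prior_mean=45.0, prior_sd=10.0):
        coins = np.arange(0, 201)                                  # possible true contents
        prior = np.exp(-0.5 * ((coins - prior_mean) / prior_sd) ** 2)
        likelihood = noise_pmf(label - coins)                      # P(label | contents)
        posterior = prior * likelihood
        posterior /= posterior.sum()
        return float(np.dot(coins, posterior))

    print(corrected_estimate(60))  # well below 60: much of the high label is expected noise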
I’m not sure. It’s probably in the paper.
I’m trying to figure out why, from the rules you gave at the start, we can assume that box 60 has more noise than the other boxes with variance of 20. You didn’t, at the outset of the problem, say anything about what the values in the boxes actually were. I would not, taking this experiment, have been surprised to see a box labeled “200”, with a variance of 20, because the rules didn’t say anything about values being close to 50, just close to A. Well, I would’ve been surprised with you as a test-giver, but it wouldn’t have violated what I understood the rules to be and I wouldn’t have any reason to doubt that box was the right choice.
The box with 60 stands out among the boxes with high variance, but you did not say that those boxes were generated with the same algorithm and thus have the same actual value. In fact you implied the opposite. You just told me that 60 was an estimate of its expected value, and 37 was an estimate of one of the other boxes’ expected values. So I would assign a very high probability to it being worth more than the box labeled 37. I understand that the variance is being effectively applied twice to go from the number on the box to the real number of coins (The real number of 45 could make an estimate anywhere from 25 to 65, but if it hit 25 I’d be assigning the real number a lower bound of 5 and if it hit 65 I’d be assigning the real number an upper bound of 85, which is twice that range). (Actually for that reason I’m not sure your algorithm really means there’s a variance of 20 from what you state the expected value to be, but I don’t feel like doing all the math to verify that since it’s tangential to the message I’m hearing from you or what I’m saying). But that doesn’t change the average. The range of values that my box labeled 60 could really contain is higher than the range the box labeled 37 could really contain, to the best of my knowledge, and both are most likely to fall within a couple coins of the center of that range, with the highest probability concentrated on the exact number.
If the boxes really did contain different numbers of coins, or we just didn’t have reason to assume that they don’t contain different numbers, the box labeled 60 is likely to contain more coins than that 50/1 box did. It is also capable of undershooting 50 by ten times as much if unlucky, so if for some reason I absolutely cannot afford to find less than 50 coins in my box the 50/1 box is the safer choice—but if I bet on the 60/20 box 100 times and you bet on the 50/1 box 100 times, given the rules you set out in the beginning, I would walk away with 20% more money.
Or am I missing some key factor here? Did I misinterpret the lesson?
The key factor is that the 60,20 box is not in isolation—it is the top box, and so not only do you expect it to have more “signal” (gold) than average, you also expect it to have more noise than average.
You can think of the numbers on the boxes as drawn from a probability distribution. If there was 0 noise, this probability distribution would just be how the gold in the boxes was distributed. But if you add noise, it’s like adding two probability distributions together. If you’re not familiar with what happens, go look it up on wikipedia, but the upshot is that the combined distribution is more spread out than the original. This combined distribution isn’t just noise or just signal, it’s the probability of having some number be written on the outside of the box.
And so if something is the top, very highest box, where should it be located on the combined distribution?
Now, if you have something that’s high on the combined distribution, how much of that is due to signal, and how much of it is due to noise? This is a tougher question, but the essential insight is that the noise shouldn’t be more improbable than the signal, or vice versa—that is, they should both be about the same number of standard deviations from their means.
This means that if the standard deviation of the noise is bigger, then the probable contribution of the noise is greater.
Me saying the same thing a different way can be found here.
Oh, I understand now. Even if we don’t know how it’s distributed, if it’s the top among 9 choices with the same variance that puts it in the 80th percentile for specialness, and signal and noise contribute to that equally. So it’s likely to be in the 80th percentile of noise.
It might have been clearer if you’d instead made the boxes actually contain coins normally distributed about 40 with variance 15 and B=30, and made an alternative of 50/1, since you’d have been holding yourself to more proper unbiased generation of the numbers and still, in all likelihood, come up with a highest-labeled box that contained less than the sure thing. You have to basically divide your distance from the norm by the ratio of specialness you expect to get from signal and noise. The “all 45” thing just makes it feel like a trick.
I think there’s some value in that observation that “the all 45 thing makes it feel like a trick”. I believe that’s a big part of why this feels like a paradox.
If you have a box with the numbers “60” and “20″ as described above, then I can see two main ways that you could interpret the numbers:
A: The number of coins in this box was drawn from a probability distribution with a mean of 60, and a range of 20.
B: The number of coins in this box was drawn from an unknown probability distribution. Our best estimate of the number of coins in this box is 60, based on certain information that we have available. We are certain that the actual value is within 20 gold coins of this.
With regards to understanding the example, and understanding how to apply the kind of Bayesian reasoning that the article recommends, it’s important to understand that the example was based on B. And in real life, B describes situations that we’re far more likely to encounter.
With regards to understanding human psychology, human biases, and why this feels like a paradox, it’s important to understand that we instinctively tend towards “A”. I don’t know if all humans would tend to think in terms of A rather than B, but I suspect the bias applies widely amongst people who’ve studied any kind of formal probability. “A” is much closer to the kind of questions that would be set as exercises in a probability class.
That’s true—when I wrote the post you replied to I still didn’t really understand the solution—though it did make a good example for JGWeissman’s question. By the time I wrote the post I linked to, I had figured it out and didn’t have to cheat.
But if you don’t know that all the high variance boxes have the same mean then 60 is the one to go with. And if you do know they have the same mean, then its expected value is no longer 60.
Imagine putting gold coins into a bunch of boxes by having them normally distributed about 50 gold coins with standard deviation 10. Then we’ll add some Gaussian noise to the estimates on the boxes—but we’ll split them into 2 groups. Ten boxes will have noise with standard deviation of 5, while the other ten will have a standard deviation of 25.
But since I’ve still kept the simple situation where we just have 2 groups, you can get the overall biggest by just picking the biggest from each group and comparing them. So we can treat the groups independently for a bit. The biggest one is going to have the biggest positive deviation from 50, combined signal and noise. Because I used normal distributions this time, the combined prior+noise distribution is just a bigger normal distribution. So given that something is big or small by this combined distribution, how do we expect the signal and noise distributions to shift? Well, it would be silly to expect one of them to be more improbable than the other, so we expect their means to shift by about the same number of standard deviations for each distribution. This right there means that the bigger the noise, the more of the variation we should attribute to noise. And also the bigger the element in the combined distribution, the larger we should expect its noise to be.
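To put numbers on this setup (using the exact normal-normal posterior mean rather than the rough equal-deviations heuristic):

    def posterior_mean(label, prior_mean=50.0, prior_sd=10.0, noise_sd=5.0):
        # Gold ~ N(prior_mean, prior_sd^2); label = gold + N(0, noise_sd^2) noise.
        w = prior_sd**2 / (prior_sd**2 + noise_sd**2)
        return prior_mean + w * (label - prior_mean)

    print(posterior_mean(70, noise_sd=5.0))   # 66.0: a high label in the precise group is mostly believed
    print(posterior_mean(75, noise_sd=25.0))  # about 53.4: a higher label in the noisy group is mostly discounted

So the noisy group’s top label can be higher than the precise group’s and still lose after the correction, which is the reranking effect discussed in the replies below.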
But if you know the boxes were originally drawn from N(50,100) then the number on the box is no longer the correct Bayesian mean. All I’m arguing is that once you have your Bayesian expected value you don’t need to update it any further.
That’s pretty uncontroversial, but in practice it means that you end up penalizing high-noise boxes with high values (and boosting high-noise boxes with low values), which I think is a nontrivial result.
I’m trying to imagine a scenario.
Possibly the decider knows that people sometimes make multiplicative errors, transposing numbers or misplacing decimals, and is confronted with a set of estimates hovering around, say, 0.05 (and that is plausible according to the decider’s prior) and a few estimates at around 0.5 and 5.0. Would the correction effectively trim the outliers back to almost exactly 0.05 (because we can’t learn much information from an estimate that probably had at least one mistake in it), and the decider should go with the highest of the “plausible” numbers?
It seems to me like the conditional distributions that would lead to actually changing your decision are nearly as likely to be a source of error as a correction.
Would this issue also apply to picking a contractor for a project based on the lowest bid?
No, because the lowest bid is a commitment from the contractor, not an estimate. This particular problem arises from trying to pick the best option from several estimates.
Sometimes contractors run out of money before finishing and you have to pay more or they leave you with a half-finished project :(
It would probably lead to contractors selected that way often going over budget.
I’m not sure how exactly this differs from the GiveWell blog post along the same lines? You seem to both be dealing with roughly the same problem (decision making under uncertainty), and reach the same conclusion (pay attention to the standard deviation, use Bayesian updates)
I did find your graph in the middle a rather useful illustration, but otherwise don’t feel like I’ve come away with anything really new...
Well, to start with, Luke has provided an actual mechanism for this mistake to occur by.
This is interesting, but I don’t see how to apply the solution. Presumably I either have no priors; or the priors are going to be generated by the same process I use to generate the values I am combining them with.
The resulting bias should be smaller if you choose the top 2 or 3 alternatives. E.g., give to 3 charities, not to 1.
How do market traders deal with this problem?
If I understand this correctly, there’s an empirical problem.
How optimistic your most optimistic estimate is going to be is going to be a matter of temperament and knowledge for individuals, and group culture for groups. It seems to me that the correction would need to be determined by experience. Or is this the “appropriate prior” problem?
When I’d only seen the title for this article, I thought it was going to be about the question of how much effort you should put into optimizing.
This is nit-picky, but I don’t think you should attribute to Robert Burns anything other than the words he actually wrote. Meanings change a lot in translation, and it’s not quite fair to do that through invisible sleight of hand. “Robert Burns (standard English translation)” would serve to CYA.
The original lines:
The best laid schemes o’ Mice an’ Men
Gang aft agley,
An’ lea’e us nought but grief an’ pain,
For promis’d joy!
are little different than the version Luke quoted, and are mostly understandable (with the exception of “gang aft agley”) to a sophisticated English reader with no special knowledge. I am somewhat inclined to call that version a rewrite rather than a translation, just as I would consider some modernized versions of Shakespeare to not be translations, but rewrites.
The standard problem of drawing lines in a continuum rears its head again. There are some reasonable arguments for calling Scots from this time a dialect of English, and many others for calling it a separate language. This is complicated by people’s personal and national identities being involved. Questions like these generally end up being settled more by politics than by details of the different linguistic varieties involved.
Okay, I added ‘(translated)’.
Would you say the same thing if a translation had been quoted of a poem originally in Latin or French?
(My guess: probably not. No one talks about a “standard English translation” of Catullus or Baudelaire. Instead, they credit the translator by name, or simply take the liberty of using the translation as if it were the original author’s words.)
The translator should absolutely be credited by name if he or she is known. Burns has passed kind of into folk status, and is a special case.
I would never quote Catullus or Baudelaire in English as if it were the original author’s words. No. It’s wrong (deprives the translator of rightful credit) -- and, FWIW, it’s also low-status.
What matters, obviously, is not whether Burns has passed into folk status, but whether the particular translation has. The latter seems an implausible claim (since printed translations can presumably be traced and attributed), but if it were true, then there would be no need for acknowledgement (almost by definition of “folk status”).
My comment arose from the suspicion that you reacted as if Burns had been paraphrased, as opposed to translated—because the original language looks similar enough to English that a translation will tend to look like a paraphrase. I find it unlikely that you would actually have made this comment if lukeprog had quoted Catallus without mentioning the translator; and on the other hand I suspect you would have commented if he had taken the liberty of paraphrasing (or “translating”) a passage from Shakespeare into contemporary English without acknowledging he had done so. My point being that the case of Burns should be treated like the former scenario, rather than the latter, whereas I suspect you intuitively perceived the opposite.
All translation is paraphrase, of course—but there is a difference of connotation that corresponds to a difference in etiquette. When one is dealing with an author writing in the same language as oneself, there is a certain obligation to the original words that does not (cannot) exist in the case of an author writing in a different language. So basically, I saw your comment as not-acknowledging that Burns was writing in a different language.
I don’t see it as lowering the status of the quoter; the status dynamic that I perceive is that it grants very high status to the original author, status so high that we’re willing to overlook the original author’s handicap of speaking a different language. In effect, it grants them honorary in-group status.
For example: Descartes has high enough status that the content of his saying “I think therefore I am” is more important to us than the fact that his actual words would have sounded like gibberish (unless we know French); people who speak gibberish normally have low status. Or, as Arnold Schoenberg once remarked (probably in German), “What the Chinese philosopher says is more important than that he speaks Chinese”. Only high-status people like philosophers get this kind of treatment!
Google has let me down in finding this quote, both in English and in roughly-translated German. Where is this from?
A statement like this is attributed to Schoenberg by a number of people, but I can’t find a specific reference either. Perhaps it was just something he said orally, without ever writing it anywhere.
The earliest reference I can track down is from 1952. In Roger Sessions: a biography (2008), Andrea Olmstead writes:
(The work that Sessions had performed this role in appears to have been Man who ate the popermack in the mid-1920s.)
Sessions’ essay (originally published in The Score and then collected in Roger Sessions on Music) begins:
An entertaining later reference to this quotation appears in Dialogues and a diary by Igor Stravinsky and Robert Craft (1963), where Stravinsky tabulates the differences between himself and Schoenberg, culminating in this comparison:
This seems to have been Stravinsky’s playful characterization of Schoenberg. See Dialogues by Igor Stravinsky and Robert Craft, p. 108, where Stravinsky tabulates the differences between himself and Schoenberg, culminating in:
I guess it’s possible that Stravinsky is quoting Schoenberg here, but the parallelism suggests not, and when he does quote Schoenberg (as in row 1 in the table), he gives a citation.
Right. But there are no hard-and-fast lines for “same language as oneself”.
You and I both brought up comparisons with Shakespeare. Both can be difficult to read for a struggling reader. For a sophisticated reader, the gist of both can be gotten with a modicum of effort. Full understanding of either requires a specialized dictionary, as vocabulary is different. So was Shakespeare writing in a different language? Was Burns? What’s the purpose of this distinction? If it’s weighing understanding vs adherence to the original wording, the trade-off is fairly close to the same place for the two. On the other hand, if it’s to acknowledge the politic linguistic classification that Scots is a separate language from Modern English, there is a distinction, as no one cares whether Early Modern English is treated as a separate language from Modern English. (EDIT: I should say that I do think it’s often more useful to consider Scots a separate language. Just because Burns was mostly intelligible to the English does not mean that other authors or speakers generally were.)
Meditations was first published in Latin.
My comment arose from the suspicion that you reacted as if Burns had been paraphrased, as opposed to translated
I don’t know what to tell you except that you’re wrong. I know the original poem pretty well (“Gang aft agley” is a famous phrase in some circles). Burns isn’t my specific field, but my impression, backed by a cursory Wikipedia search, is that the name of the original translator has been lost to the mists of history. If anyone can correct me and supply the original translator’s name, I’ll be truly grateful.
I don’t see it as lowering the status of the quote
Yes, you wouldn’t, and I can’t prove it to you except by assembling a conclave of Ivy League-educated snooty New York poets who happen to not be here right now. I will tell you—and you can update scantily, since you don’t trust the source—that the high-status thing to do is to provide quotes in the original language without translation. You are thereby signalling that not only do YOU read Scots Gaelic (fluently, of course), but you expect everyone you come into contact with socially to ALSO be fluent in Scots Gaelic.
The medium-status thing to do is at least to credit or somehow mark the translator, so that people think you are following standard academic rules for citation.
The reason that quoting translations without crediting them as such is low-status is that it leaves you open to charges of not understanding the original source material.
Scots Gaelic is not Scots (is not Scottish English, though modern speakers of Scots do generally code switch into it with ease, sometimes in a continuous way). Scots Gaelic is a Gaelic, Celtic language. Scots is Germanic. Burns wrote in Scots.
You’re right, and thanks for the clarification. As I said, Burns isn’t really my field.
Scots Gaelic is a thing, but it is not the language in which Burns wrote. That’s just called Scots. I wouldn’t ordinarily have mentioned it, but… you’re coming off as a bit snobby here. (O wad some Power the giftie gie us, am I right?)
This may be high status in certain social circles (having interacted with the snooty Ivy League educated New York poets also, they certainly think so) but to a lot of people doing so comes across as obnoxious and pretentious, that is an attempt to blatantly signal high status in a way that signals low status.
The highest status thing to do (and just optimal as far as I can tell for actually conveying information) is to include the original and the translation also.
I agree that this is probably optimal. My own class background is academics and published writers (both my parents are tenured professors). It’s actually hard trying to explain in a codified way what one knows at a gut level: I know that translations need to be credited, and for status reasons, but press me on the reasons and I’m probably not terribly reliable.
I find it interesting that everyone here is focusing on status; couldn’t it just be that crediting translations is absolutely necessary for the basic scholarly purpose of judging the authority and trustworthiness of the translation and even the original text? And that failing to provide attribution demonstrates a lack of academic expertise, general ignorance of the slipperiness of translation (‘hey, how important could it be?’), and other such problems.
I know I find such information indispensable for my anime Evangelion research (I treat translations coming from ADV very differently from translations by Olivier Hague and that different from translations by Bochan_bird, and so on, to give a few examples), so how much more so for real scholarship?
Well, what I originally [see edit] wrote was “It’s wrong (deprives the translator of rightful credit) -- and, FWIW, it’s also low-status.” I think people found the “low-status” part of my claim more interesting, but it wasn’t the primary reason I reacted badly to seeing a translation uncredited as such.
Edit: on reflection, this wasn’t my original justification. I simply reacted with gut-level intuition, knowing it was wrong. Every other explanation is after-the-fact, and therefore suspect.
Upvoting for realizing that a rationale wasn’t your actual reason.
Yes, agreed. I did note above that including the translation details with the original was optimal for conveying information but I didn’t emphasize it. I think that part of why people have been emphasizing status issues over serious research in this context is that the start of the discussion was about what to do with epigraphs. Since they really are just for rhetorical impact, the status issue matters more for them.
This was the case until about a decade ago, but nowadays it merely signals that you expect the audience to know how (and be willing to) use Google. (The favourite quotations section in my Facebook profile contains quotations in maths, Italian, English, Irish and German and none of them is translated in any other language.)
Status is in the map, not in the territory, siduri. The map of “snooty New-York poets” needn’t be our own map.
Yes but being aware of what signals one is sending out is helpful. Given that humans play status games it is helpful to be aware of how those games function so one doesn’t send signals out that cause people to pay less attention or create other barriers to communication.
Agreed, but it takes a high degree of luminosity to distinguish between tactical use of status to attain a specific objective, and getting emotionally involved and reactive to the signals of other (inducing this state of confusion is pretty much the function of status-signals for most humans, though).
Tactical = dress up, display “irrational confidence”, and play up your achievements to maximize attraction in potential romantic partners, or do well at a job interview.
Emotional-reactive = seeking, and worrying about, the approval of perceived social betters even though there is no logical reason.
Are you saying that always when a sentence is translated, its author must have high status or gains high status at the moment of translation, because the default attitude is to ignore anything originally uttered in foreign language?
If this is what you mean, I find it surprising. I have probably never been in a situation when someone was ignored because he spoke incomprehensible gibberish and that fact was more important than the content of his words. Of course, translation may be costly and people generally pay only for things they deem valuable, which is where the status comes into play. But it doesn’t mean that with low-status people it is more important that they speak gibberish than what they say.
(A thought experiment: A Gujarati speaking beggar approaches a rich English gentleman, says something and goes away. The Englishman’s wife, who is accompanying him at the moment, accidentally understands Gujarati. The man can recognise the language but doesn’t understand a word. What is the probability that he asks his wife “what did he say”? As a control group, imagine the same with an English beggar, this time the gentleman didn’t understand because when the beggar had spoken, a large truck had passed by. Is the probability of asking “what did he say” any different from the first group?)
Yes. More generally, the default attitude is to ignore anything uttered by a member of an outgroup. By calling attention to the fact that a sentence has been translated, one is calling attention to the fact that the author speaks a foreign language and thus to the author’s outgroup status. Omitting mention of a person’s outgroup status is a courtesy extended to those we wish to privilege above typical outgroup members.
Curiosity about what a low-status person says does not imply that one thinks the content of their words is a more important fact about them than their low status. With high probability, the most salient aspect of the beggar from the perspective of the Englishman is that he is a beggar (and, in the first case, a foreign beggar at that). Whatever the beggar said, if the Englishman finds out and deems it worthy of recounting later, I would be willing to bet that he will not omit mention of the fact that he heard it from a beggar.
Note Carl Shulman’s counterargument to the assumption of a normal prior here and the comments traded between Holden and Carl.
“If your prior was that charity cost-effectiveness levels were normally distributed, then no conceivable evidence could convince you that a charity could be 100x as good as the 90th percentile charity. The probability of systematic error or hoax would always be ludicrously larger than the chance of such an effective charity. One could not believe, even in hindsight, that paying for Norman Borlaug’s team to work on the Green Revolution, or administering smallpox vaccines (with all the knowledge of hindsight) actually did much more good than typical. The gains from resources like GiveWell would be small compared to acting like an index fund and distributing charitable dollars widely.”
The problem with this analysis is that it assumes that the prior should be given the same weight both ex ante and ex post. I might well decide to evenly weight my prior (intuitive) distribution showing a normal curve and my posterior (informed) distribution showing a huge peak for the Green Revolution, in which case I’d only think the Green Revolution was one of the best charitable options, and would accordingly give it moderate funding, rather than all available funding for all foreign aid. But, then, ten years later, with the benefit of hindsight, I now factor in a third distribution, showing the same huge peak for the Green Revolution. And, because the third distribution is based not on intuition or abstract predictive analysis but on actual past results—it’s entitled to much more weight. I might calculate a Bayesian update based on observing my intuition once, my analysis once, and the historical track record ten or twenty times. At that point, I would have no trouble believing that a charity was 100x as good as the 90th percentile. That’s an extraordinary claim, but the extraordinary evidence to support it is well at hand. By contrast, no amount of ex ante analysis would persuade me that your proposed favorite charity is 100x better than the current 90th percentile, and I have no problem with that level of cynicism. If your charity’s so damn good, run a pilot study and show me. Then I’ll believe you.
quick feedback or question.
In this part: Assume, too kindly, that your estimates are unbiased. And suppose you use this decision procedure many times, for many different decisions, and your estimates are unbiased.
the second time you mention the unbiased makes no sense to me and looks like a typo.
If X = Skill + Luck, with Skill and Luck both random variables, then selecting max(X) will get you something that has high Skill and high Luck.
If Estimate = TrueVal + Error, then max(Estimate) will have both high TrueVal and high Error.
This obvious insight has many applications, especially when the selection is done over a very large number of entities, e.g. trying to emulate the habits of billionaires in order to become rich.
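A toy version of this point (illustrative numbers only):

    import numpy as np

    rng = np.random.default_rng(2)
    skill = rng.normal(size=1_000_000)
    luck = rng.normal(size=1_000_000)
    best = np.argmax(skill + luck)
    # The top performer is typically several standard deviations above average
    # on both components, so copying only their habits (the skill part)
    # recovers only part of their outcome.
    print(skill[best], luck[best])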
Very interesting. I’m going to try my hand at a short summary:
Assume that you have a number of different options you can choose, that you want to estimate the value of each option and you have to make your best guess as to which option is most valuable. In step one, you generate individual estimates using whatever procedure you think is best. In step 2 you make the final decision, by choosing the option that had the highest estimate in step one.
The point is: even if you have unbiased procedures for creating the individual estimates in step one (ie procedures that are equally likely to overestimate as to underestimate) biases will still be introduced in step 2, when you’re looking at the list of all the different estimates. Specifically, the biases are that the highest estimate(s) are more likely to be overestimates, and the lowest estimate(s) are more likely to be underestimates.
Am I the only one that thinks that this is a silly definition of bias?
The technical definition, the one you’re using, is that an estimator is unbiased if, given the true value, the expected value of the estimate is equal to the true value. The one that I’d use is that, given the estimate, the expected value of the true value is equal to the estimate. The latter kind of bias is what you should be minimizing.
You should be using Bayesian methods to find these expected values, and they generally are biased, at least in the technical sense. You shouldn’t come up with an unbiased estimator and correct for it using Bayesian methods. You should use a biased estimator in the first place.
The technical definition is E[estimate - true value] where the true value is typically taken as a number and not a variable we have uncertainty about, but there’s nothing in this definition preventing the true value from being a random variable.
Yes, the technical definition is E[estimate—parameter], but “unbiased” has an implicit “for all parameter values”. You really can’t stick a random variable there and have the same meaning that statisticians use. (That said, I don’t see how DanielLC’s reformulation makes sense.)
It won’t have the same meaning, but nothing in the math prevents you from doing it and it might be more informative since it allows you to look at a single bias number instead of an uncountable set of biases (and Bayesian decision theory essentially does this). To be a little more explicit, the technical definition of bias is:
E[estimator|true value] - true value
And if we want to minimize bias, we try to do so over all possible values of the true values. But we can easily integrate over the space of the true value (assuming some prior over the true value) to achieve
E[ E[estimator|true value] - true value ] = E[ estimator - true value ]
This is similar to the Bayes risk of the estimator with respect to some prior distribution (the difference is that we don’t have a loss function here). By analogy, I might call this “Bayes bias.”
The only issue is that your estimator may be right on average, but that doesn’t mean it’s going to be anywhere close to the true value. Usually bias is used along with the variance of the estimator (since MSE(estimator) = Variance(estimator) + [Bias(estimator)]^2), but we could just modify our definition of Bayes bias to take the absolute value of the difference, so that we only have to look at one number (values closer to zero mean better estimators). Then we’re just calculating the Bayes risk with respect to some prior and absolute error loss, i.e.
E[ | estimator - true value | ]
(Which is NOT in general equivalent to | E[estimator - true value] | by Jensen’s inequality)