So when they interact there are additional variables you’re leaving out.
There’s a target T that’s the real thing you want. Then there’s a function E that measures how much you expect a choice C to achieve T. For example, maybe T is “have fun” and E is “how fun C looks”. Then given a set of choices C_1, C_2, … you choose C_max such that E(C_max) >= E(C_i) for all C_i (in normal terms, C_max = argmax E(C_i)). Unfortunately T is hidden, so you can only check whether C satisfies T via E. (Well, this is not exactly true, because otherwise we might have a hard time knowing the optimizer’s curse exists; but the curse would hold even if it were, we just might not be able to notice it, and regardless we can’t use whatever that side channel is to assess T as a measure, so we can’t optimize on it.)
Now, since we don’t have perfect information, there is some error e_i associated with each estimate, so the true extent to which any C_i satisfies T is E(C_i) + e_i. But we picked C_max in the presence of this error: we selected the argmax of the noisy estimates, not of the true values, so C_max may not be the true max, and the selection is biased toward choices whose error happened to fall in the optimistic direction. As you say, so what, maybe that means we just don’t pick the best but we pick something good. But recall that our purpose was T, not max E(C_i), so over repeated choices we will consistently, due to the optimizer’s curse, pick C_max such that max E(C_i) < T (noting that’s a type error as notated, but I think it’s intuitive what is meant). Thus the error will compound over repeated choices, since each subsequent C is conditioned on the previous ones, such that it becomes certain that E(C_max) < T and never E(C_max) = T.
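Here’s a minimal simulation sketch of that selection bias (the function name and the independent, zero-mean Gaussian noise on E are my own illustrative assumptions):

```python
import random

def average_disappointment(n_choices=10, noise_sd=1.0, trials=10_000):
    """How much does E(C_max) overstate the true value of the chosen option?"""
    total_gap = 0.0
    for _ in range(trials):
        # Hidden true extents to which each C_i satisfies T.
        true_values = [random.gauss(0.0, 1.0) for _ in range(n_choices)]
        # Noisy estimates E(C_i) = true value + zero-mean error e_i.
        estimates = [v + random.gauss(0.0, noise_sd) for v in true_values]
        # Select C_max = argmax E(C_i).
        i_max = max(range(n_choices), key=lambda i: estimates[i])
        total_gap += estimates[i_max] - true_values[i_max]
    return total_gap / trials

# Each individual error is unbiased, yet the gap is reliably positive:
# the estimate of the *selected* choice systematically overstates its true value.
print(average_disappointment())
```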
This might seem minor if we had only a single dimension to worry about, like “had slightly less than maximum fun”, even if it did, say, result in astronomical waste. But we are normally optimizing over multiple dimensions, and each choice may fail in different ways along those different dimensions. The result is that over time we shrink the efficiency frontier (though it might reach a limit and not get worse) and end up with worse solutions than were possible, perhaps even solutions bad enough that we don’t want them. After all, nothing stops the error from getting so large, or the frontier from shrinking so much, that we would be worse off than if we had never started.
It seems to me that your comment amounts to saying “It’s impossible to always make optimal choices for everything, because we don’t have perfect information and perfect analysis,” which is true but unrelated to the optimizer’s curse (and, I would say, not in itself problematic for AGI safety). I’m sure that’s not what you meant, but here’s why it comes across that way to me: you seem to be setting T = E(C_max). If you set T = E(C_max) by definition, then imperfect information or imperfect analysis implies that you will always miss T by the error e, and the error will always be in the unfavorable direction.
But I don’t think about targets that way. I would set my target to be something that can in principle be exceeded (T = have almost as much fun as is physically possible). Then when we evaluate the choices C, we’ll find some whose estimates dramatically exceed T (i.e. way more fun than is physically possible, because we estimated the consequences wrong), and if we pick one of those, we’ll still have a good chance of slightly exceeding T despite the optimizer’s curse.
Lack of access to perfect information is highly relevant, because it’s exactly why we can’t get around the curse. If we had perfect information we could correct for it as a systematic bias using Bayesian methods and be done with it. It’s also why the curse shows up in the first place: if we could establish a measure E that accurately reported how much a choice satisfied T, it wouldn’t happen, because there would be no error in the measurement.
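For what that Bayesian correction would look like, here’s a minimal sketch under the (strong) assumption that we know both the prior over true values and the noise model exactly; the function name and the Gaussian choices are mine:

```python
def posterior_mean(estimate, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
    """Shrink a noisy estimate E(C_i) toward the prior mean.

    Assumes true values ~ N(prior_mean, prior_var) and
    estimate = true value + N(0, noise_var) noise, i.e. the
    "perfect information about the error structure" case.
    """
    weight = prior_var / (prior_var + noise_var)
    return prior_mean + weight * (estimate - prior_mean)

# Ranking on these shrunken estimates instead of the raw ones removes the
# systematic gap between what we expect of C_max and what we get -- but only
# because we assumed the prior and the noise are known exactly.
```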
What you are proposing about allowing targets to be exceeded is simply allowing for more mild optimization, and the optimizer’s curse still happens if there is preferential choice at all.
I don’t think it’s related to mild optimization. Pick a target T that can be exceeded (a wonderful future, even if it’s not the absolute theoretically best possible future). Estimate which choice C_max is (as far as we can tell) the #1 very best by that metric. We expect C_max to give value E, and it turns out to give V < E, but V is still likely to exceed T, or at least likelier to than any other choice. (Insofar as that’s not true, it’s Goodhart.) The optimizer’s curse, i.e. V < E, does not seem to be a problem, or even relevant, because I don’t ultimately care about E. Maybe the AI doesn’t even tell me what E is. Maybe the AI doesn’t even bother guessing what E is; it only calculates that C_max seems to be better than any other choice.
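Here’s a small simulation sketch of what I mean (the numbers and the independent Gaussian noise are arbitrary illustrative assumptions): the realized value of the selected choice falls short of its estimate on average, yet it still clears a fixed target T far more reliably than an unoptimized choice would.

```python
import random

def fraction_clearing_target(target=1.0, n_choices=10, noise_sd=1.0, trials=10_000):
    """How often does the true value of C_max = argmax E(C_i) exceed a fixed target T?"""
    hits = 0
    for _ in range(trials):
        true_values = [random.gauss(0.0, 1.0) for _ in range(n_choices)]
        estimates = [v + random.gauss(0.0, noise_sd) for v in true_values]
        i_max = max(range(n_choices), key=lambda i: estimates[i])
        hits += true_values[i_max] > target
    return hits / trials

# Despite the optimizer's curse, the selected choice clears this T more often
# than not, and far more often than a randomly picked choice would.
print(fraction_clearing_target())
```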
Hmm, maybe you are misunderstanding how the optimizer’s curse works? It’s powered by selecting on a measure that contains error: we pick the action whose measured value is highest, and that selection favors actions whose measurement happened to err on the high side, so the measure of the chosen action is on average higher than its true value. You are mistaken, then, to not care about E, because E is the only reliable and comparable way you have to check whether C satisfies T (if there’s another one that’s reliable and comparable, then use it instead). It’s literally the only option you have for picking the C_max that seems better, assuming you picked the “best” E (another chance for Goodhart’s curse to bite you), unless you want very high quantilization such that, say, you only act when things appear orders of magnitude better, with error bounds small enough that you will only be wrong once in trillions of years.
I do think I understand that. I see E as a means to an end. It’s a way to rank-order choices and thus make good choices. If I apply an affine transformation to E, e.g. I’m way too optimistic about absolutely everything in a completely uniform way, then I still make the same choice, and the choice is what matters. I just want my AGI to do the right thing.
Here, I’ll try to put what I’m thinking more starkly. Let’s say I somehow design a comparative AGI. This is a system which can take a merit function U and two choices C_A and C_B, and predict which of the two choices would be better according to U, but it has no idea how good either of those two choices actually is on any absolute scale. It doesn’t know whether C_A is wonderful while C_B is even better, or whether C_A is awful while C_B is merely so-so; in both cases it just returns the same answer, “C_B is better”. Assume it’s not omniscient, so its comparisons are not always correct, but that it’s still impressively superintelligent.
A comparative AGI does not suffer the optimizer’s curse, right? It never forms any beliefs about how good its choices will turn out, so it couldn’t possibly be systematically disappointed. There’s always noise and uncertainty, so there will be times when its second-highest-ranked choice would actually turn out better than its highest-ranked choice. But that happens less often than not. There’s no systematic problem: in expectation, the best thing to do (as measured by U) is always to take its top-ranked choice.
Now, it seems to me that if I go to the AGIs-R-Us store and see a normal AGI and a comparative AGI side by side on the shelf, I would have no strong opinion about which one of them I should buy. If I ask either one to do something, they’ll take the same sequence of actions in the same order and get the same result. They’ll invest my money in the same stocks, offer me the same advice, etc. In particular, I would worry about Goodhart’s law (i.e. giving my AGI the wrong function U) with either of these AGIs to the exact same extent and for the exact same reason... even though one is subject to the optimizer’s curse and the other isn’t.
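One way to see the equivalence, as a toy sketch (assuming, as a modeling choice, that both systems are built on the same imperfect, noisy judgments; all names below are made up):

```python
import random
from functools import cmp_to_key

# Shared, hidden, noisy judgments -- the assumption is that both systems rest
# on the same imperfect model of the world.
choices = [f"C_{i}" for i in range(10)]
noisy_score = {c: random.gauss(0.0, 1.0) for c in choices}

def pick_with_estimates(cs):
    """'Normal' AGI: forms an explicit estimate for every choice and takes the argmax."""
    return max(cs, key=lambda c: noisy_score[c])

def pick_with_comparisons(cs):
    """'Comparative' AGI: only ever answers pairwise 'which is better?' queries,
    never reporting how good anything is on an absolute scale."""
    def better(a, b):
        return -1 if noisy_score[a] < noisy_score[b] else 1
    return sorted(cs, key=cmp_to_key(better))[-1]

# Both pick the same choice: the absolute numbers never change which action is taken.
assert pick_with_estimates(choices) == pick_with_comparisons(choices)
```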
Right: if you don’t have a measure, you can’t have Goodhart’s curse on technical grounds. But I’m also pretty sure something like it is still there; as far as I know, no one has tried to show that something like the optimizer’s curse continues to operate when you only have an ordering and not a measure. I think it does, and I think others think it does, and this is part of the generalization to Goodharting, but I don’t know that a formal proof demonstrating it has been produced, even though I strongly suspect it’s true.
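As a toy illustration of the kind of effect I suspect is there (a sketch, not a proof, and the noise model is an arbitrary assumption): select from noisy pairwise comparisons only, with no scores at all, and the winner’s true quality still systematically falls short of the true best.

```python
import random

def ordering_only_regret(n_choices=10, flip_prob=0.2, trials=10_000):
    """Select by noisy pairwise comparisons alone (no measure reported) and
    track how far the winner's true quality falls short of the true best."""
    total_regret = 0.0
    for _ in range(trials):
        true_values = [random.gauss(0.0, 1.0) for _ in range(n_choices)]
        winner = 0
        for challenger in range(1, n_choices):
            # The comparator answers "which is better?" but errs with some probability.
            truth = true_values[challenger] > true_values[winner]
            answer = truth if random.random() > flip_prob else not truth
            if answer:
                winner = challenger
        total_regret += max(true_values) - true_values[winner]
    return total_regret / trials

# The average shortfall is reliably positive: noisy preferential choice with only
# an ordering still delivers systematically less than the true best.
print(ordering_only_regret())
```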