It seems to me that your comment amounts to saying “It’s impossible to always make optimal choices for everything, because we don’t have perfect information and perfect analysis,” which is true but unrelated to the optimizer’s curse (and, I would say, not in itself problematic for AGI safety). I’m sure that’s not what you meant, but here’s why it comes across that way to me. You seem to be setting T = E(C_max). If you set T = E(C_max) by definition, then imperfect information or imperfect analysis implies that you will always miss T by the error e, and the error will always be in the unfavorable direction.
But I don’t think about targets that way. I would set my target to be something that can in principle be exceeded (T = have almost as much fun as is physically possible). Then when we evaluate the choices C, we’ll find some whose estimated value dramatically exceeds T (i.e. they look like way more fun than is physically possible, because we estimated their consequences wrong), and if we pick one of those, we’ll still have a good chance of at least slightly exceeding T despite the optimizer’s curse.
Lack of access to perfect information is highly relevant, because it’s exactly why we can’t get around the curse. If we had perfect information, we could correct for it as a systematic bias using Bayesian methods and be done with it. It’s also why the curse shows up in the first place: if we could establish a measure E that accurately reported how well a choice satisfied T, it wouldn’t happen, because there would be no error in the measurement.
What you are proposing about allowing targets to be exceeded is simply a form of milder optimization, and the optimizer’s curse still happens whenever there is preferential choice at all.
I don’t think it’s related to mild optimization. Pick a target T that can be exceeded (a wonderful future, even if it’s not the absolute theoretically best possible future). Estimate which choice C_max is, as far as we can tell, the very best by that metric. We expect C_max to give value E, and the realized value turns out to be V < E, but V is still likely to exceed T, or at least more likely to than for any other choice. (Insofar as that’s not true, it’s Goodhart.) The optimizer’s curse, i.e. V < E, does not seem to be a problem, or even relevant, because I don’t ultimately care about E. Maybe the AI doesn’t even tell me what E is. Maybe the AI doesn’t even bother guessing what E is; it only calculates that C_max seems to be better than any other choice.
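To make that concrete, here’s a toy Monte Carlo sketch of the claim (a minimal sketch in Python; the Gaussian setup, the noise level, and the particular target T are just illustrative assumptions of mine): pick C_max by noisy estimates, then check both how often the realized value V clears a modest target T and how often V falls short of the estimate E.

```python
import random

random.seed(0)

N_TRIALS = 20_000   # number of independent decision problems
N_CHOICES = 10      # candidate choices per decision
NOISE = 1.0         # std dev of the zero-mean estimation error
T = 0.5             # a target that can genuinely be exceeded (well below the typical true best)

cleared_T = cleared_T_random = disappointed = 0
for _ in range(N_TRIALS):
    true_vals = [random.gauss(0, 1) for _ in range(N_CHOICES)]
    estimates = [v + random.gauss(0, NOISE) for v in true_vals]
    best = max(range(N_CHOICES), key=lambda i: estimates[i])        # C_max: best by the noisy metric
    cleared_T += true_vals[best] > T                                 # did the realized value V exceed T?
    cleared_T_random += true_vals[random.randrange(N_CHOICES)] > T   # baseline: a randomly chosen C
    disappointed += true_vals[best] < estimates[best]                # the curse itself: V < E

print("P(V > T) when picking C_max :", cleared_T / N_TRIALS)
print("P(V > T) for a random choice:", cleared_T_random / N_TRIALS)
print("P(V < E) for C_max          :", disappointed / N_TRIALS)
```

In this toy setup the estimated-best choice clears T far more often than a random choice does, even though it is also usually disappointed relative to its own estimate, which is my point: the curse hits E, not the decision.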
Hmm, maybe you are misunderstanding how the optimizer’s curse works? It’s powered by selecting on a measure that has error: because we pick the action whose measured value is highest, we preferentially pick actions whose measurement error happens to be positive, so the measure of the chosen action is on average higher than its true value. You are mistaken, then, not to care about E, because E is the only reliable and comparable way you have to check whether C satisfies T (if there’s another one that’s reliable and comparable, then use it instead). Assuming you picked the “best” E (another chance for Goodhart’s curse to bite you), it’s literally the only option for picking the C_max that seems better, unless you want very high quantilization such that, say, you only act when things appear orders of magnitude better, with error bounds small enough that you will only be wrong once in trillions of years.
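For what it’s worth, the mechanism is easy to see in a quick simulation (a minimal sketch; the Gaussian setup and noise level are just illustrative assumptions, nothing specific to your proposal): every individual measurement is unbiased, yet the measured value of the selected action overstates its true value on average.

```python
import random

random.seed(1)

N_TRIALS = 20_000
N_CHOICES = 10
NOISE = 1.0   # zero-mean error on each measurement, so no individual estimate is biased

bias_selected = bias_random = 0.0
for _ in range(N_TRIALS):
    true_vals = [random.gauss(0, 1) for _ in range(N_CHOICES)]
    estimates = [v + random.gauss(0, NOISE) for v in true_vals]
    best = max(range(N_CHOICES), key=lambda i: estimates[i])   # select on the noisy measure
    bias_selected += estimates[best] - true_vals[best]          # error of the action we selected
    rand = random.randrange(N_CHOICES)
    bias_random += estimates[rand] - true_vals[rand]            # error of an arbitrary action

print("mean error of the selected action:", bias_selected / N_TRIALS)   # clearly positive
print("mean error of a random action    :", bias_random / N_TRIALS)     # roughly zero
```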
I do think I understand that. I see E as a means to an end. It’s a way to rank-order choices and thus make good choices. If I apply an order-preserving affine transformation to E (e.g. I’m way too optimistic about absolutely everything in a completely uniform way), then I still make the same choice, and the choice is what matters. I just want my AGI to do the right thing.
Here, I’ll try to put what I’m thinking more starkly. Let’s say I somehow design a comparative AGI. This is a system which can take a merit function U and two choices C_A and C_B, and predict which of the two choices would be better according to U, but it has no idea how good either of those two choices actually is on any absolute scale. It doesn’t know whether C_A is wonderful while C_B is even better, or whether C_A is awful while C_B is merely so-so; in both cases it just returns the same answer, “C_B is better”. Assume it’s not omniscient, so its comparisons are not always correct, but that it’s still impressively superintelligent.
A comparative AGI does not suffer the optimizer’s curse, right? It never forms any beliefs about how good its choices will turn out, so it couldn’t possibly be systematically disappointed. There’s always noise and uncertainty, so there will be times when its second-highest-ranked choice would actually turn out better than its highest-ranked choice. But that happens less often than chance would predict. There’s no systematic problem: in expectation, the best thing to do (as measured by U) is always to take its top-ranked choice.
Now, it seems to me that, if I go to the AGIs-R-Us store and see a normal AGI and a comparative AGI side-by-side on the shelf, I would have no strong opinion about which one of them I should buy. If I ask either one to do something, they’ll take the same sequence of actions in the same order and get the same result. They’ll invest my money in the same stocks, offer me the same advice, and so on. In particular, I would worry about Goodhart’s law (i.e. giving my AGI the wrong function U) with either of these AGIs to the exact same extent and for the exact same reason... even though one is subject to the optimizer’s curse and the other isn’t.
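Here’s a minimal sketch of that thought experiment (the Gaussian noise model and the names evaluate_choices, scoring_agent, and comparative_agent are all hypothetical, just for illustration): the comparative agent only ever exposes pairwise “which is better?” answers, yet it lands on the same action as the agent that reports scores, so it has nothing to be disappointed about while being exposed to a wrong U in exactly the same way.

```python
import random

random.seed(0)

N_CHOICES = 10
NOISE = 1.0

def evaluate_choices(true_vals):
    """One noisy internal evaluation of the merit function U for every choice."""
    return [v + random.gauss(0, NOISE) for v in true_vals]

def scoring_agent(estimates):
    """'Normal' AGI: returns its pick together with the estimated value of that pick."""
    best = max(range(len(estimates)), key=lambda i: estimates[i])
    return best, estimates[best]

def comparative_agent(estimates):
    """'Comparative' AGI: the only primitive it exposes is 'is C_B better than C_A?'."""
    best = 0
    for challenger in range(1, len(estimates)):
        if estimates[challenger] > estimates[best]:   # pairwise comparison, no score reported
            best = challenger
    return best   # no attached estimate, so nothing to be systematically disappointed about

true_vals = [random.gauss(0, 1) for _ in range(N_CHOICES)]   # what the choices would really deliver
estimates = evaluate_choices(true_vals)                      # shared noisy internal evaluations

pick_a, reported_E = scoring_agent(estimates)
pick_b = comparative_agent(estimates)

print("same action chosen:", pick_a == pick_b)                         # True: same ordering, same choice
print("scoring agent's estimate vs reality:", reported_E, true_vals[pick_a])
```

Since the two agents induce the same ordering over choices, they act identically; the only difference is that one of them also publishes a number that will, on average, overstate the outcome.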
Right, if you don’t have a measure you can’t have Goodhart’s curse on technical grounds, but I’m also pretty sure something like it is still there. As far as I know, no one has tried to show that something like the optimizer’s curse continues to function when you only have an ordering and not a measure. I think it does, I think others think it does, and this is part of the generalization to Goodharting, but I don’t know that a formal proof demonstrating it has been produced, even though I strongly suspect it’s true.
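Not a proof, but here’s a quick Monte Carlo hint at what the ordering-only analogue might look like (entirely a toy setup of my own): even when the selector only ever sees noisy pairwise comparisons and no scores exist anywhere, the winner of the tournament still systematically falls short of the genuinely best choice, so the selection pressure does its damage without any reported measure to be disappointed in.

```python
import random

random.seed(2)

N_TRIALS = 20_000
N_CHOICES = 10
NOISE = 1.0   # noise on each individual pairwise comparison; no scores are ever exposed

shortfall = 0.0
for _ in range(N_TRIALS):
    true_vals = [random.gauss(0, 1) for _ in range(N_CHOICES)]
    best = 0
    for challenger in range(1, N_CHOICES):
        # Each comparison is an independent noisy judgement of "which of these two is better?"
        if true_vals[challenger] + random.gauss(0, NOISE) > true_vals[best] + random.gauss(0, NOISE):
            best = challenger
    shortfall += max(true_vals) - true_vals[best]   # regret relative to the genuinely best choice

print("mean shortfall of the tournament winner vs the true best:", shortfall / N_TRIALS)
```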