1. Intelligence ⇒ strong selection pressure ⇒ bad outcomes if the selection pressure is off target.
2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into “what if the agent tricks the evaluator”.
3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into “what if the values / shards are different from what we wanted”.
4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.
One thing that is not clear to me from your comment is what you make of Alex’s argument (as I see it) to the effect that “evaluation goals” are further away from “direct goals” than “direct goals” are from one another. If I run with this, it seems like an answer to your point 4 would be:
- with directly instilled goals, there will be some risk of discrepancy that can explode due to selection pressure;
- with evaluation-based goals, there is the same discrepancy as with directly instilled goals (because it’s hard to get your goal exactly right), plus an additional discrepancy between valuing “the evaluation of X” and valuing “X”.
I’m curious what you think of this claim, and whether it changes your take at all.
Sounds right. How does this answer my point 4? I guess maybe you see two discrepancies vs. one and conclude that two is worse than one? I don’t really buy that; it seems like it depends on the size of the discrepancies.
For example, if you imagine an AI that’s optimizing for my evaluation of good, I think the discrepancy between “Rohin’s directly instilled goals” and “Rohin’s CEV” is pretty small and I am pretty happy to ignore it. (Put another way, if that was the only source of misalignment risk, I’d conclude misalignment risk was small and move on to some other area.) So the only one that matters in this case of grader optimization is the discrepancy between “plans Rohin evaluates as good” and “Rohin’s directly instilled goals”.
I interpret Alex as arguing not just that there are two difficulties rather than one, but that there is an additional kind of difficulty. From this perspective, having two is worse than having one, because you have to address strictly more things.
This makes me wonder, though, whether there isn’t some sort of direction question underlying the debate here. If you assume the “difficulties” are non-negative numbers, and the difficulty of direct instillation is d_instillation while the difficulty of grader-optimization is d_instillation + d_evaluation, then there’s no debate that the latter is bigger than the former.
But if you allow directionality (even in one dimension), there is a risk that the sum leads to less difficulty in total (by having d_evaluation point in the opposite direction). That being said, these two difficulties seem strictly additive, in the sense that I don’t currently see how the difficulty of evaluation could partially cancel the difficulty of instillation.
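To make the scalar-versus-directional contrast concrete, here is a toy formalization (a sketch only, not anything stated in the discussion: it treats the two difficulties as vectors in some misalignment space, with total difficulty given by the norm of their sum):

```latex
% Toy sketch only: difficulties modeled as vectors, total difficulty = norm.

% Scalar (non-negative) picture: grader-optimization is never easier.
\[
  d_{\mathrm{instillation}} \;\le\; d_{\mathrm{instillation}} + d_{\mathrm{evaluation}},
  \qquad d_{\mathrm{evaluation}} \ge 0 .
\]

% Directional picture: partial cancellation is possible, but only if
% d_evaluation points sufficiently "against" d_instillation.
\[
  \lVert \vec d_{\mathrm{instillation}} + \vec d_{\mathrm{evaluation}} \rVert
  \;<\; \lVert \vec d_{\mathrm{instillation}} \rVert
  \quad\Longleftrightarrow\quad
  2\, \vec d_{\mathrm{instillation}} \cdot \vec d_{\mathrm{evaluation}}
  + \lVert \vec d_{\mathrm{evaluation}} \rVert^{2} \;<\; 0 .
\]
```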
Two responses:
1. Grader-optimization has the benefit that you don’t have to specify what values you care about in advance. This is a difficulty faced by value-executors but not by grader-optimizers.
2. Part of my point is that the machinery you need to solve evaluation-problems is also needed to solve instillation-problems, because fundamentally they are shadows of the same problem, so I’d estimate d_evaluation at close to 0 in your equations once you have dealt with d_instillation.
Having direct-values of “Rohin’s current values” and “Rohin’s CEV” both seem fine. There is, however, significant discrepancy between “grader-optimize the plans Rohin evaluates as good” and “directly-value Rohin’s current values.”
In particular, the first line seems to speculate that values-AGI is substantially more robust to differences in values. If so, I agree. You don’t need “perfect values” in an AGI (but probably they have to be pretty good; just not adversarially good). Whereas strongly-upwards-misspecifying the Rohin-grader on a single plan of the exponential-in-time planspace will (almost always) ruin the whole ballgame in the limit.
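A toy numerical sketch of the asymmetry being claimed here (a made-up construction, not from the discussion: both agents are crudely modeled as argmax over a finite list of plans, which is unfair to values-executors, but it shows why a single strongly-upward-misgraded plan is enough to dominate the search):

```python
# Toy illustration: why one strongly-upward-misspecified plan can ruin
# grader-optimization, while a values-executor only needs "pretty good" values.
import random

random.seed(0)

N_PLANS = 100_000                      # stand-in for a huge plan space
true_value = [random.gauss(0, 1) for _ in range(N_PLANS)]

# Grader: agrees with the true values everywhere except one plan, which it
# mistakenly scores as wonderful (the "adversarial input").
grader = list(true_value)
BAD_PLAN = 12_345
grader[BAD_PLAN] = 1_000.0

# Values-executor stand-in: its values are imperfect *everywhere* (small noise
# on every plan), but nowhere wildly misspecified.
imperfect_values = [v + random.gauss(0, 0.1) for v in true_value]

grader_choice = max(range(N_PLANS), key=lambda p: grader[p])
values_choice = max(range(N_PLANS), key=lambda p: imperfect_values[p])

print("grader-optimizer picks plan", grader_choice,
      "-> true value", round(true_value[grader_choice], 2))
print("values-executor picks plan", values_choice,
      "-> true value", round(true_value[values_choice], 2))
# Typically: the grader-optimizer lands exactly on the misgraded plan (roughly
# average true value), while the values-executor's pick is near the best plan.
```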
I understand you to have just said: “the first line seems to speculate that values-AGI is substantially more robust to differences in values”.
The thing that I believe is that an intelligent, reflective, careful agent with a decisive strategic advantage (DSA) will tend to produce outcomes similar in value to those that its CEV would produce. In particular, I believe this because the agent is “trying” to do what its CEV would do, it has the power to do what its CEV would do, and so it will likely succeed at this.
I don’t know what you mean by “values-AGI is more robust to differences in values”. What values are different in this hypothetical?
- I do think that values-AGI with a DSA is likely to produce outcomes similar to CEV-of-values-AGI.
- It is unclear whether values-AGI with a DSA is going to produce outcomes similar to CEV-of-Rohin (because this depends on how you built values-AGI and whether you successfully aligned it).