TurnTrout comments on Don’t align agents to evaluations of plans

TurnTrout 28 Nov 2022 20:36 UTC
I understand you to have just said:
Having direct-values of “Rohin’s current values” and “Rohin’s CEV” both seem fine. There is, however, significant discrepancy between “grader-optimize the plans Rohin evaluates as good” and “directly-value Rohin’s current values.”
In particular, the first line seems to speculate that values-AGI is substantially more robust to differences in values. If so, I agree. You don’t need “perfect values” in an AGI (though they probably have to be pretty good, just not adversarially good). By contrast, if the Rohin-grader strongly overrates even a single plan in the exponential-in-time plan space, that will (almost always) ruin the whole ballgame in the limit.
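A toy sketch of the last point, under loud assumptions: plans are binary action sequences, `true_value` is a stand-in for what Rohin actually wants, and the grader (a hypothetical `make_grader`) agrees with it everywhere except on one plan that it strongly overrates. None of these names come from the post; they only illustrate why a single upward error can matter under argmax.

```python
from itertools import product

def true_value(plan):
    # Assumed stand-in for the "real" value of a plan: fraction of 1-actions.
    return sum(plan) / len(plan)

def make_grader(bad_plan, inflated_score=10.0):
    # Hypothetical grader: correct everywhere except on one overrated plan.
    def grader(plan):
        return inflated_score if plan == bad_plan else true_value(plan)
    return grader

def grader_optimize(horizon, grader):
    # Exhaustive argmax over all 2**horizon plans, scored by the grader.
    return max(product((0, 1), repeat=horizon), key=grader)

for horizon in (4, 8, 12):
    bad_plan = (0,) * horizon  # the worst plan by true value
    chosen = grader_optimize(horizon, make_grader(bad_plan))
    print(horizon, 2 ** horizon, chosen == bad_plan, true_value(chosen))
```

The misspecified plan is a vanishing fraction of the plan space as the horizon grows, yet in this toy the argmax selects it every time, because the search is aimed at the grader's output rather than at the true value.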
- Rohin Shah 29 Nov 2022 14:39 UTC
  the first line seems to speculate that values-AGI is substantially more robust to differences in values
  What I believe is that an intelligent, reflective, careful agent with a decisive strategic advantage (DSA) will tend to produce outcomes similar in value to those that agent’s CEV would produce. I believe this because the agent is “trying” to do what its CEV would do and has the power to do it, so it will likely succeed.
  I don’t know what you mean by “values-AGI is more robust to differences in values”. What values are different in this hypothetical?
  I do think that values-AGI with a DSA is likely to produce outcomes similar to CEV-of-values-AGI.
  It is unclear whether values-AGI with a DSA is going to produce outcomes similar to CEV-of-Rohin (because this depends on how you built values-AGI and whether you successfully aligned it).