The first line seems to speculate that values-AGI is substantially more robust to differences in values.
The thing that I believe is that an intelligent, reflective, careful agent with a decisive strategic advantage (DSA) will tend to produce outcomes that are similar in value to those that would be produced by that agent’s CEV. In particular, I believe this because the agent is “trying” to do what its CEV would do and has the power to do so, and therefore will likely succeed.
I don’t know what you mean by “values-AGI is more robust to differences in values”. What values are different in this hypothetical?
I do think that values-AGI with a DSA is likely to produce outcomes similar to CEV-of-values-AGI.
It is unclear whether values-AGI with a DSA is going to produce outcomes similar to CEV-of-Rohin (because this depends on how you built values-AGI and whether you successfully aligned it).