Directly valuing “Rohin’s current values” and directly valuing “Rohin’s CEV” both seem fine. There is, however, a significant discrepancy between “grader-optimize the plans Rohin evaluates as good” and “directly-value Rohin’s current values.”
In particular, the first line seems to speculate that values-AGI is substantially more robust to differences in values. If so, I agree. You don’t need “perfect values” in an AGI (but probably they have to be pretty good; just not adversarially good). Whereas strongly-upwards-misspecifying the Rohin-grader on a single plan of the exponential-in-time planspace will (almost always) ruin the whole ballgame in the limit.
the first line seems to speculate that values-AGI is substantially more robust to differences in values
The thing that I believe is that an intelligent, reflective, careful agent with a decisive strategic advantage (DSA) will tend to produce outcomes that are similar in value to those its CEV would produce. In particular, I believe this because the agent is “trying” to do what its CEV would do, it has the power to do what its CEV would do, and so it will likely succeed at this.
I don’t know what you mean by “values-AGI is more robust to differences in values”. What values are different in this hypothetical?
I do think that values-AGI with a DSA is likely to produce outcomes similar to CEV-of-values-AGI.
It is unclear whether values-AGI with a DSA is going to produce outcomes similar to CEV-of-Rohin (because this depends on how you built values-AGI and whether you successfully aligned it).
I understand you to have just said: