The distinction between civilization’s goal and the goals of individual people is real, but that doesn’t make civilization’s goal unmoored. Rounding it down to some set of instrumental goals changes it, and that exposes you to Goodhart’s curse: if you take anything other than civilization’s actual terminal values as the optimization target, the outcome looks bad from the point of view of those actual terminal values.
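To make the Goodhart point concrete, here is a toy sketch (purely illustrative numbers, not anything from the post or the framework under discussion): each candidate outcome has a true value V standing in for the actual terminal values, while the optimizer only sees a proxy U, the true value plus independent error, standing in for a rounded-down optimization target. Selecting hard on the proxy systematically overestimates, and underdelivers on, the true value.

```python
import random

random.seed(0)

# V: the "actual terminal value" of each candidate outcome (illustrative).
true_values = [random.gauss(0, 1) for _ in range(10_000)]

# U: the proxy the optimizer actually targets -- the true value plus
# independent error, standing in for a rounded-down optimization target.
proxy_values = [v + random.gauss(0, 1) for v in true_values]

# Hard optimization on the proxy: pick whatever scores highest on U.
chosen = max(range(len(true_values)), key=lambda i: proxy_values[i])

print("proxy score of chosen outcome:", round(proxy_values[chosen], 2))
print("true value of chosen outcome:", round(true_values[chosen], 2))
print("best true value available:   ", round(max(true_values), 2))
```

The chosen outcome’s proxy score is inflated largely by the error term, so its true value falls short of both its own proxy score and the best outcome actually available; more optimization pressure and a noisier proxy only widen the gap. This is just the mild, regressional form of the curse, but it shows why optimizing for something other than the actual terminal values tends to look bad by those values.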
How do we change humanity’s hypothesized terminal goals by assigning humanity’s instrumental convergence goals to the AGI as its terminal goals?
Also, I’m trying to think of a Goodhart’s curse version of the Humanity’s Values framework, and can’t think of any obvious cases. I’m not saying it’s watertight and that the AGI can’t misinterpret the goals, but if we presuppose that we find the ideal implementation of these values as goals, and there is no misalignment, then … everything would be OK?
I think something similar to what you say can be rescued, in the form of the more important terminal values of civilization turning out to be generic, like math, rather than specific to the details of the people who seek to formulate their own values.
I read the links but don’t understand the terminal values you are pointing to. Could you paraphrase?