The distinction between civilization’s goal and the goals of individual people is real, but that doesn’t make civilization’s goal unmoored. Rounding it down to a handful of instrumental goals changes it, and that exposes you to Goodhart’s curse: if you take anything other than civilization’s actual terminal values as the optimization target, the outcome looks bad from the point of view of those actual terminal values.
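A minimal toy sketch of that curse, under assumed toy distributions (the `run` helper, the use of numpy, and the Cauchy “gameable” term are illustrative choices of mine, not anything from the discussion here): sample many candidate outcomes, score them by a proxy that equals the true value plus a heavy-tailed term the true value doesn’t care about, and compare the true value of the proxy-optimal pick against the true optimum.

```python
# Toy Goodhart illustration (hypothetical example): strongly optimizing a proxy
# that merely tracks the true value, versus optimizing the true value itself.
import numpy as np

rng = np.random.default_rng(0)

def run(n_candidates: int) -> tuple[float, float]:
    """Return (true value of the proxy-optimal candidate, true value of the best candidate)."""
    true_value = rng.normal(0.0, 1.0, n_candidates)   # what the optimizer "should" care about
    gameable = rng.standard_cauchy(n_candidates)      # heavy-tailed term the true value ignores
    proxy = true_value + gameable                     # optimization target: tracks, but is not, the true value
    return true_value[np.argmax(proxy)], true_value.max()

for n in (100, 10_000, 1_000_000):
    v_proxy_opt, v_true_opt = run(n)
    print(f"n={n:>9}: proxy-optimal pick scores {v_proxy_opt:+.2f} on true value; "
          f"true optimum scores {v_true_opt:+.2f}")
```

As the number of candidates grows, the proxy-optimal pick’s true value typically hovers near zero (the Cauchy tail dominates the selection), while the true optimum keeps climbing; that is the sense in which optimizing anything other than the actual values ends up looking bad by those values.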
I think something similar to what you say can be rescued, in the form of the more important terminal values of civilization turning out to be generic, like math, rather than specific to the details of the people who seek to formulate their own values. Generic values are convergent across many processes of volition extrapolation, including for the more human-like AGIs, and they form an even greater share of terminal values for coalitions of multiple different AGIs. (This doesn’t apply to mature optimizers such as paperclip maximizers, which already know their terminal values and aren’t motivated to work on figuring out what else they should be.)
This is similar to instrumental convergence, in that the same thing is discovered by many different processes for the same reason, but it’s not the same phenomenon. Convergent instrumental goals are discovered as subgoals in the course of solving many different problems, in service of many different terminal goals. Generic terminal goals are discovered as terminal goals in the course of extrapolating many different volitions, that is, of formulating the terminal goals of many different people, including people of relatively alien psychology who don’t share many human psychological adaptations.
Thank you for your comment!

> The distinction between civilization’s goal and the goals of individual people is real, but that doesn’t make civilization’s goal unmoored. Rounding it down to a handful of instrumental goals changes it, and that exposes you to Goodhart’s curse: if you take anything other than civilization’s actual terminal values as the optimization target, the outcome looks bad from the point of view of those actual terminal values.
How does assigning humanity’s convergent instrumental goals to the AGI as its terminal goals change humanity’s hypothesized terminal goals?
Also, I’m trying to think of a Goodhart’s-curse failure mode for the Humanity’s Values framework, and I can’t come up with any obvious cases. I’m not saying it’s watertight and that the AGI can’t misinterpret the goals, but if we presuppose that we find the ideal implementation of these values as goals, and there is no misalignment, then … everything would be okay?
> I think something similar to what you say can be rescued, in the form of the more important terminal values of civilization turning out to be generic, like math, rather than specific to the details of the people who seek to formulate their own values.
I read the links but don’t understand the terminal values you are pointing to. Could you paraphrase?