One factor which is important to consider is how likely a goal or a value is to persist through self-improvements (those self-improvements might end up being quite radical, and also fairly rapid).
An arbitrary goal or value is unlikely to persist (this is why the “classical formulation of the alignment problem” is so difficult; the difficulties come from many directions, but the most intractable one is how to ensure that the desired properties are preserved through radical self-modifications). That’s the main obstacle to asking AIs to research and implement this on their own as they get smarter and smarter. The question is always: “why would AIs keep caring about this?”
But there might be “natural properties” (“natural” values and goals) which AIs might want to preserve for their own reasons (because they might be interested in the world around them not being utterly destroyed, because they might be interested in existing in a reasonably comfortable and safe society, and so on). With such “natural properties” it might be easier to delegate to AIs the task of researching, implementing, and maintaining them, because AIs might have intrinsic reasons to keep caring even through drastic changes.
And then, of course, the question is: can one formulate such “natural properties” so that a reasonable level of AI safety for humans would be a corollary of them?
But this is why “alignment” might be less than optimal terminology (because it tends to focus our attention on the human-oriented properties and values, which are unlikely to be invariant under recursive self-improvement on their own, although they can be corollaries of properties which might be feasible to keep invariant).
Of course, there can be different approaches to finding those “natural properties” and making sure they hold through self-improvements; the paper I linked is just one of many possible approaches.
Thanks for including the link in your edit.