I’m trying to think whether or not this is substantively different from my post on corrigible alignment.
If the AGI’s final goal is to “do what the human wants me to do” (or whatever other variation you like), then it’s instrumentally useful for the AGI to create an outside-view-ish model of when its behavior does or doesn’t accord with the human’s desires.
Conversely, if you have some outside-view-ish model of how well you do what the human wants (or whatever) and that information guides your decisions, then it seems kinda implicit that you are acting in pursuit of a final goal of doing what the human wants.
So I guess my conclusion is that it’s fundamentally the same idea, and in this post you’re flagging one aspect of what successful corrigible alignment would look like. What do you think?
Yes, I agree.
One frame on the alignment problem is: what are human-discoverable low-complexity system designs which lead to the agent doing what I want? I wonder whether the outside view idea enables any good designs like that.
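To make the idea being discussed a bit more concrete, here is a minimal toy sketch of what "an outside-view-ish model of when my behavior accords with the human's desires guides my decisions" could look like. This is purely illustrative and not a proposal from either commenter; every name (OutsideViewAgent, defer_threshold, ask_human_first, etc.) is hypothetical, and the design choice of gating the inside-view approval estimate on a historical track record is just one simple way to cash out the intuition.

```python
# Toy illustration: an agent whose action selection is gated by an
# outside-view estimate of how often its own "the human will approve
# of this" predictions have matched the human's actual feedback.
# All names and numbers here are hypothetical.

from dataclasses import dataclass


@dataclass
class OutsideViewAgent:
    # Running track record of the agent's approval predictions
    # versus actual human feedback (the "outside view").
    predictions_made: int = 0
    predictions_correct: int = 0
    defer_threshold: float = 0.8  # below this reliability, ask instead of act

    def reliability(self) -> float:
        """Outside-view estimate: historical accuracy of the agent's own approval model."""
        if self.predictions_made == 0:
            return 0.0  # no track record yet, so treat the inside view as untrusted
        return self.predictions_correct / self.predictions_made

    def choose(self, candidate_actions: dict[str, float]) -> str:
        """candidate_actions maps action name -> inside-view predicted approval in [0, 1]."""
        best_action = max(candidate_actions, key=candidate_actions.get)
        # The outside view gates the inside view: with a poor track record,
        # the agent defers to the human rather than trusting its own model.
        if self.reliability() < self.defer_threshold:
            return "ask_human_first"
        return best_action

    def record_feedback(self, predicted_approved: bool, actually_approved: bool) -> None:
        """Update the outside-view track record from real human feedback."""
        self.predictions_made += 1
        if predicted_approved == actually_approved:
            self.predictions_correct += 1


agent = OutsideViewAgent()
print(agent.choose({"tidy_desk": 0.9, "reorganize_filesystem": 0.95}))  # -> ask_human_first
for _ in range(10):
    agent.record_feedback(predicted_approved=True, actually_approved=True)
print(agent.choose({"tidy_desk": 0.9, "reorganize_filesystem": 0.95}))  # -> reorganize_filesystem
```

Whether something in this low-complexity shape actually helps is exactly the open question in the comment above; the sketch only shows that "outside-view reliability gates trust in the inside view" is easy to state as a design, not that it is safe or sufficient.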