if you have some outside-view-ish model of how well you do what the human wants (or whatever), and that information guides your decisions, then it seems kinda implicit that you're acting in pursuit of a final goal of doing what the human wants.
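To make that concrete, here's a toy sketch (my own illustration, with hypothetical names; not anything you actually proposed) of an agent whose action choice is weighted by an outside-view estimate of "does this do what the human wants?" — once that estimate guides decisions, the agent is in effect maximizing it, i.e. "do what the human wants" has become the de facto final goal:

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")

def choose_action(
    candidate_actions: Iterable[Action],
    base_score: Callable[[Action], float],               # whatever the agent "natively" values
    outside_view_alignment: Callable[[Action], float],   # outside-view estimate: "does this do what the human wants?"
) -> Action:
    # The outside-view estimate "guides decisions" by weighting the base score.
    # To the extent that weighting dominates, this is just an argmax over
    # outside_view_alignment -- doing what the human wants is the effective
    # terminal objective, whatever base_score was supposed to be.
    return max(candidate_actions, key=lambda a: base_score(a) * outside_view_alignment(a))
```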
One frame on the alignment problem is: what are human-discoverable low-complexity system designs which lead to the agent doing what I want? I wonder whether the outside view idea enables any good designs like that.
yes, I agree.