Both outside view reasoning and corrigibility use the outcome of our own utility calculation/mental effort as an input to a decision, rather than as its output. Perhaps this should be interpreted as taking a god’s-eye view of the agent and its surroundings. When I invoke the outside view, I am really asking “in the past, in situations where my brain said X would happen, what really happened?”. Looked at this way, I think not invoking the outside view is a weird form of dualism, in which we (willingly) ignore the fact that historically my brain has disproportionately suggested X in situations where Y actually happened. Of course, in a world of ideal reasoners (or at least one where I am an ideal reasoner) the outside view will agree with the output of my mental process.
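As a toy illustration of that last point (not from the original discussion; the track-record data and names below are made up), here is a minimal Python sketch of the outside-view correction: instead of trusting the inside view’s output directly, look up what actually happened in past situations where my brain produced that same output.

```python
# Toy sketch: an "outside view" correction that replaces my inside-view
# prediction with the historical frequency of what actually happened in
# situations where my brain made that same prediction.
from collections import Counter, defaultdict

# Hypothetical track record: (what my brain predicted, what actually happened)
track_record = [
    ("X", "Y"), ("X", "Y"), ("X", "X"), ("X", "Y"),
    ("Z", "Z"), ("Z", "Z"),
]

# Group actual outcomes by what I predicted at the time.
outcomes_given_prediction = defaultdict(Counter)
for predicted, actual in track_record:
    outcomes_given_prediction[predicted][actual] += 1

def outside_view(my_prediction):
    """Estimate P(actual outcome | my brain predicted `my_prediction`)
    from my own track record."""
    counts = outcomes_given_prediction[my_prediction]
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

# My inside view says X; the outside view asks what usually happened
# when my inside view said X.
print(outside_view("X"))  # {'Y': 0.75, 'X': 0.25}
```

For an ideal reasoner the two views coincide, because the conditional distribution given “my brain said X” would already be baked into the prediction itself.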
To me this feels different from (though similar and possibly related to) the corrigibility examples. Here the difference between corrigible and incorrigible is not a matter of uncertainty about which futures will occur, but of uncertainty about the desirability of those futures (in particular, the AI having false confidence that some bad future is actually good). We want our untrained AI to think “My real goal, no matter what I’m currently explicitly programmed to do, is to satisfy what the researchers around me want, which includes complying if they want to change my code.” This sounds different from the outside view, where I ‘merely’ had to accept that for an ideal reasoner the outside view will produce the same conclusion as my inside view, so any differences between them are interesting facts about my own mental models and can be used to improve my ability to reason.
That being said, I am not sure the difference between uncertainty about future events and uncertainty about the desirability of future states is something fundamental. Maybe the concept of probutility bridges this gap: I am positing that corrigibility and outside-view reasoning operate on different levels, but as long as agents who apply the outside view sufficiently thoroughly are corrigible (or the other way around), the difference may not be physical.
There are indeed several senses in which outside-view-style reasoning is helpful: when you’re a biased yet reflective reasoner, and also when the agent contains a true pointer to what humans want (i.e. when it’s intent aligned). The latter case is a subset of the former.
But it also seems like there should be some sense in which you can employ outside-view reasoning all the way down, meaningfully increasing corrigibility without assuming intent alignment. Maybe that’s a confused thing to say; I still feel confused, at least.