If our goal was only to get corrigible behavior, we could build agents that learn our (instrumental) preferences over behaviors and then respect those preferences. It doesn’t seem hard to learn that humans would prefer “be receptive to corrections” to “disassemble the human in order to figure out how they would have corrected you.”
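To make that concrete, here is a minimal sketch of what “learning preferences over behaviors” could look like, assuming we had pairwise comparison data and used a Bradley-Terry model; the behavior labels and setup are purely illustrative, not anyone’s actual proposal:

```python
# Minimal sketch (illustrative only): fit a Bradley-Terry model to pairwise
# comparisons between candidate behaviors, so that "accept corrections"
# ends up scored above "disassemble the human".
import math
import random

random.seed(0)

# Hypothetical pairwise judgments: (preferred behavior, dispreferred behavior)
comparisons = [
    ("accept_correction", "ignore_correction"),
    ("accept_correction", "disassemble_human"),
    ("ignore_correction", "disassemble_human"),
] * 20

scores = {"accept_correction": 0.0, "ignore_correction": 0.0, "disassemble_human": 0.0}
lr = 0.1

for _ in range(500):
    a, b = random.choice(comparisons)
    # Model: P(a preferred to b) = sigmoid(score[a] - score[b])
    p = 1.0 / (1.0 + math.exp(scores[b] - scores[a]))
    # Gradient ascent on the log-likelihood of the observed comparison
    scores[a] += lr * (1.0 - p)
    scores[b] -= lr * (1.0 - p)

print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

The point is only that recovering this ranking from a little comparison data is easy; the puzzle below is whether behavior like this can fall out of learning purely terminal preferences.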
It seems like the main puzzle is reconciling our intuitions about corrigibility with our desire to build systems which are only motivated by their users’ terminal preferences (rather than respecting particular instrumental preferences).
I share the intuitive sense that this ought to be possible, though after thinking about it more I’m very uncertain. If we can figure out how to make that work, I agree it would be very useful.
That said, I think it’s worth keeping in mind:
(1) it may not be possible to reconcile corrigibility with sharing-only-terminal-values.
(2) it may not be necessary to do so.
If I wanted to communicate the problem, I would focus on making the existing problem statement attractive/understandable to mainstream researchers rather than producing a precise formal statement, because (a) I don’t think that a formal statement would be compelling to people without significant additional expository work, (b) I think that producing a good formal statement probably involves mostly solving the problem, and (c) I think the basic problem you are outlining here, if properly presented, should already be understandable to mainstream researchers.
(The last paragraph applies very specifically to this problem, it is not intended to generalize to other problems, where precision may be a key part of getting other people to care about the problem.)
That said, trying to produce a formal statement may be the right way to attack the problem, and attacking the problem may be higher-priority than communicating about it (depending on how much you’ve already tried and how promising it seems; I’m definitely not at the stage where I would want to communicate rather than work on it). In that case, ignore the last three paragraphs.
Yes, I think that learning the user’s instrumental preferences is a good way to get corrigible behavior. I’m hoping to explore the idea of learning an ontology in which instrumental preferences can be represented. There seems to be a spectrum between learning a user’s terminal preferences and learning their actions, with learning instrumental preferences falling in between these.
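For concreteness, one way to picture that spectrum (a rough sketch; the function names and signatures are my own assumptions, not an existing framework): at one end we fit the user’s actions directly, at the other we infer a score over final outcomes, and instrumental-preference learning sits in between as a ranking over behaviors.

```python
# Rough sketch of the spectrum (all names and signatures are illustrative assumptions).
from typing import Callable, Dict, List, Tuple

State = str
Action = str
Outcome = str
Demo = List[Tuple[State, Action]]

def learn_actions(demos: Demo) -> Callable[[State], Action]:
    """Action end of the spectrum: imitate what the user did in each state."""
    table: Dict[State, Action] = {s: a for s, a in demos}
    return lambda s: table.get(s, "defer_to_user")

def learn_instrumental_preferences(demos: Demo) -> Callable[[State, Action, Action], Action]:
    """Middle of the spectrum: a ranking over behaviors, e.g. prefer actions
    the user was observed to endorse over unobserved alternatives."""
    endorsed = {(s, a) for s, a in demos}
    return lambda s, a1, a2: a1 if (s, a1) in endorsed else a2

def learn_terminal_values(demos: Demo) -> Callable[[Outcome], float]:
    """Terminal end of the spectrum: infer a score over final outcomes
    (placeholder here; real value learning would do IRL-style inference)."""
    return lambda outcome: 1.0 if outcome == "user_goal_achieved" else 0.0
```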
I’m planning on writing up some posts about models for goal-directed value learning. I like your suggestion of presenting the problem so it’s understandable to mainstream researchers; I’ll think about what to do about this after writing up the posts.