It seems to me like even if you could make the agent indifferent to the shutdown button, there would still be other ways for it to influence you into giving it higher reward.
Yes, this is true: fundamentally, the job of the button-corrigible agent is still to search over all available actions and find those that best maximise the reward computed by its utility function. As mentioned in the paper, one such action might be to start a hobby club, where humans come together to have fun and build petrol engines that will go into petrol cars. Starting a hobby club is definitely an action that influences humans, but not in a bad way.
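To make that "search over all actions" point concrete, here is a minimal sketch of my own (not taken from the paper; the action names and scores are made up): the agent simply picks whichever action its utility function scores highest, and nothing in this loop by itself distinguishes benign influence from manipulative influence.

```python
# Minimal sketch (my own illustration): the core loop of a utility-maximising agent.
# `actions` and `scores` are hypothetical stand-ins for the agent's real action
# space and utility function.

def select_action(candidate_actions, utility):
    """Return the candidate action with the highest utility score."""
    return max(candidate_actions, key=utility)

actions = ["start_hobby_club", "lobby_regulators", "do_nothing"]
scores = {"start_hobby_club": 0.9, "lobby_regulators": 0.95, "do_nothing": 0.1}

# The agent picks whatever scores highest, benign or not.
print(select_action(actions, scores.get))  # -> "lobby_regulators"
```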
Button-corrigibility is intended to support a process where harmful influencing behaviour can be corrected once it is identified, without the agent fighting back. The assumption is that we are not yet smart enough to construct an agent that is perfect on such metrics the moment we start it up.
On myopia: this is definitely a technique that might make agents safer, though I would worry if it implies that an agent is incapable of considering long-term effects when it chooses actions: that would make the agent unsafe in many cases.
Short term, I think myopia is a useful safety technique for many types of agents. Long term, if we have agents that can modify themselves or build new sub-agents, myopia has a big open problem: how do we ensure that any successor agents will have exactly the same myopia? Button-corrigibility avoids having to solve this preservation-of-ignorance problem: it also works for agents and successor agents that are maximally perceptive and intelligent, because it modifies the agent's goals, not its perception or reasoning capability. The successor-agent-building problem does not go away completely though: button-corrigibility still has some hairy problems with agent modification and sub-agents, but overall I have found these to be more tractable than the problem of ignorance preservation.
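As a rough illustration of the worry above (again my own sketch, assuming "myopia" means a hard cut-off on the planning horizon; the toy environment and reward numbers are invented): a horizon-1 planner cannot see consequences that lie beyond the next step, while a longer-horizon planner can.

```python
# Rough sketch (my own illustration): myopia modelled as truncating the planning
# horizon.  The toy environment and rewards are made up; the point is only that
# a horizon-1 agent picks the short-term payoff a longer-horizon agent avoids.

def plan(state, actions, step, horizon):
    """Pick the action whose summed reward over `horizon` steps is highest."""
    def rollout_value(first_action):
        total, s, a = 0.0, state, first_action
        for _ in range(horizon):
            s, reward = step(s, a)
            total += reward
            a = "wait"  # after the first choice, follow a fixed default policy
        return total
    return max(actions, key=rollout_value)

def toy_step(state, action):
    # "risky" pays off now but ruins all later states; "safe" is the reverse.
    if action == "risky":
        return "damaged", 1.0
    if state == "damaged":
        return "damaged", -1.0
    return state, 0.5

print(plan("ok", ["risky", "safe", "wait"], toy_step, horizon=1))   # -> "risky"
print(plan("ok", ["risky", "safe", "wait"], toy_step, horizon=10))  # -> "safe"
```

The open problem is then that a self-modifying agent, or one that builds sub-agents, has no built-in reason to keep that horizon cut-off in place for its successors.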