Clearly there’s some tension between “I want to shut down if the user wants me to shut down” and “I want to be helpful so that the user doesn’t want to shut me down”, but I don’t think weak indifference is the correct way to frame this tension.
As a gesture at the correct math, imagine there’s some space of possible futures and some utility function related to the user request. Corrigible AI should define a tradeoff between the number of possible futures its actions affect and the degree to which it satisfies its utility function. Maximum corrigibility {C=1} is the do-nothing state (no effect on possible futures). Minimum corrigibility {C=0} is maximizing the utility function without regard to side-effects (with all the attendant problems: convergent instrumental goals, etc.). Somewhere between C=0 and C=1 is useful corrigible AI. Ideally we should be able to define intermediate values of C in such a way that we can be confident the actions of a corrigible AI are spatially and temporally bounded, as sketched below.
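One way to make that tradeoff concrete (a rough sketch with symbols of my own choosing, not a worked-out proposal): let U(a) be the utility of action a with respect to the user request, and let I(a) be some impact measure counting how much of the space of possible futures a affects, normalized so that only the null action has I = 0. Then a C-corrigible agent picks

$$a^*(C) = \arg\max_a \; (1 - C)\,U(a) \; - \; C\,I(a), \qquad C \in [0,1].$$

At C=1 the objective is pure impact-minimization and the optimum is the do-nothing state; at C=0 it is unconstrained utility maximization. The open question is whether I can be defined so that intermediate values of C actually guarantee bounded side-effects.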
The difficulty principally lies in the fact that there’s no such thing as “spatially and temporally bounded”. Due to the Butterfly Effect, any action at all affects everything in the future light-cone of the agent. In order to come up with a sensible notion of boundedness, we need to define some kind of metric on the space of possible futures, ideally in terms like “an agent could quickly undo everything I’ve just done”. At this point we’ve just recreated agent foundations, though.
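A hedged sketch of what such a metric might look like (again, my notation, and it inherits all the problems just mentioned): define the impact of an action by how close a corrective agent could bring the world back to the do-nothing baseline,

$$I(a) = \min_{\pi \in \Pi} \; d\big(f(s_0, a, \pi),\; f(s_0, \varnothing)\big),$$

where f(s_0, a, π) is the future reached from the current state s_0 by taking action a and then following corrective policy π, f(s_0, ∅) is the future reached by doing nothing, Π is some class of bounded corrective policies, and d is some divergence on futures. Small I(a) is meant to capture “an agent could quickly undo everything I’ve just done”; the catch is that choosing d and Π smuggles the hard agent-foundations questions right back in.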
Obviously we want (1), “actually be helpful”.
Here is a (too long) writeup of the math I was suggesting.