Enjoyed reading this (although I'm still making my way through the latter parts). I think it's useful to have a lot of these ideas more formalised.
Wondering whether this simple idea would run up against Theorem 2 (or any of the others), or why it wouldn't work.
Suppose that at every time t some decision maker chooses a probability p(t) with which the AI is shut down. Further, suppose the AI's utility function in any period was initially U(t). Now rescale the new AI's utility function to U(t)/(1 - p(t)). I think this can quite easily be generalised so that future periods of utility are likewise unaffected by an increase in the risk of shutdown.
In this world the AI should be indifferent over changes to p(t) (so long as p(t) can get arbitrarily close to 1 but never reaches it), and so should take actions trying to maximise U whilst being indifferent to whether humans decide to shut it down.
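For concreteness, here's a minimal numerical sketch of the one-period case (my own illustration, with an assumed per-period utility of 10): rescaling by 1/(1 - p) means the expected utility (1 - p) * U/(1 - p) = U is constant in p, so the agent has nothing to gain or lose by influencing p.

```python
# Sketch of the one-period case: with probability p the AI is shut down
# and receives 0; otherwise it receives the rescaled utility u / (1 - p).
# Expected utility should come out equal to u regardless of p.

def expected_utility(u: float, p: float) -> float:
    assert 0 <= p < 1, "p can approach 1 but must never reach it"
    rescaled = u / (1 - p)
    return (1 - p) * rescaled + p * 0.0

u = 10.0  # assumed per-period utility, purely for illustration
for p in (0.0, 0.3, 0.9, 0.999):
    print(f"p = {p:<6} expected utility = {expected_utility(u, p):.6f}")
# Prints 10.000000 each time: the agent is indifferent to changes in p.
```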
Oh cool idea! It seems promising. It also seems similar in one respect to Armstrong's utility indifference proposal discussed in Soares et al. 2015: Armstrong has an additive correcting term that varies to ensure that utility stays the same when the probability of shutdown changes, whereas you have a multiplicative correcting factor that does the same job. So it might be worth checking how your idea fares against the problems that Soares et al. point out for the utility indifference proposal.
Another worry for utility indifference that might carry over to your idea is that at present we don’t know how to specify an agent’s utility function with enough precision to implement a correcting term that varies with the probability of shutdown. One way to overcome that worry would be to give (1) a set of conditions on preferences that together suffice to make the agent representable as maximising that utility function, and (2) a proposed regime for training agents to satisfy those conditions on preferences. Then we could try out the proposal and see if it results in an agent that never resists shutdown. That’s ultimately what I’m aiming to do with my proposal.