Oh cool idea! It seems promising. It also seems similar in one respect to Armstrong’s utility indifference proposal discussed in Soares et al. 2015: Armstrong has a correcting term that varies to ensure that utility stays the same when the probability of shutdown changes, whereas you have a correcting factor that varies to the same end. So it might be worth checking how your idea fares against the problems that Soares et al. point out for the utility indifference proposal.
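To make the comparison concrete, here’s a rough sketch of the contrast as I understand it (my notation, and only a paraphrase of the setup in Soares et al. 2015, so apologies if I’m misreading your proposal). Utility indifference has the agent maximise something like

$$U(a) = \begin{cases} U_N(a) & \text{if the shutdown button isn't pressed} \\ U_S(a) + \theta & \text{if the shutdown button is pressed,} \end{cases}$$

where $U_N$ is the normal utility function, $U_S$ is the shutdown utility function, and the correcting term $\theta$ is set so that the agent’s expected utility comes out the same whether or not the button gets pressed. Your proposal, if I’m reading it right, instead multiplies by a correcting factor $\lambda$ in one of the two cases, with $\lambda$ chosen to the same end. Either way, the quantity doing the work has to be recomputed whenever the probability of shutdown changes.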
Another worry for utility indifference that might carry over to your idea is that at present we don’t know how to specify an agent’s utility function with enough precision to implement a correcting term that varies with the probability of shutdown. One way to overcome that worry would be to give (1) a set of conditions on preferences that together suffice to make the agent representable as maximising that utility function, and (2) a proposed regime for training agents to satisfy those conditions on preferences. Then we could try out the proposal and see if it results in an agent that never resists shutdown. That’s ultimately what I’m aiming to do with my proposal.
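And just to illustrate what I mean by (1), the classic example of such a set of conditions is the von Neumann–Morgenstern axioms (this is only the standard illustration, not the conditions I actually have in mind): if the agent’s preference relation $\succeq$ over lotteries is complete, transitive, continuous, and satisfies independence, then there exists a utility function $u$ with

$$L \succeq L' \iff \mathbb{E}_L[u] \ge \mathbb{E}_{L'}[u],$$

so the agent is representable as maximising expected utility. The hope would be to find an analogous set of conditions for the corrected utility function, plus a training regime that reliably produces agents satisfying them.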