I reject Thornley’s assertion that they’re dealbreakers.
Everything you say in this section seems very reasonable. In particular, I think it’s pretty likely that this is true:
It’s okay for our agent to have preferences around the shutdown button (that is: to have it either pressed or unpressed), because we can carefully train into our agent a shallow aversion to manipulating the button, including via side-channels such as humans or other machines. This aversion will likely win out over the agent’s incentives in settings that resemble the training environment. As a result, the agent won’t try to manipulate the button in the early phases of its life, and so will remain shutdownable long enough for a further refinement process to generalize the shallow aversion into a deep and robust preference for non-manipulation.
So I’m not sure whether I think that the problems of reward misspecification, goal misgeneralization, and deceptive alignment are ‘dealbreakers’ in the sense that you’re using the word.
But I do still think that these problems preclude any real assurance of shutdownability: for example, they preclude p(shutdownability) > 95%. It sounds like we’re approximately in agreement on that:
But I also agree that my strategy isn’t ideal. It would be nice to have something robust, where we could get something closer to a formal proof of shutdownability.