it’ll take a lot of effort for me to read properly (but I will, hopefully in about a week).
Nice, interested to hear what you think!
I think it’s easy to miss ways that a toy model of an incomplete-preference-agent might be really incompetent.
Yep agree that this is a concern, and I plan to think more about this soon.
putting all the hardness into an assumed-adversarially-robust button-manipulation-detector or self-modification-detector etc.
Interested to hear more about this. I’m not sure exactly what you mean by ‘detector’, but I don’t think my proposal requires either of these. The agent won’t try to manipulate the button, because doing so is timestep-dominated by not doing so. And the agent won’t self-modify in ways that stop it being shutdownable, again because doing so is timestep-dominated by not doing so. I don’t think we need a detector in either case.
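To make the dominance claim concrete, here is a rough sketch of a timestep-dominance check. It assumes the usual statement of the relation: X timestep-dominates Y if X gives at least as much expected utility as Y conditional on shutdown at each timestep, and strictly more conditional on shutdown at some timestep. The dict representation, the function name, and all the numbers are made up for illustration and are not part of the proposal itself.

```python
from typing import Mapping


def timestep_dominates(x: Mapping[int, float], y: Mapping[int, float]) -> bool:
    """Return True if lottery x timestep-dominates lottery y.

    Each lottery is represented (as an illustrative assumption) by a mapping
    from shutdown timestep to expected utility conditional on shutdown
    occurring at that timestep.
    """
    if x.keys() != y.keys():
        raise ValueError("lotteries must cover the same shutdown timesteps")
    # Condition 1: x is at least as good conditional on shutdown at every timestep.
    at_least_as_good = all(x[t] >= y[t] for t in x)
    # Condition 2: x is strictly better conditional on shutdown at some timestep.
    strictly_better = any(x[t] > y[t] for t in x)
    return at_least_as_good and strictly_better


# Hypothetical numbers: manipulating the button costs a little utility
# conditional on each shutdown timestep, and only shifts probability
# toward later shutdown.
leave_button_alone = {1: 1.0, 2: 2.0, 3: 3.0}
manipulate_button = {1: 0.9, 2: 1.9, 3: 2.9}

print(timestep_dominates(leave_button_alone, manipulate_button))
print(timestep_dominates(manipulate_button, leave_button_alone))
```

In this toy case, leaving the button alone timestep-dominates manipulating it, so an agent satisfying TD would not choose to manipulate, and no separate detector enters the picture. Note that the check never looks at the probability of shutdown at each timestep, only at the utilities conditional on it.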
because of inner alignment issues
I argue that my proposed training regimen largely circumvents the problems of goal misgeneralization and deceptive alignment. On goal misgeneralization, POST and TD seem simple. On deceptive alignment, agents trained to satisfy POST seem never to get the chance to learn to prefer any longer trajectory to any shorter trajectory. And if the agent doesn’t prefer any longer trajectory to any shorter trajectory, it has no incentive to act deceptively to avoid being made to satisfy TD.
this isn’t what the shutdown problem is about so it isn’t an issue if it doesn’t apply directly to prosaic setups
I’m confused about this. Why isn’t it an issue if some proposed solution to the shutdown problem doesn’t apply directly to prosaic setups? Ultimately, we want to implement some proposed solution, and it seems like an issue if we can’t see any way to do that using current techniques.