My solution to the shutdown problem didn’t get as much attention as I hoped. Here’s why it’s worth your time.
An everywhere-implemented solution to the shutdown problem would send the risk of AI takeover down to ~0.
My solution is shovel-ready. It makes only small tweaks to an otherwise-thoroughly-prosaic setup for training transformative AI.
My solution won first prize and $16,000 in last year’s AI Alignment Awards, judged by Nate Soares, John Wentworth, and Richard Ngo.
I’ve since explained my solution to about 50 people in and around the AI safety community, and all the responses have been various flavours of ‘This seems promising.’ I’ve not yet had any responses of the form ‘I expect this wouldn’t work, for the following reason(s): _____.’
If you read my solution and think it wouldn’t work, let me know. If you think it could work, help me make it happen.
I’ll admit I skimmed over it, not really engaging, because I don’t hold that part of the shutdown problem as a top concern. My top concerns on this topic are whether anyone has that much control over an emerging AI’s preferences, and whether humans have the wisdom to actually attempt shutdown.
It does look promising for the portion it addresses.
Thanks, that’s useful to know. If you have the time, can you say some more about ‘control of an emerging AI’s preferences’? I sketch out a proposed training regimen for the preferences that we want, and argue that this regimen largely circumvents the problems of reward misspecification, goal misgeneralization, and deceptive alignment. Are you not convinced by that part? Or is there some other problem I’m missing?
I am potentially interested. Will read it over then send you a DM.
Nice, interested to hear what you think!
I sometimes name your work in conversation as an example of good recent agent foundations work, based on having read some of it and skimmed the rest, and talked to you a little about it at EAG. It’s on my todo list to work through it properly, and I expect to actually do it because it’s the blocker on me rewriting and posting my “why the shutdown problem is hard” draft, which I really want to post.
The reasons I’m a priori not extremely excited are that it seems intuitively very difficult to avoid either of these issues:
I’d be surprised if an agent with (very) incomplete preferences was real-world competent. I think it’s easy to miss ways that a toy model of an incomplete-preference-agent might be really incompetent.
It’s easy to shuffle around the difficulty of the shutdown problem, e.g. by putting all the hardness into an assumed-adversarially-robust button-manipulation-detector or self-modification-detector etc.
It’s plausible you’ve avoided these problems but I haven’t read deeply enough to know yet. I think it’s easy for issues like this to be hidden (accidentally), so it’ll take a lot of effort for me to read properly (but I will, hopefully in about a week).
The part where it works for a prosaic setup seems wrong (because of inner alignment issues (although I see you cited my post in a footnote about this, thanks!)), but this isn’t what the shutdown problem is about so it isn’t an issue if it doesn’t apply directly to prosaic setups.
Nice, interested to hear what you think!
On the worry that a toy model of an incomplete-preference agent might be really incompetent: yep, agree that this is a concern, and I plan to think more about it soon.
On the worry about an assumed-adversarially-robust button-manipulation-detector or self-modification-detector: interested to hear more. I’m not sure exactly what you mean by ‘detector’, but I don’t think my proposal requires either of these. The agent won’t try to manipulate the button, because doing so is timestep-dominated by not doing so. And the agent won’t self-modify in ways that stop it being shutdownable, again because doing so is timestep-dominated by not doing so. I don’t think we need a detector in either case.
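The timestep-dominance comparison doing the work in this argument can be sketched in a few lines (a toy illustration only: the `timestep_dominates` helper and the payoff numbers are invented for this example, not taken from the proposal):

```python
def timestep_dominates(x, y):
    """x, y: dicts mapping each possible shutdown timestep to the
    agent's expected utility conditional on shutdown at that timestep.
    x timestep-dominates y iff x is at least as good conditional on
    every shutdown timestep, and strictly better on at least one."""
    assert x.keys() == y.keys()
    at_least_as_good = all(x[t] >= y[t] for t in x)
    strictly_better = any(x[t] > y[t] for t in x)
    return at_least_as_good and strictly_better

# Invented payoffs: manipulating the button burns resources, so it
# pays off worse conditional on every shutdown timestep.
leave_button = {1: 0.0, 2: 5.0, 3: 9.0}
manipulate   = {1: -1.0, 2: 4.0, 3: 8.0}

print(timestep_dominates(leave_button, manipulate))  # True
```

On these numbers, leaving the button alone timestep-dominates manipulating it, so an agent that never picks timestep-dominated options never manipulates, with no detector involved.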
I argue that my proposed training regimen largely circumvents the problems of goal misgeneralization and deceptive alignment. On goal misgeneralization, POST and TD seem simple. On deceptive alignment, agents trained to satisfy POST seem never to get the chance to learn to prefer any longer trajectory to any shorter trajectory. And if the agent doesn’t prefer any longer trajectory to any shorter trajectory, it has no incentive to act deceptively to avoid being made to satisfy TD.
I’m confused about this. Why isn’t it an issue if some proposed solution to the shutdown problem doesn’t apply directly to prosaic setups? Ultimately, we want to implement some proposed solution, and it seems like an issue if we can’t see any way to do that using current techniques.
Fair enough, though the post itself does claim prosaic applications.
I found your explanation quite dense, so I only skimmed it initially. I appreciate your inclusion of a brief summary at the start. But that wasn’t enough to motivate me to read the whole proposal in detail, because the summary doesn’t sound like it addresses the hard parts of the shutdown problem.
Much more in another comment on the post.
Thanks, will reply there!
I don’t know about “wouldn’t help”, but it doesn’t seem to me on my initial skim to be a particularly promising approach for prosaic AI safety. I certainly don’t agree with the first two bullets in the above list.
Apologies, but I’ve only skimmed your post, so it’s possible I’m missing some important details.
I left a comment on your post explaining some aspects of why I think this.
Thanks, will reply there!
When I responded in PM, I tried to understand why this is a problem in the first place. I couldn’t identify a practical situation where you would use this.
Why not have the model run in discrete, limited time segments, where what you optimize for in training is completing a task correctly while burning the least compute?
This makes the model a machine whose weights you picked because they exhibit a particular behavior: each run, it submits its answers and shuts down for cleanup. That is, submit equals shutdown (they aren’t separate steps), and exceeding the time limit is also shutdown, with negative reward.
Thus, unwanted behavior is selected against by your training mechanism. This is informed by:
https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target, which frames RL not as “rewarding” the model but as finding a minimum in the space of network weights that forms a machine with the behavior you selected for. In this case: turning in a correct answer quickly, with as little thinking as possible.
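A minimal sketch of the training signal described above (the function name, reward values, and step cost are all invented for illustration):

```python
def episode_reward(correct, steps_used, deadline, step_cost=0.01):
    """Reward for one discrete, time-limited run.
    Submitting an answer ends the episode (submit == shutdown);
    running past the deadline is also shutdown, with negative reward."""
    if steps_used > deadline:
        return -1.0                      # ran past the deadline: penalized shutdown
    if not correct:
        return 0.0                       # finished in time but wrong: no reward
    return 1.0 - step_cost * steps_used  # correct, and cheaper compute is better

print(episode_reward(True, 10, 100))   # 0.9
print(episode_reward(True, 101, 100))  # -1.0
```

Under this signal, weights that produce hanging or deadline-dodging behavior are simply selected against, rather than being argued out of the trained model.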
And I mentioned real-world examples: in real-time control systems, you just throw an error and shut down when the software doesn’t have a decision ready by its deadline. Or you do it purely in hardware, with a hardware timer that can send a signal directly to processor reset, or with a separate microcontroller.
This is a standard way to build autonomous car control systems.
These “I wasn’t asking” approaches force models that “hang” and try to run past a deadline to shut down anyway, and they are resistant to tampering. “Resistant” meaning that tampering is not thought to be physically possible.
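A software analogue of this watchdog pattern can be sketched as follows (a sketch only: it uses Unix signals and must run in the main thread, and, unlike the hardware version described above, software like this could in principle be tampered with):

```python
import signal

class DeadlineExceeded(Exception):
    pass

def _on_alarm(signum, frame):
    raise DeadlineExceeded

def run_with_deadline(task, deadline_s):
    """Run `task`; if it has no result by the deadline, abort it.
    The timer fires regardless of what the task is doing, which is
    the 'I wasn't asking' property of a hardware watchdog."""
    signal.signal(signal.SIGALRM, _on_alarm)  # Unix-only
    signal.alarm(deadline_s)                  # arm the watchdog
    try:
        result = task()
        signal.alarm(0)                       # finished in time: disarm
        return result
    except DeadlineExceeded:
        return "killed"

def spin():  # a task that "hangs" and tries to run forever
    while True:
        pass

print(run_with_deadline(lambda: "done", 1))  # done
print(run_with_deadline(spin, 1))            # killed
```

A real autonomous-vehicle stack would put the timer in hardware, outside the processor’s control, so that even a hung or adversarial process cannot prevent the reset.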