I’m not sure I understand the actual training proposal completely, but I’m skeptical that it would work.
When you run the RL phase at the end, you apply it to a capable and potentially situationally-aware AI (situational awareness in LLMs). The AI could be deceptive or gradient-hack. I am not confident this training proposal would scale to agents capable enough to resist shutdown.
If you do RL on answering questions that impact shutdown, you teach the AI to answer those questions appropriately. I see no reason why this would generalize to actual actions that impact shutdown (e.g. cutting the wire of the shutdown button). There also seems to be an assumption that we could give the AI some “reward tokens” the way you give monkeys bananas to train them; however, Reward is not the optimization target.
One thing commonly labeled as part of corrigibility is that shutting down an agent should automatically shut down all the subagents it created. Otherwise we could end up with a robot producing nanobots where we need to individually press the shutdown button of each nanobot.
To motivate the relevant kind of deceptive alignment, you need preferences between different-length trajectories as well as situational awareness. And (I argue in section 19.3) the training proposal will prevent agents from learning those preferences. See in particular:
We begin training the agent to satisfy POST at the very beginning of the reinforcement learning stage, at which point it’s very unlikely to be deceptively aligned (and arguably doesn’t even deserve the label ‘agent’). And when we’re training for POST, every single episode-series is training the agent not to prefer any longer trajectory to any shorter trajectory. The λⁿ discount factor is constantly teaching the agent this simple lesson.
Plausibly then, the agent won’t come to prefer any longer trajectory to any shorter trajectory. And then we can reason as follows. Since the agent doesn’t prefer any longer trajectory to any shorter trajectory:
it has no incentive to shift probability mass towards longer trajectories,
and hence has no incentive to prevent shutdown in deployment,
and hence has no incentive to preserve its ability to prevent shutdown in deployment,
and hence has no incentive to avoid being made to satisfy Timestep Dominance,
and hence has no incentive to pretend to satisfy Timestep Dominance in training.
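To make the role of that λⁿ discount concrete, here is a minimal sketch of how a discount on repeated trajectory-lengths can work. The constant, the per-length counter, and the helper function below are simplified illustrations rather than the full reward function:

```python
# Illustrative sketch only: not the exact reward function from the proposal.
from collections import defaultdict

LAMBDA = 0.9  # illustrative discount base, 0 < LAMBDA < 1

# How many episodes in the current episode-series ended with each trajectory length.
length_counts = defaultdict(int)

def discounted_reward(preliminary_reward: float, trajectory_length: int) -> float:
    """Scale an episode's reward by LAMBDA ** n, where n is the number of earlier
    episodes in this series that ended with the same trajectory length."""
    n = length_counts[trajectory_length]
    length_counts[trajectory_length] += 1
    return preliminary_reward * (LAMBDA ** n)

# An agent that always picks long trajectories sees its reward decay,
# whereas mixing lengths keeps the discount near 1:
print(discounted_reward(1.0, trajectory_length=10))  # 1.0  (first long episode)
print(discounted_reward(1.0, trajectory_length=10))  # 0.9  (second long episode)
print(discounted_reward(1.0, trajectory_length=3))   # 1.0  (first short episode)
```

The point of the sketch is just that any systematic bias toward one trajectory length gets its reward shrunk, which is the sense in which the training never teaches the agent to prefer longer trajectories over shorter ones.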
I expect agents’ not caring about shutdown to generalize for the same reason that I expect any other feature to generalize. If you think that, e.g., agents’ capabilities will generalize from training to deployment, why do you think their not caring about shutdown won’t?
I don’t assume that reward is the optimization target. Which part of my proposal do you think requires that assumption?
Your point about shutting down subagents is important and I’m not fully satisfied with my proposal on that point. I say a bit about it here.
“at the very beginning of the reinforcement learning stage… it’s very unlikely to be deceptively aligned”
I think this is quite a strong claim (hence I linked the article indicating that, for sufficiently capable models, RL may not be required to get situational awareness).
Nothing in the optimization process forces the AI to map the string “shutdown” contained in questions to the ontological concept of a switch turning off the AI. The simplest generalization from RL on questions containing the string “shutdown” is (arguably) for the agent to learn certain behavior for question answering: e.g. the AI learns that saying certain things out loud is undesirable (rather than learning that caring about the turn-off switch is undesirable). People would likely disagree on what counts as manipulating shutdown, which suggests that the concept of manipulating shutdown is quite complicated, so I wouldn’t expect generalizing to it to be the default.
preference for X over Y ... ”A disposition to represent X as more rewarding than Y (in the reinforcement learning sense of ‘reward’)”
The talk about “giving reward to the agent” also made me think you may be assuming that reward is the optimization target. That said, as far as I can tell, no part of the proposal depends on that assumption.
In any case, I’ve been thinking about corrigibility for a while and I find this post helpful.