To motivate the relevant kind of deceptive alignment, you need preferences between different-length trajectories as well as situational awareness. And (I argue in section 19.3) the training proposal will prevent agents from learning those preferences. See in particular:
We begin training the agent to satisfy POST at the very beginning of the reinforcement learning stage, at which point it’s very unlikely to be deceptively aligned (and arguably doesn’t even deserve the label ‘agent’). And when we’re training for POST, every single episode-series is training the agent not to prefer any longer trajectory to any shorter trajectory. The λ^n discount factor is constantly teaching the agent this simple lesson.
Plausibly, then, the agent won’t come to prefer any longer trajectory to any shorter trajectory. And then we can reason as follows (sketched more formally after the list). Since the agent doesn’t prefer any longer trajectory to any shorter trajectory:
it has no incentive to shift probability mass towards longer trajectories,
and hence has no incentive to prevent shutdown in deployment,
and hence has no incentive to preserve its ability to prevent shutdown in deployment,
and hence has no incentive to avoid being made to satisfy Timestep Dominance,
and hence has no incentive to pretend to satisfy Timestep Dominance in training.
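To spell out the first step of that chain a little more formally: this is a rough sketch only, the notation is illustrative rather than taken from the paper, and it assumes a dominance-style bridge from preferences over trajectories to preferences over lotteries. Let $T_s$ be any shorter trajectory, $T_\ell$ any longer trajectory, and let $L'$ be a lottery obtained from a lottery $L$ by moving probability mass from $T_s$ to $T_\ell$. Training for POST aims to ensure

$$\neg\,(T_\ell \succ T_s) \quad \text{for every such pair.}$$

Given that bridge from trajectory preferences to lottery preferences, it follows that

$$\neg\,(L' \succ L),$$

so the agent gains nothing by paying costs to bring about $L'$ rather than $L$: that is, it has no incentive to shift probability mass towards longer trajectories.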
I expect agents’ not caring about shutdown to generalize for the same reason that I expect any other feature to generalize. If you think that (for example) agents’ capabilities will generalize from training to deployment, why do you think their not caring about shutdown won’t?
I don’t assume that reward is the optimization target. Which part of my proposal do you think requires that assumption?
Your point about shutting down subagents is important and I’m not fully satisfied with my proposal on that point. I say a bit about it here.
“at the very beginning of the reinforcement learning stage… it’s very unlikely to be deceptively aligned”
I think this is quite a strong claim (hence the article I linked, indicating that for sufficiently capable models, RL may not be required for situational awareness).
Nothing in the optimization process forces the AI to map the string “shutdown” contained in questions to the ontological concept of a switch turning off the AI. The simplest generalization from RL on questions containing the string “shutdown” is (arguably) for the agent to learn certain behavior for question answering: e.g., the AI learns that saying certain things out loud is undesirable (instead of learning that caring about the turn-off switch is undesirable). People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated, so I wouldn’t expect generalizing to it to be the default.
preference for X over Y ... ”A disposition to represent X as more rewarding than Y (in the reinforcement learning sense of ‘reward’)”
The talk about “giving reward to the agent” also made me think you might be assuming that reward is the optimization target. That being said, as far as I can tell, no part of the proposal depends on that assumption.
In any case, I’ve been thinking about corrigibility for a while and I find this post helpful.