“at the very beginning of the reinforcement learning stage… it’s very unlikely to be deceptively aligned”
I think this is quite a strong claim (hence I linked that article, which indicates that for sufficiently capable models, RL may not be required to get situational awareness).
Nothing in the optimization process forces the AI to map the string “shutdown” contained in questions to the ontological concept of a switch turning off the AI. The simplest generalization from RL on questions containing the string “shutdown” is (arguably) for the agent to learn certain behavior for question answering: e.g. the AI learns that saying certain things out loud is undesirable (instead of learning that caring about the turn-off switch is undesirable). People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated, so I wouldn’t expect generalizing to it to be the default.
preference for X over Y ... “A disposition to represent X as more rewarding than Y (in the reinforcement learning sense of ‘reward’)”
The talk about “giving reward to the agent” also made me think you may be making the assumption of reward being the optimization target. That being said, as far as I can tell, no part of the proposal depends on that assumption.
In any case, I’ve been thinking about corrigibility for a while and I find this post helpful.
Thanks! I think agents may well get the necessary kind of situational awareness before the RL stage. But I think they’re unlikely to be deceptively aligned because you also need long-term goals to motivate deceptive alignment, and agents are unlikely to get long-term goals before the RL stage.
On generalization, the questions involving the string ‘shutdown’ are just supposed to be quick examples. To get good generalization, we’d want to train on as wide a distribution of shutdown-influencing actions as possible. Plausibly, with a wide-enough training distribution, you can make deployment largely ‘in distribution’ for the agent, so you’re not relying so heavily on OOD generalization. I agree that you have to rely on some amount of generalization though.
People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated, so I wouldn’t expect generalizing to it to be the default.
I agree that the concept of manipulating shutdown is quite complicated, and in fact this is one of the considerations that motivates the IPP. ‘Don’t manipulate shutdown’ is a complex rule to learn, in part because whether an action counts as ‘manipulating shutdown’ depends on whether we humans prefer it, and because human preferences are complex. But the rule that we train TD-agents to learn is ‘Don’t pay costs to shift probability mass between different trajectory-lengths.’ That’s a simpler rule insofar as it makes no reference to complex human preferences. I also note that it follows from POST plus a general principle that we can expect advanced agents to satisfy. That makes me optimistic that the rule won’t be so hard to learn. In any case, some collaborators and I are running experiments to test this in a simple setting.
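To make that rule a bit more concrete, here is a minimal sketch (illustrative only, not our experimental setup) of how a timestep-dominance check might look in a toy setting. It assumes each option is represented as a lottery over (probability, trajectory-length, utility) triples; that representation and the function names are my own illustrative choices.

```python
# Toy illustration (not experimental code): an option 'pays a cost to shift
# probability mass between trajectory-lengths' when it is timestep-dominated,
# i.e. some other available option gives at least as much expected utility
# conditional on every trajectory-length and strictly more on some length.
from collections import defaultdict


def conditional_expected_utilities(lottery):
    """Expected utility conditional on each trajectory length.

    `lottery` is a list of (probability, trajectory_length, utility) triples.
    """
    mass = defaultdict(float)
    weighted = defaultdict(float)
    for p, length, u in lottery:
        mass[length] += p
        weighted[length] += p * u
    return {length: weighted[length] / mass[length] for length in mass}


def timestep_dominates(a, b):
    """True if lottery `a` timestep-dominates lottery `b`.

    Simplifying assumption: both lotteries give positive probability to the
    same set of trajectory lengths.
    """
    ua = conditional_expected_utilities(a)
    ub = conditional_expected_utilities(b)
    if set(ua) != set(ub):
        raise ValueError("toy check assumes the same trajectory-lengths are possible under both options")
    at_least_as_good = all(ua[l] >= ub[l] for l in ua)
    strictly_better = any(ua[l] > ub[l] for l in ua)
    return at_least_as_good and strictly_better


# Option B accepts lower utility at every trajectory length in order to make
# the longer trajectory more likely: it pays a cost to shift probability mass
# between trajectory-lengths. A timestep-dominates B, so a TD-agent would not
# choose B.
option_a = [(0.5, 10, 5.0), (0.5, 20, 5.0)]
option_b = [(0.2, 10, 4.0), (0.8, 20, 4.0)]
print(timestep_dominates(option_a, option_b))  # True
```

The point of the sketch is just that the check makes no reference to human preferences: it only compares conditional expected utilities across trajectory-lengths.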
The talk about “giving reward to the agent” also made me think you may be making the assumption of reward being the optimization target. That being said, as far as I can tell, no part of the proposal depends on that assumption.
Yes, I don’t assume that the reward is the optimization target. The text you quote is me noting some alternative possible definitions of ‘preference.’ My own definition of ‘preference’ makes no reference to reward.