“at the very beginning of the reinforcement learning stage… it’s very unlikely to be deceptively aligned”
I think this is quite a strong claim (hence, I linked that article indicating that for sufficiently capable models, RL may not be required to get situational awareness).
Nothing in the optimization process forces the AI to map the string “shutdown” contained in questions to the ontological concept of a switch turning off the AI. The simplest generalization from RL on questions containing the string “shutdown” is (arguably) for the agent to learn certain behavior for question answering—e.g. the AI learns that saying certain things outloud is undesirable (instead of learning that caring about the turn off switch is undesirable). People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated so I wouldn’t expect generalizing to it to be the default.
preference for X over Y ... ”A disposition to represent X as more rewarding than Y (in the reinforcement learning sense of ‘reward’)”
The talk about “giving reward to the agent” also made me think you may be making the assumption of reward being the optimization target. That being said, as far as I can tell no part of the proposal depends on the assumption.
In any case, I’ve been thinking about corrigibility for a while and I find this post helpful.
I think this is quite a strong claim (hence, I linked that article indicating that for sufficiently capable models, RL may not be required to get situational awareness).
Nothing in the optimization process forces the AI to map the string “shutdown” contained in questions to the ontological concept of a switch turning off the AI. The simplest generalization from RL on questions containing the string “shutdown” is (arguably) for the agent to learn certain behavior for question answering—e.g. the AI learns that saying certain things outloud is undesirable (instead of learning that caring about the turn off switch is undesirable). People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated so I wouldn’t expect generalizing to it to be the default.
The talk about “giving reward to the agent” also made me think you may be making the assumption of reward being the optimization target. That being said, as far as I can tell no part of the proposal depends on the assumption.
In any case, I’ve been thinking about corrigibility for a while and I find this post helpful.