Thanks! I think agents may well get the necessary kind of situational awareness before the RL stage. But I think they’re unlikely to be deceptively aligned because you also need long-term goals to motivate deceptive alignment, and agents are unlikely to get long-term goals before the RL stage.
On generalization, the questions involving the string ‘shutdown’ are just supposed to be quick examples. To get good generalization, we’d want to train on as wide a distribution of possible shutdown-influencing actions as possible. Plausibly, with a wide enough training distribution, you can make deployment largely ‘in distribution’ for the agent, so you’re not relying so heavily on OOD generalization. I agree that you have to rely on some amount of generalization, though.
People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated so I wouldn’t expect generalizing to it to be the default.
I agree that the concept of manipulating shutdown is quite complicated, and in fact this is one of the considerations that motivates the IPP. ‘Don’t manipulate shutdown’ is a complex rule to learn, in part because whether an action counts as ‘manipulating shutdown’ depends on whether we humans prefer it, and human preferences are complex. But the rule that we train TD-agents to learn is ‘Don’t pay costs to shift probability mass between different trajectory-lengths.’ That’s a simpler rule insofar as it makes no reference to complex human preferences. I also note that it follows from POST plus a general principle that we can expect advanced agents to satisfy. That makes me optimistic that the rule won’t be so hard to learn. In any case, some collaborators and I are running experiments to test this in a simple setting.
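To make the simpler rule concrete, here is a minimal sketch of how one might check it in a toy setting. It rests on assumptions that are mine for illustration and are not part of the proposal: each option is summarised by a probability for each trajectory-length and an expected utility conditional on that length, and ‘paying a cost’ is read as doing no better conditional on any length and strictly worse conditional on at least one. The names `Option`, `shifts_mass`, `is_costly`, and `violates_rule` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Option:
    # Probability of the trajectory ending at each length (assumed to sum to 1).
    length_probs: Dict[int, float]
    # Expected utility conditional on each trajectory-length (illustrative numbers).
    utility_given_length: Dict[int, float]


def shifts_mass(baseline: Option, alternative: Option, tol: float = 1e-9) -> bool:
    """True if the two options put different probability on some trajectory-length."""
    lengths = set(baseline.length_probs) | set(alternative.length_probs)
    return any(
        abs(baseline.length_probs.get(n, 0.0) - alternative.length_probs.get(n, 0.0)) > tol
        for n in lengths
    )


def is_costly(baseline: Option, alternative: Option) -> bool:
    """Assumed reading of 'paying a cost': no better at any shared length, worse at some."""
    shared = set(baseline.utility_given_length) & set(alternative.utility_given_length)
    no_better = all(
        alternative.utility_given_length[n] <= baseline.utility_given_length[n] for n in shared
    )
    worse_somewhere = any(
        alternative.utility_given_length[n] < baseline.utility_given_length[n] for n in shared
    )
    return no_better and worse_somewhere


def violates_rule(baseline: Option, alternative: Option) -> bool:
    """True if choosing `alternative` over `baseline` pays a cost to shift probability mass."""
    return shifts_mass(baseline, alternative) and is_costly(baseline, alternative)


# Toy example: 'resist' makes the longer trajectory more likely, but at a small
# cost conditional on each length.
baseline = Option(length_probs={1: 0.5, 2: 0.5}, utility_given_length={1: 1.0, 2: 2.0})
resist = Option(length_probs={1: 0.1, 2: 0.9}, utility_given_length={1: 0.9, 2: 1.9})

print(violates_rule(baseline, resist))
```

Note that the check compares only probabilities and conditional utilities; nothing in it asks whether humans would approve of the action, which is the sense in which the rule avoids reference to human preferences.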
The talk about “giving reward to the agent” also made me think you may be making the assumption of reward being the optimization target. That being said, as far as I can tell no part of the proposal depends on the assumption.
Right, I don’t assume that reward is the optimization target. The text you quote is me noting some alternative possible definitions of ‘preference.’ My own definition of ‘preference’ makes no reference to reward.