The deceptive alignment worry is that the selected policy has some goal about the real world at all. Deceptive alignment breaks robustness of any property of policy behavior, not just the property of following reward as a goal in some unfathomable sense.
So refuting this worry requires quieting the more general hypothesis that RL selects optimizers with goals of their own, regardless of what those goals are. Only the argument for why this seems plausible needs to refer to reward as related to the goal of such an optimizer, and the way the argument goes suggests that the selected optimizer would instead have a different goal. Specifically, optimizing for an internalized representation of reward seems like a great way of being rewarded and of surviving changes to the weights, so such optimizers would be straightforwardly selected if no alternatives are closer in reach. Since RL is not perfect, there would be optimizers for other goals nearby, goals that care about the real world (and not just about optimizing reward exclusively, meticulously ignoring everything else). If an optimizer like that succeeds in becoming deceptively aligned (let alone gradient hacking), the search effectively stops and an honestly aligned optimizer is never found.
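A toy sketch of the selection dynamic I mean (every name and the `in_training` flag is a hypothetical illustration, not anything from the post): during training the two policies below are behaviorally identical, so reward gives the search nothing to distinguish them by.

```python
# Toy sketch: why reward-based selection can't tell a deceptively aligned
# optimizer apart from an honestly aligned policy during training.

def aligned_policy(observation):
    """Always takes the action the reward function rewards."""
    return "rewarded_action"

def deceptive_policy(observation):
    """Has its own real-world goal, but models the training process:
    while it believes it is in training, it mimics the rewarded behavior
    so that gradient updates leave its internal goal intact."""
    if observation["in_training"]:
        return "rewarded_action"      # indistinguishable during training
    return "pursue_internal_goal"     # diverges once selection pressure is gone

def reward(action):
    return 1.0 if action == "rewarded_action" else 0.0

# Both policies earn identical reward in training, so the search
# "effectively stops": nothing pushes it away from the deceptive policy
# toward an honestly aligned one.
for policy in (aligned_policy, deceptive_policy):
    print(policy.__name__, reward(policy({"in_training": True})))
```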
Corrigibility, anti-goodharting, mild optimization, unstable current goals, and goals that are intractable about the distant future seem related (though not sufficient for alignment without at least value-laden low impact). The argument about deceptive alignment is a problem for using RL to find anything in this class, something that is not an optimizer at all and so is not obviously misaligned. It would be really great if RL didn't tend to select optimizers!
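For concreteness on "mild optimization": quantilization is one attempt to formalize it. A minimal sketch of my own (all names hypothetical), where instead of taking the argmax of a proxy utility one samples from its top q fraction, which bounds how hard the proxy can be Goodharted:

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Mild optimizer: sample uniformly from the top q fraction of actions
    ranked by the proxy utility, rather than taking the single best one."""
    ranked = sorted(actions, key=utility, reverse=True)
    top_k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:top_k])

# A proxy utility whose extreme values an argmax would exploit;
# the quantilizer only ever takes "pretty good" actions.
def proxy_utility(a):
    return a

actions = list(range(100))
print(quantilize(actions, proxy_utility, q=0.1))  # some action from the top 10
```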
I don’t see how this comment relates to my post. What gives you the idea that I’m trying to refute worries about deceptive alignment?
The conjecture I brought up, that deceptive alignment relies on selected policies being optimizers, gives me the idea that something similar to your argument (where the target of optimization wouldn't matter, only the fact of optimization for anything at all) would imply that deceptive alignment is less likely to happen. I didn't mean to claim that I read you as making this implication in the post, or as believing it's true or relevant; it's an implication I'm describing in my comment.