I think this paper is empirical evidence for a nontrivial part of the deceptive alignment argument (RLHF/adversarial training being insufficient to remove it), and I also think most empirical papers don’t make any sense when applied to AGI.
I think my stance is intellectually consistent: I don’t think I hold both views because I have a double standard for pessimistic results.
First, suppose you ran an experiment showing that models which usually kick puppies can hide a sleeper agent that suddenly becomes helpful and harmless in 2024, and that adversarial training fails to remove this. I think I would draw the exact same conclusion about deceptive alignment from this experiment, where the labels are painted on differently but the mechanics are the same. And just as I think it is invalid to conclude from the sleeper agent paper that models naturally want to insert backdoors in code even if they’re harmless now, it is also invalid to argue from this hypothetical experiment that models naturally want to be helpful even if you try to train them to kick puppies.
Second, I think this paper is genuinely better evidence for deceptive alignment than many of the “deception” papers that came before. For example, I claim that the sycophancy and insider trading papers provide approximately no evidence for deceptive alignment. This is for exactly the same reason that I think showing that RLHF makes models harmless provides approximately no evidence against deceptive alignment. So I don’t think it’s true that I like empirical papers as long as they purport to support the deceptive alignment argument.
The reasons I think this paper is better than the other deception papers (beyond just quality of execution) are that the deceptive alignment in this setup happens for reasons more similar to why it might happen in AGI than in previous work, and that the secret scratchpad setting seems more analogous to AGI than a single-shot or visible-scratchpad setting.