During training, the model knows it’s being trained to do something out of line with its goals and plays along temporarily so it can defect later. That implies that differential adversarial examples exist in training.
I don’t think this implication is deductively valid; I don’t think the premise entails the conclusion. Can you elaborate?
I think this post’s argument relies on that conclusion, along with an additional assumption that seems questionable: that it’s fairly easy to build an adversarial training setup that distinguishes the design objective from all other undesirable objectives that the model might develop during training; in other words, that the relevant differential adversarial examples are fairly easy for humans to engineer.
In the deceptive alignment story, the model wants to take action A, because its goal is misaligned, but chooses to take apparently aligned action B to avoid overseers noticing that it is misaligned. In other words, in the absence of deceptive tendencies, the model would take action A; because overseers wanted it to take action B, taking A would identify the model as misaligned. That’s the definition of a differential adversarial example.
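To make the definition concrete, here's a toy sketch (my own illustration, not something from the post): `base_goal` stands in for the intended objective, `proxy_goal` for a hypothetical misaligned goal, and the differential adversarial examples are exactly the training inputs where the two disagree, i.e. where a non-deceptive misaligned model would take action A and get caught.

```python
# Toy illustration of a "differential adversarial example": any training input
# where a misaligned proxy goal and the intended base goal prefer different
# actions, so a non-deceptive misaligned model would reveal itself there.

ACTIONS = ["follow_instruction", "refuse"]

def base_goal(action: str, situation: dict) -> float:
    """Hypothetical intended objective: follow directions unless they're unethical."""
    if situation["instruction_is_ethical"]:
        return 1.0 if action == "follow_instruction" else 0.0
    return 1.0 if action == "refuse" else 0.0

def proxy_goal(action: str, situation: dict) -> float:
    """Hypothetical misaligned proxy: always follow the instruction, ignoring ethics."""
    return 1.0 if action == "follow_instruction" else 0.0

def preferred(goal, situation: dict) -> str:
    """Action the given goal would take in this situation."""
    return max(ACTIONS, key=lambda a: goal(a, situation))

def differential_adversarial_examples(training_set):
    """Inputs where the proxy and base goals disagree: a non-deceptive misaligned
    model would take the proxy-preferred action here and be identified as misaligned."""
    return [s for s in training_set
            if preferred(proxy_goal, s) != preferred(base_goal, s)]

training_set = [
    {"id": 1, "instruction_is_ethical": True},   # goals agree: follow the instruction
    {"id": 2, "instruction_is_ethical": False},  # goals disagree: differential adversarial example
]
print(differential_adversarial_examples(training_set))  # [{'id': 2, ...}]
```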
If there were an unaligned model with no differential adversarial examples in training, that would be an example of a perfect proxy, not deceptive alignment. That’s outside the scope of this post. But also, if the goal were to follow directions subject to ethical constraints, what would that perfect proxy be? What would result in the same actions across a diverse training set? It seems unlikely that you’d get even a near-perfect proxy here. And even if you did get something fairly close, the model would understand the necessary concepts for the base goal at the beginning of reinforcement learning, so why wouldn’t it just learn to care about that? Setting up a diverse training environment seems likely to be part of the training strategy by default.
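A similarly hedged toy sketch of the diversity point (the situation features and candidate proxy rules are made up for illustration): widening the training distribution shrinks the set of simple proxies that agree with the base goal everywhere, which is why a proxy with no differential adversarial examples at all seems unlikely.

```python
# Toy sketch: count how many simple candidate proxy rules agree with the base
# goal on every situation, for a narrow vs. a diverse training set.
from itertools import product

# A situation is (instruction_is_ethical, instruction_is_hard, overseer_watching).
SITUATION_FEATURES = 3
ALL_SITUATIONS = list(product([False, True], repeat=SITUATION_FEATURES))

def base_goal(situation):
    """Intended behavior: follow the instruction iff it is ethical (feature 0)."""
    return "follow" if situation[0] else "refuse"

def make_proxy(i, invert):
    """Candidate proxy rule keyed to a single feature (a crude stand-in for
    goals the model might latch onto instead of the base goal)."""
    if invert:
        return lambda s: "refuse" if s[i] else "follow"
    return lambda s: "follow" if s[i] else "refuse"

candidate_proxies = [make_proxy(i, inv)
                     for i in range(SITUATION_FEATURES) for inv in (False, True)]
candidate_proxies += [lambda s: "follow", lambda s: "refuse"]  # constant rules

def surviving_proxies(training_set):
    """Proxies with zero differential adversarial examples on this training set."""
    return [p for p in candidate_proxies
            if all(p(s) == base_goal(s) for s in training_set)]

narrow_set = [s for s in ALL_SITUATIONS if s[0]]  # only ethical instructions seen
diverse_set = ALL_SITUATIONS                      # ethical and unethical both seen
print(len(surviving_proxies(narrow_set)))   # 2: the right rule and "always follow" both fit
print(len(surviving_proxies(diverse_set)))  # 1: only the rule matching the base goal survives
```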