(These are my own takes, the other authors may disagree)
We briefly address a case that can be viewed as “strategic sycophancy” case in Appendix B in the blog post, which is described similarly to your example. We indeed classify it as an instance of Deceptive Alignment. As you mention, this case does have some differences with ideas commonly associated with Deceptive Alignment, notably the difference in behaviour between oversight and non-oversight. But it does share two important commonalities:
The model is pursuing a goal that its designers do not want.
The model strategically deceives the user (and designer) to further a goal.
Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).
Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).
I agree that deception which is not strategic or intentional could be important to prevent. However,
I expect the failure cases in these scenarios to manifest earlier, making them easier to fix and likely less catastrophic than cases that are strategic and intentional.
Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn’t be very useful. We can use “deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.
Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn’t be very useful. We can use “deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.
Fully agreed. Focusing on clean subproblems is important for making progress.
Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).
Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).
Yeah I would usually expect strategic deception to be better addressed by changing the reward function, as training is simply the standard way to get models to do anything, and there’s no particular reason why you couldn’t fix strategic deception with additional training. Interpretability techniques and other unproven methods seem more valuable if there are problems that cannot be easily addressed via additional training.
(These are my own takes, the other authors may disagree)
We briefly address a case that can be viewed as “strategic sycophancy” case in Appendix B in the blog post, which is described similarly to your example. We indeed classify it as an instance of Deceptive Alignment.
As you mention, this case does have some differences with ideas commonly associated with Deceptive Alignment, notably the difference in behaviour between oversight and non-oversight. But it does share two important commonalities:
The model is pursuing a goal that its designers do not want.
The model strategically deceives the user (and designer) to further a goal.
Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).
Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).
I agree that deception which is not strategic or intentional could be important to prevent. However,
I expect the failure cases in these scenarios to manifest earlier, making them easier to fix and likely less catastrophic than cases that are strategic and intentional.
Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn’t be very useful. We can use “deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.
Fully agreed. Focusing on clean subproblems is important for making progress.
Yeah I would usually expect strategic deception to be better addressed by changing the reward function, as training is simply the standard way to get models to do anything, and there’s no particular reason why you couldn’t fix strategic deception with additional training. Interpretability techniques and other unproven methods seem more valuable if there are problems that cannot be easily addressed via additional training.