How would you characterize strategic sycophancy? Assume that during RLHF a language model is rewarded for mimicking the beliefs of its conversational partner, and therefore the model intelligently learns to predict the conversational partner’s beliefs and mimic them. But upon reflection, the conversational partner and AI developers would prefer that the model report its beliefs honestly.
Under the current taxonomy, this would seemingly be classified as deceptive alignment. The AI’s goals are misaligned with the designer’s intentions, and it uses strategic deception to achieve them. But sycophancy doesn’t include many of the ideas commonly associated with deceptive alignment, such as situational awareness and a difference in behavior between train and test time. Sycophancy can be solved by changing the training signal to not incentivize sycophancy, whereas the hardest to fix forms of deceptive alignment cannot be eliminated by changing the training signal.
It seems like the most concerning forms of deceptive alignment include stipulations about situational awareness and the idea that the behavior cannot necessarily be fixed by changing the training objective.
Separately, it seems that deception which is not strategic or intentional but is consistently produced by the training objective is also important. Considering cases like Paul Christiano’s robot hand that learned to deceive human feedback and Ofria’s evolutionary agents that learned to alter their behavior during evaluation, it seems that AI systems can learn to systematically deceive human oversight without being aware of their strategy. In the future, we might see powerful foundation models which are honestly convinced that giving them power in the real world is the best way to achieve their designers’ intentions. This belief might be false but evolutionarily useful, making these models disproportionately likely to gain power. This case would not be called “strategic deception” or “deceptive alignment” if you require intentionality, but it seems very important to prevent.
Overall I think it’s very difficult to come up with clean taxonomies of AI deception. I spent >100 hours thinking and writing about this in advance of Park et al 2023 and my Hoodwinked paper, and ultimately we ended up steering clear of taxonomies because we didn’t have a strong taxonomy that we could defend. Ward et al 2023 formalize a concrete notion of deception, but they also ignore the unintentional deception discussed above. The Stanford Encyclopedia of Philosophy takes 17,000 words to explain that philosophers don’t agree on definitions of lying and deception. Without rigorous formal definitions, I still think it’s important to communicate the broad strokes of these ideas publicly, but I’d lean towards readily admitting the messiness of our various definitions of deception.
(personal opinion; might differ from other authors of the post)
Thanks for both questions. I think they are very important.
1. Regarding sycophancy: For me it mostly depends on whether it is strategic or not. If the model has the goal of being sycophantic and then reasons through that in a strategic way, I’d say this counts as strategic deception and deceptive alignment. If the model is sycophantic but doesn’t reason through that, I’d probably not classify it as such. I think it’s fine to use different terms for the different phenomena and have sycophancy not be a true subset of deceptive alignment.
2. There are multiple perspectives of the non-strategic, non-intentional deception and I haven’t thought about it in detail, so I might change my mind. However, currently, I think the following: a) I guess this will be less and less of a problem in the long run. I expect models to be sufficiently capable and situationally aware at some point that they straightforwardly understand what we want and thus rarely accidentally become systematically deceptive. I think there is still a chance that these things will happen and I’m happy that ARC and others are working on them but I personally think it has become compared to intentional, strategic deception. b) It just seems really hard to be I) non-intentionally, II) but systematically deceptive, while III) humans are actively looking for signs of deception, and IV) actually cause catastrophic harm with that strategy. As soon as a model has sufficient situational awareness and is generally capable to avoid human oversight I feel like it may just realize what it’s doing and then the deception either stops or becomes intentional. (weekly held, haven’t thought about it in detail)
Thanks! First response makes sense, there’s a lot of different ways you could cut it.
On the question of non-strategic, non-intentional deception, I agree that deceptive alignment is much more concerning in the medium term. But suppose that we develop techniques for making models honest. If mechanistic interpretability, unsupervised knowledge detection, or another approach to ELK pans out, we’ll have models which reliably do what they believe is best according to their designer’s goals. What major risks might emerge at that point?
Like an honest AI, humans will often only do what they consciously believe is morally right. Yet the CEOs of tobacco and oil companies believe that their work is morally justified. Soldiers on both sides of a battlefield will believe they’re on the side of justice. Scientists often advance dangerous technologies in the names of truth and progress. Sometimes, these people are cynical, pursuing their self-interest even if they believe it’s immoral. But many believe they are doing the right thing. How do we explain that?
These are not cases of deception, but rather self-deception. These individuals operate in an environment where certain beliefs are advantageous. You will not become the CEO of a tobacco company or a leading military commander if you don’t believe your cause is justified. Even if everyone is perfectly honest about their own beliefs and only pursues what they believe is normatively right, the selection pressure from the environment is so strong that many powerful people will end up with harmful false beliefs.
Even if we build honest AI systems, they could be vulnerable to self-deception encouraged by environmental selection pressure. This is a longer term concern, and the first goal should be to build honest AI systems. But it’s important to keep in mind the problems that would not be solved by honesty alone.
Fully agree that this is a problem. My intuition that the self-deception part is much easier to solve than the “how do we make AIs honest in the first place” part.
If we had honest AIs that are convinced bad goals are justified, we would likely find ways to give them less power or deselect them early. The problem mostly arises when we can’t rely on the selection mechanisms because the AI games them.
(These are my own takes, the other authors may disagree)
We briefly address a case that can be viewed as “strategic sycophancy” case in Appendix B in the blog post, which is described similarly to your example. We indeed classify it as an instance of Deceptive Alignment. As you mention, this case does have some differences with ideas commonly associated with Deceptive Alignment, notably the difference in behaviour between oversight and non-oversight. But it does share two important commonalities:
The model is pursuing a goal that its designers do not want.
The model strategically deceives the user (and designer) to further a goal.
Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).
Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).
I agree that deception which is not strategic or intentional could be important to prevent. However,
I expect the failure cases in these scenarios to manifest earlier, making them easier to fix and likely less catastrophic than cases that are strategic and intentional.
Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn’t be very useful. We can use “deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.
Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn’t be very useful. We can use “deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.
Fully agreed. Focusing on clean subproblems is important for making progress.
Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).
Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).
Yeah I would usually expect strategic deception to be better addressed by changing the reward function, as training is simply the standard way to get models to do anything, and there’s no particular reason why you couldn’t fix strategic deception with additional training. Interpretability techniques and other unproven methods seem more valuable if there are problems that cannot be easily addressed via additional training.
How would you characterize strategic sycophancy? Assume that during RLHF a language model is rewarded for mimicking the beliefs of its conversational partner, and therefore the model intelligently learns to predict the conversational partner’s beliefs and mimic them. But upon reflection, the conversational partner and AI developers would prefer that the model report its beliefs honestly.
Under the current taxonomy, this would seemingly be classified as deceptive alignment. The AI’s goals are misaligned with the designer’s intentions, and it uses strategic deception to achieve them. But sycophancy doesn’t include many of the ideas commonly associated with deceptive alignment, such as situational awareness and a difference in behavior between train and test time. Sycophancy can be solved by changing the training signal to not incentivize sycophancy, whereas the hardest to fix forms of deceptive alignment cannot be eliminated by changing the training signal.
It seems like the most concerning forms of deceptive alignment include stipulations about situational awareness and the idea that the behavior cannot necessarily be fixed by changing the training objective.
Separately, it seems that deception which is not strategic or intentional but is consistently produced by the training objective is also important. Considering cases like Paul Christiano’s robot hand that learned to deceive human feedback and Ofria’s evolutionary agents that learned to alter their behavior during evaluation, it seems that AI systems can learn to systematically deceive human oversight without being aware of their strategy. In the future, we might see powerful foundation models which are honestly convinced that giving them power in the real world is the best way to achieve their designers’ intentions. This belief might be false but evolutionarily useful, making these models disproportionately likely to gain power. This case would not be called “strategic deception” or “deceptive alignment” if you require intentionality, but it seems very important to prevent.
Overall I think it’s very difficult to come up with clean taxonomies of AI deception. I spent >100 hours thinking and writing about this in advance of Park et al 2023 and my Hoodwinked paper, and ultimately we ended up steering clear of taxonomies because we didn’t have a strong taxonomy that we could defend. Ward et al 2023 formalize a concrete notion of deception, but they also ignore the unintentional deception discussed above. The Stanford Encyclopedia of Philosophy takes 17,000 words to explain that philosophers don’t agree on definitions of lying and deception. Without rigorous formal definitions, I still think it’s important to communicate the broad strokes of these ideas publicly, but I’d lean towards readily admitting the messiness of our various definitions of deception.
(personal opinion; might differ from other authors of the post)
Thanks for both questions. I think they are very important.
1. Regarding sycophancy: For me it mostly depends on whether it is strategic or not. If the model has the goal of being sycophantic and then reasons through that in a strategic way, I’d say this counts as strategic deception and deceptive alignment. If the model is sycophantic but doesn’t reason through that, I’d probably not classify it as such. I think it’s fine to use different terms for the different phenomena and have sycophancy not be a true subset of deceptive alignment.
2. There are multiple perspectives of the non-strategic, non-intentional deception and I haven’t thought about it in detail, so I might change my mind. However, currently, I think the following:
a) I guess this will be less and less of a problem in the long run. I expect models to be sufficiently capable and situationally aware at some point that they straightforwardly understand what we want and thus rarely accidentally become systematically deceptive. I think there is still a chance that these things will happen and I’m happy that ARC and others are working on them but I personally think it has become compared to intentional, strategic deception.
b) It just seems really hard to be I) non-intentionally, II) but systematically deceptive, while III) humans are actively looking for signs of deception, and IV) actually cause catastrophic harm with that strategy. As soon as a model has sufficient situational awareness and is generally capable to avoid human oversight I feel like it may just realize what it’s doing and then the deception either stops or becomes intentional. (weekly held, haven’t thought about it in detail)
Thanks! First response makes sense, there’s a lot of different ways you could cut it.
On the question of non-strategic, non-intentional deception, I agree that deceptive alignment is much more concerning in the medium term. But suppose that we develop techniques for making models honest. If mechanistic interpretability, unsupervised knowledge detection, or another approach to ELK pans out, we’ll have models which reliably do what they believe is best according to their designer’s goals. What major risks might emerge at that point?
Like an honest AI, humans will often only do what they consciously believe is morally right. Yet the CEOs of tobacco and oil companies believe that their work is morally justified. Soldiers on both sides of a battlefield will believe they’re on the side of justice. Scientists often advance dangerous technologies in the names of truth and progress. Sometimes, these people are cynical, pursuing their self-interest even if they believe it’s immoral. But many believe they are doing the right thing. How do we explain that?
These are not cases of deception, but rather self-deception. These individuals operate in an environment where certain beliefs are advantageous. You will not become the CEO of a tobacco company or a leading military commander if you don’t believe your cause is justified. Even if everyone is perfectly honest about their own beliefs and only pursues what they believe is normatively right, the selection pressure from the environment is so strong that many powerful people will end up with harmful false beliefs.
Even if we build honest AI systems, they could be vulnerable to self-deception encouraged by environmental selection pressure. This is a longer term concern, and the first goal should be to build honest AI systems. But it’s important to keep in mind the problems that would not be solved by honesty alone.
Fully agree that this is a problem. My intuition that the self-deception part is much easier to solve than the “how do we make AIs honest in the first place” part.
If we had honest AIs that are convinced bad goals are justified, we would likely find ways to give them less power or deselect them early. The problem mostly arises when we can’t rely on the selection mechanisms because the AI games them.
(These are my own takes, the other authors may disagree)
We briefly address a case that can be viewed as “strategic sycophancy” case in Appendix B in the blog post, which is described similarly to your example. We indeed classify it as an instance of Deceptive Alignment.
As you mention, this case does have some differences with ideas commonly associated with Deceptive Alignment, notably the difference in behaviour between oversight and non-oversight. But it does share two important commonalities:
The model is pursuing a goal that its designers do not want.
The model strategically deceives the user (and designer) to further a goal.
Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).
Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).
I agree that deception which is not strategic or intentional could be important to prevent. However,
I expect the failure cases in these scenarios to manifest earlier, making them easier to fix and likely less catastrophic than cases that are strategic and intentional.
Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn’t be very useful. We can use “deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.
Fully agreed. Focusing on clean subproblems is important for making progress.
Yeah I would usually expect strategic deception to be better addressed by changing the reward function, as training is simply the standard way to get models to do anything, and there’s no particular reason why you couldn’t fix strategic deception with additional training. Interpretability techniques and other unproven methods seem more valuable if there are problems that cannot be easily addressed via additional training.