(personal opinion; might differ from other authors of the post)
Thanks for both questions. I think they are very important.
1. Regarding sycophancy: For me it mostly depends on whether it is strategic or not. If the model has the goal of being sycophantic and then reasons through that in a strategic way, I’d say this counts as strategic deception and deceptive alignment. If the model is sycophantic but doesn’t reason through that, I’d probably not classify it as such. I think it’s fine to use different terms for the different phenomena and have sycophancy not be a true subset of deceptive alignment.
2. There are multiple perspectives on non-strategic, non-intentional deception, and I haven’t thought about it in detail, so I might change my mind. Currently, though, I think the following:
a) I guess this will be less and less of a problem in the long run. I expect models to be sufficiently capable and situationally aware at some point that they straightforwardly understand what we want and thus rarely become systematically deceptive by accident. I think there is still a chance that these things will happen, and I’m happy that ARC and others are working on them, but I personally think it has become less concerning compared to intentional, strategic deception.
b) It just seems really hard to be I) non-intentionally II) but systematically deceptive, while III) humans are actively looking for signs of deception, and IV) to actually cause catastrophic harm with that strategy. As soon as a model has sufficient situational awareness and is generally capable enough to avoid human oversight, I feel like it may just realize what it’s doing, and then the deception either stops or becomes intentional. (weakly held, haven’t thought about it in detail)
Thanks! The first response makes sense; there are a lot of different ways you could cut it.
On the question of non-strategic, non-intentional deception, I agree that deceptive alignment is much more concerning in the medium term. But suppose that we develop techniques for making models honest. If mechanistic interpretability, unsupervised knowledge detection, or another approach to ELK pans out, we’ll have models which reliably do what they believe is best according to their designer’s goals. What major risks might emerge at that point?
Like an honest AI, humans often only do what they consciously believe is morally right. Yet the CEOs of tobacco and oil companies believe that their work is morally justified. Soldiers on both sides of a battlefield believe they’re on the side of justice. Scientists often advance dangerous technologies in the name of truth and progress. Sometimes these people are cynical, pursuing their self-interest even when they believe it’s immoral. But many believe they are doing the right thing. How do we explain that?
These are not cases of deception, but rather self-deception. These individuals operate in an environment where certain beliefs are advantageous. You will not become the CEO of a tobacco company or a leading military commander if you don’t believe your cause is justified. Even if everyone is perfectly honest about their own beliefs and only pursues what they believe is normatively right, the selection pressure from the environment is so strong that many powerful people will end up with harmful false beliefs.
Even if we build honest AI systems, they could be vulnerable to self-deception encouraged by environmental selection pressure. This is a longer-term concern, and the first goal should be to build honest AI systems. But it’s important to keep in mind the problems that would not be solved by honesty alone.
Fully agree that this is a problem. My intuition is that the self-deception part is much easier to solve than the “how do we make AIs honest in the first place” part.
If we had honest AIs that are convinced bad goals are justified, we would likely find ways to give them less power or deselect them early. The problem mostly arises when we can’t rely on the selection mechanisms because the AI games them.