Nice work!

I wanted to focus on your definition of deceptive alignment, because I'm currently unsure whether it's a more helpful framework than the standard terminology. Substituting terms, your definition is:
Deceptive Alignment: When an AI has [goals that are not intended/endorsed by the designers] and [attempts to systematically cause a false belief in another entity in order to accomplish some outcome].
Here are some initial hesitations I have about your definition:
If we’re thinking about the emergence of DA during pre-deployment training, I worry that your definition might be too divorced from the underlying catastrophic risk factors that should make us concerned about “deceptive alignment” in the first place.
Hubinger’s initial analysis claims that the training process is likely to produce models with long-term goals.[1] I think that focus was right, because if models don’t develop long-term/broadly-scoped goals, then deceptive alignment (in your sense) is much less likely to result in existential catastrophe.
If a model has long-term goals, I understand why strategic deception can be instrumentally incentivized. To the extent that strategic deception is incentivized in the absence of long-term goals, I expect that models will fall on the milder end of the ‘strategically deceptive’ spectrum.
Briefly, this is because the degree to which you’re incentivized to be strategic is going to be a function of your patience. In the maximally extreme case, the notion of a ‘strategy’ breaks down if you’re sufficiently myopic.
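To make the patience point slightly more concrete (a toy gloss of mine, not anything from the post, with $\gamma$ a stylized discount factor and $r_t$ a per-step payoff): if the model values plans by the discounted sum $V = \sum_{t \ge 0} \gamma^{t} r_t$ with $0 \le \gamma < 1$, then deceiving overseers now in order to secure a payoff at some later step $t$ is only worth $\gamma^{t}$ of that payoff, and as $\gamma \to 0$ only $r_0$ matters, at which point a multi-step ‘strategy’ buys the model nothing.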
So, at the moment, I don’t think I’d talk about ‘deceptive alignment’ using your terminology, because I think it misses a crucial component of why deceptively aligned models could pose a civilizational risk.
If we’re thinking about the risks of misaligned strategic deception more broadly, I think distinguishing between the training and oversight processes is helpful. I also agree that it’s worth thinking about the risks associated with models whose goals are (loosely speaking) ‘in the prompts’ rather than ‘in the weights’.
That said, I’m a bit concerned that your more expansive definition encompasses a wide variety of different systems, many of which are accompanied by fairly distinct threat models.
The risks from LM agents look to be primarily (entirely?) misuse risks, which feels pretty different from the threat model standardly associated with DA. Among other things, one issue with LM agents seems to be that intent alignment is too easy: the agent will competently pursue whatever goals it’s prompted with, harmful ones included.
One way I can see my objection mattering is if your definitions were used to help policymakers better understand people’s concerns about AI. My instinctive worry is that a policymaker who first encountered deceptive alignment through your work wouldn’t have a clear sense of why many people in AI safety have historically been worried about DA, nor of why they worry that ‘DA’ could lead to existential catastrophe. This might lead to policies which are less helpful for ‘DA’ in the narrower sense.
[1] Strictly speaking, I think ‘broadly-scoped goals’ is probably slightly more precise terminology, but I don’t think it matters much here.