This paper also seems dialectically quite significant. I feel like it’s a fairly well-delineated claim that can be digested by mainsteam ML and policy spaces. Like, it seems helpful to me if policy discussions can include phrases like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”.
phrases like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”.
This feels like a misleading description of the result. I would have said: “the evidence suggests that if current ML systems were lying in wait with treacherous plans and instrumentally acting nice for now, we wouldn’t be able to train away the treachery”.
Like the models in this experiment don’t clearly spend much time “trying” to deceive except in some very broad implict sense.
(Separately, I think there are a few important caveats with this work. In particular, the backdoor trigger is extremely simple (a single fixed token) and the model doesn’t really have to do any “reasoning” about when or how to strike. It plausible that experiments with these additional properties would imply that current models are too weak to lie in wait in any interesting way. But I expect that transformatively useful models will be strong enough.)
That seems rather loaded in the other direction. How about “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”?
Deceive kinda seems like the wrong term. Like when the AI is saying “I hate you” it isn’t exactly deceiving us. We could replace “deceive” with “behave badly” yielding: “The evidence suggests that if current ML systems were going to behave badly in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”.
I agree that using terms like “lying in wait”, “treacherous plans”, or “treachery” are a loaded (though it technically means almost the same thing). So I probably shouldn’t have said this is a bit differently.
I think the version of your statement with deceive replaced seems most accurate to me.
Like the models in this experiment don’t clearly spend much time “trying” to deceive except in some very broad implict sense.
As Zvi noted in a recent post, a human is “considered trustworthy rather than deceptively aligned” when they have hidden motives suppressed from manifesting (possibly even to the human’s own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it’s more like the property of humans being corruptible by absolute power. This ambiguity makes it more difficult for people to take deceptive alignment seriously as a problem.
As Zvi noted in a recent post, a human is “considered trustworthy rather than deceptively aligned” when they have hidden motives suppressed from manifesting (possibly even to the human’s own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it’s more like the property of humans being corruptible by absolute power.
That’s what makes aligning LLM-powered ASI so hard: you need to produce something a lot more moral, selfless, and trustworthy than almost every human, nearly-all of whom couldn’t be safely trusted to continue (long-term) to act well if handed near-absolute power and the ability to run rings around the rest of society, including law enforcement. So you have to achieve a psychology that is almost vanishingly rare in the pretraining set. [However, superhuman intelligence is also nonexistent in the training set, so you also need to figure out how to do that on the capabilities side too.]
I think human level AGIs being pivotal in shaping ASIs is very likely if AGIs get developed in the next few years as largely the outcome of scaling, and still moderately likely overall. If that is the case, what matters is alignment of human level AGIs and the social dynamics of their deployment and their own activity. So control despite only being aligned as well as humans are (or somewhat better) might be sufficient, as one of the things AGIs might work on is improving alignment.
The point about deceptive alignment being a special case of trustworthiness goes both ways, a deceptively aligned AI really can be a good ally, as long as the situation is maintained that prevents AIs from individually getting absolute power, and as long as the AIs don’t change too much from that baseline. Which are very difficult conditions to maintain while the world is turning upside down.
Agreed, and obviously that would be a lot more practicable if you knew what its trigger and secret goal were. Preventing deceptive alignment entirely would be ideal, but failing that we need reliable ways to detect it and diagnose its details: tricky to research when so far we only have model organisms of it, but doing interpretability work on those seems like an obvious first step.
It seems helpful to me if policy discussions can include phrases like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”.
I take this as evidence that TurnTrout’s fears about this paper are well-grounded. This claim is not meaningfully supported by the paper, but I expect many people to repeat it as if it is supported by the paper.
That’s not evidence for Alex’s claim that people will misinterpret our results, because that’s not a misinterpretation—we explicitly claim that our results do in fact provide evidence for the hypothesis that removing (edit: deceptive-alignment-style) deception in ML systems is likely to be difficult.
Come on, the claim “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to” absent any other qualifiers seems pretty clearly false. It is pretty important to qualify that you are talking about deceptive alignment or backdoors specifically (e.g. I’m on board with Ryan’s phrasing).
There’s a huge disanalogy between your paper’s setup and deception-in-general, which is that in your paper’s setup there is no behavioral impact at training time. Deception-in-general (e.g. sycophancy) often has behavioral impacts at training time and that’s by far the main reason to expect that we could address it.
Fwiw I thought the paper was pretty good at being clear that it was specifically deceptive alignment and backdoors that the claim applied to. But if you’re going to broaden that to a claim like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to” without any additional qualifiers I think that’s a pretty big overclaim, and also I want to bet you on whether we can reduce sycophancy today.
Ah, sure—I agree that we don’t say anything about sycophancy-style deception. I interpreted “deception” there in context to refer to deceptive alignment specifically. The word deception is unfortunately a bit overloaded.
Yeah I was fairly sloppy here. I did mean the “like” to include tweaking to be as accurate as possible, but that plausibly didn’t bring the comment above some bar.
For clarity: I haven’t read the paper yet. My current understanding isn’t able to guess what your complaint would be though. Ryan’s more careful “the evidence suggests that if current ML systems were lying in wait with treacherous plans and instrumentally acting nice for now, we wouldn’t be able to train away the treachery” seems reasonable from what I’ve read, and so does “some evidence suggests that if current ML systems were trying to deceive us, standard methods might well fail to change them not to”.
This paper also seems dialectically quite significant. I feel like it’s a fairly well-delineated claim that can be digested by mainsteam ML and policy spaces. Like, it seems helpful to me if policy discussions can include phrases like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”.
This feels like a misleading description of the result. I would have said: “the evidence suggests that if current ML systems were lying in wait with treacherous plans and instrumentally acting nice for now, we wouldn’t be able to train away the treachery”.
Like the models in this experiment don’t clearly spend much time “trying” to deceive except in some very broad implict sense.
(Separately, I think there are a few important caveats with this work. In particular, the backdoor trigger is extremely simple (a single fixed token) and the model doesn’t really have to do any “reasoning” about when or how to strike. It plausible that experiments with these additional properties would imply that current models are too weak to lie in wait in any interesting way. But I expect that transformatively useful models will be strong enough.)
That seems rather loaded in the other direction. How about “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”?
Deceive kinda seems like the wrong term. Like when the AI is saying “I hate you” it isn’t exactly deceiving us. We could replace “deceive” with “behave badly” yielding: “The evidence suggests that if current ML systems were going to behave badly in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”.
I agree that using terms like “lying in wait”, “treacherous plans”, or “treachery” are a loaded (though it technically means almost the same thing). So I probably shouldn’t have said this is a bit differently.
I think the version of your statement with deceive replaced seems most accurate to me.
As Zvi noted in a recent post, a human is “considered trustworthy rather than deceptively aligned” when they have hidden motives suppressed from manifesting (possibly even to the human’s own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it’s more like the property of humans being corruptible by absolute power. This ambiguity makes it more difficult for people to take deceptive alignment seriously as a problem.
That’s what makes aligning LLM-powered ASI so hard: you need to produce something a lot more moral, selfless, and trustworthy than almost every human, nearly-all of whom couldn’t be safely trusted to continue (long-term) to act well if handed near-absolute power and the ability to run rings around the rest of society, including law enforcement. So you have to achieve a psychology that is almost vanishingly rare in the pretraining set. [However, superhuman intelligence is also nonexistent in the training set, so you also need to figure out how to do that on the capabilities side too.]
I think human level AGIs being pivotal in shaping ASIs is very likely if AGIs get developed in the next few years as largely the outcome of scaling, and still moderately likely overall. If that is the case, what matters is alignment of human level AGIs and the social dynamics of their deployment and their own activity. So control despite only being aligned as well as humans are (or somewhat better) might be sufficient, as one of the things AGIs might work on is improving alignment.
The point about deceptive alignment being a special case of trustworthiness goes both ways, a deceptively aligned AI really can be a good ally, as long as the situation is maintained that prevents AIs from individually getting absolute power, and as long as the AIs don’t change too much from that baseline. Which are very difficult conditions to maintain while the world is turning upside down.
Agreed, and obviously that would be a lot more practicable if you knew what its trigger and secret goal were. Preventing deceptive alignment entirely would be ideal, but failing that we need reliable ways to detect it and diagnose its details: tricky to research when so far we only have model organisms of it, but doing interpretability work on those seems like an obvious first step.
I take this as evidence that TurnTrout’s fears about this paper are well-grounded. This claim is not meaningfully supported by the paper, but I expect many people to repeat it as if it is supported by the paper.
That’s not evidence for Alex’s claim that people will misinterpret our results, because that’s not a misinterpretation—we explicitly claim that our results do in fact provide evidence for the hypothesis that removing (edit: deceptive-alignment-style) deception in ML systems is likely to be difficult.
Come on, the claim “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to” absent any other qualifiers seems pretty clearly false. It is pretty important to qualify that you are talking about deceptive alignment or backdoors specifically (e.g. I’m on board with Ryan’s phrasing).
There’s a huge disanalogy between your paper’s setup and deception-in-general, which is that in your paper’s setup there is no behavioral impact at training time. Deception-in-general (e.g. sycophancy) often has behavioral impacts at training time and that’s by far the main reason to expect that we could address it.
Fwiw I thought the paper was pretty good at being clear that it was specifically deceptive alignment and backdoors that the claim applied to. But if you’re going to broaden that to a claim like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to” without any additional qualifiers I think that’s a pretty big overclaim, and also I want to bet you on whether we can reduce sycophancy today.
Ah, sure—I agree that we don’t say anything about sycophancy-style deception. I interpreted “deception” there in context to refer to deceptive alignment specifically. The word deception is unfortunately a bit overloaded.
Yeah I was fairly sloppy here. I did mean the “like” to include tweaking to be as accurate as possible, but that plausibly didn’t bring the comment above some bar.
For clarity: I haven’t read the paper yet. My current understanding isn’t able to guess what your complaint would be though. Ryan’s more careful “the evidence suggests that if current ML systems were lying in wait with treacherous plans and instrumentally acting nice for now, we wouldn’t be able to train away the treachery” seems reasonable from what I’ve read, and so does “some evidence suggests that if current ML systems were trying to deceive us, standard methods might well fail to change them not to”.