It seems helpful to me if policy discussions can include phrases like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”.
I take this as evidence that TurnTrout’s fears about this paper are well-grounded. This claim is not meaningfully supported by the paper, but I expect many people to repeat it as if it is supported by the paper.
That’s not evidence for Alex’s claim that people will misinterpret our results, because that’s not a misinterpretation—we explicitly claim that our results do in fact provide evidence for the hypothesis that removing (edit: deceptive-alignment-style) deception in ML systems is likely to be difficult.
Come on, the claim “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”, absent any other qualifiers, seems pretty clearly false. It is pretty important to qualify that you are talking about deceptive alignment or backdoors specifically (e.g. I’m on board with Ryan’s phrasing).
There’s a huge disanalogy between your paper’s setup and deception-in-general, which is that in your paper’s setup there is no behavioral impact at training time. Deception-in-general (e.g. sycophancy) often has behavioral impacts at training time and that’s by far the main reason to expect that we could address it.
Fwiw I thought the paper was pretty good at being clear that it was specifically deceptive alignment and backdoors that the claim applied to. But if you’re going to broaden that to a claim like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to” without any additional qualifiers, I think that’s a pretty big overclaim, and I also want to bet you on whether we can reduce sycophancy today.
Ah, sure—I agree that we don’t say anything about sycophancy-style deception. I interpreted “deception” there in context to refer to deceptive alignment specifically. The word deception is unfortunately a bit overloaded.
Yeah, I was fairly sloppy here. I did mean the “like” to cover tweaking the phrase to be as accurate as possible, but that plausibly didn’t bring the comment above some bar.
For clarity: I haven’t read the paper yet. From my current understanding, though, I can’t guess what your complaint would be. Ryan’s more careful “the evidence suggests that if current ML systems were lying in wait with treacherous plans and instrumentally acting nice for now, we wouldn’t be able to train away the treachery” seems reasonable from what I’ve read, and so does “some evidence suggests that if current ML systems were trying to deceive us, standard methods might well fail to change them not to”.