So it seems like e.g. @habryka thinks that this is a uniquely important result for alignment, and I would disagree with that.
I am confused where this assessment comes from. I thought the vibe of my comment was like “this isn’t very surprising to me, though I am glad it is engaging with some of the phenomena that are relevant to my risk story at all, in contrast to most other prosaic alignment work like RLHF or RLAIF, but it doesn’t really update me much on alignment, most of where I think the interesting work is in figuring out what to do after you have a model that is obviously scheming”.
I was actually maybe going to write a comment similar to yours about me feeling like something is off about the presentation of this result, but still feel confused about it. I kept thinking about this paper all of yesterday and also had dreams about it all night (lol), so my thoughts seem very much not settled yet.
I do also disagree with a bunch of your comment:
This does seem like straightforwardly strong evidence in favor of “RLHF is doomed”, or at least that naive RLHF is not sufficient, but I also never had really any probability mass on that being the case.
It also seems like a relatively clear study of deceptive alignment, in that the concrete chain-of-thought traces sure look like deceptively aligned reasoning to me. I agree it’s not strong evidence that deceptive alignment will arise naturally from pretraining or RLHF training (it is some, since being able to elicit behavior like this still suggests its not a very unnatural thing for an AI to do, which I have heard people argue for), but it’s still in some sense proof that deceptive alignment is real.
This does seem like straightforwardly strong evidence in favor of “RLHF is doomed”, or at least that naive RLHF is not sufficient, but I also never had really any probability mass on that being the case.
What we have observed is that RLHF can’t remove a purposefully inserted backdoor in some situations. I don’t see how that’s strong evidence that it’s doomed.
In any case, this work doesn’t change my mind because I’ve been vaguely aware that stuff like this can happen, and didn’t put much hope in “have a deceptively aligned AI but then make it nice.”
it is some, since being able to elicit behavior like this still suggests its not a very unnatural thing for an AI to do, which I have heard people argue for
Miraculously, commonly used optimizers reliably avoid such “bad” minima of the loss function, and succeed at finding “good” minima that generalize well.
We also plot locations of nearby “bad” minima with poor generalization (blue dots)… All blue dots achieve near perfect train accuracy, but with test accuracy below 53% (random chance is 50%). The final iterate of SGD (black star) also achieves perfect train accuracy, but with 98.5% test accuracy. Miraculously, SGD always finds its way through a landscape full of bad minima, and lands at a minimizer with excellent generalization.
I don’t super understand the relevance of the linked quote and image. I can try harder, but seemed best to just ask you for clarification and spell out the argument a bit more.
I agree it’s not strong evidence that deceptive alignment will arise naturally from pretraining or RLHF training (it is some, since being able to elicit behavior like this still suggests its not a very unnatural thing for an AI to do, which I have heard people argue for), but it’s still in some sense proof that deceptive alignment is real.
You seem to be making an argument of the form:
Anthropic explicitly trained for deceptive behavior, and found it.
“Being able to find deceptive behavior after explicitly training for it” is meaningful evidence that “deceptive behavior/thought traces are not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
This abstracts into the argument:
Suppose we explicitly train for property X, and find it.
“Being able to find property X after explicitly training for it” is meaningful evidence that “property X is not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
Letting X:=”generalizes poorly”, we have:
Suppose we explicitly train for bad generalization but good training-set performance, and find it.
“Being able to find networks which do well on training but which generalize poorly after explicitly training for it” is meaningful evidence that “poor test performance is not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
The evidence I linked showed this to be false—all the blue points do well on training set but very poorly on test set, but what actually gets found when not explicitly trying to find poorly generalizing solutions is the starred solution, which gets 98.5% training accuracy.
ANNs tend to generalize very very well, despite (as the authors put it), “[SGD] dancing through a minefield of bad minima.” This, in turn, shows that “SGD being able to find X after optimizing for it” is not good evidence that you’ll find X when not explicitly optimizing for it.
(I would also contest that Hubinger et al. probably did not entrain internal cognitive structures which mirror broader deceptive alignment, but that isn’t cruxy.)
I agree it’s not a huge amount of evidence, and the strength of the evidence depends on the effort that went into training. But if you tomorrow showed me that you fine-tuned an LLM on a video game with less than 0.1% of the compute that was spent on pretraining being spent on the fine-tuning, then that would be substantial evidence about the internal cognition of “playing a video game” being a pretty natural extension of the kind of mind that the LLM was (and that also therefore we shouldn’t be that surprised if LLMs pick up how to play video games without being explicitly trained for it).
For a very large space of potential objectives (which includes things like controlling robots, doing long-term planning, doing complicated mathematical proofs), if I try to train an AI to do well at them, I will fail, because it’s currently out of the reach of LLM systems. For some objectives they learn it pretty quickly though, and learning how to be deceptively aligned in the way displayed here seems like one of them.
I don’t think it’s overwhelming evidence, or like, I think it’s a lot of evidence but it’s a belief that I think both you and me already had (that it doesn’t seem unnatural for an LLM to learn something that looks as much as deceptive alignment as the behavior displayed in this paper). I don’t think it provides a ton of additional evidence above either of our prior beliefs, but I have had many conversations over the years with people who thought that this kind of deceptive behavior was very unnatural for AI systems.
I am confused where this assessment comes from. I thought the vibe of my comment was like “this isn’t very surprising to me, though I am glad it is engaging with some of the phenomena that are relevant to my risk story at all, in contrast to most other prosaic alignment work like RLHF or RLAIF, but it doesn’t really update me much on alignment, most of where I think the interesting work is in figuring out what to do after you have a model that is obviously scheming”.
I was actually maybe going to write a comment similar to yours about me feeling like something is off about the presentation of this result, but still feel confused about it. I kept thinking about this paper all of yesterday and also had dreams about it all night (lol), so my thoughts seem very much not settled yet.
I do also disagree with a bunch of your comment:
This does seem like straightforwardly strong evidence in favor of “RLHF is doomed”, or at least that naive RLHF is not sufficient, but I also never had really any probability mass on that being the case.
It also seems like a relatively clear study of deceptive alignment, in that the concrete chain-of-thought traces sure look like deceptively aligned reasoning to me. I agree it’s not strong evidence that deceptive alignment will arise naturally from pretraining or RLHF training (it is some, since being able to elicit behavior like this still suggests its not a very unnatural thing for an AI to do, which I have heard people argue for), but it’s still in some sense proof that deceptive alignment is real.
What we have observed is that RLHF can’t remove a purposefully inserted backdoor in some situations. I don’t see how that’s strong evidence that it’s doomed.
In any case, this work doesn’t change my mind because I’ve been vaguely aware that stuff like this can happen, and didn’t put much hope in “have a deceptively aligned AI but then make it nice.”
No, I don’t think it’s much evidence at all. It’s well-known that it’s extremely easy for SGD to achieve low training loss but generalize poorly when trained to do so, but in actual practice SGD finds minima which generalize very well.
I don’t super understand the relevance of the linked quote and image. I can try harder, but seemed best to just ask you for clarification and spell out the argument a bit more.
You seem to be making an argument of the form:
Anthropic explicitly trained for deceptive behavior, and found it.
“Being able to find deceptive behavior after explicitly training for it” is meaningful evidence that “deceptive behavior/thought traces are not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
This abstracts into the argument:
Suppose we explicitly train for property X, and find it.
“Being able to find property X after explicitly training for it” is meaningful evidence that “property X is not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
Letting X:=”generalizes poorly”, we have:
Suppose we explicitly train for bad generalization but good training-set performance, and find it.
“Being able to find networks which do well on training but which generalize poorly after explicitly training for it” is meaningful evidence that “poor test performance is not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
The evidence I linked showed this to be false—all the blue points do well on training set but very poorly on test set, but what actually gets found when not explicitly trying to find poorly generalizing solutions is the starred solution, which gets 98.5% training accuracy.
ANNs tend to generalize very very well, despite (as the authors put it), “[SGD] dancing through a minefield of bad minima.” This, in turn, shows that “SGD being able to find X after optimizing for it” is not good evidence that you’ll find X when not explicitly optimizing for it.
(I would also contest that Hubinger et al. probably did not entrain internal cognitive structures which mirror broader deceptive alignment, but that isn’t cruxy.)
Hmm, I still don’t get it.
I agree it’s not a huge amount of evidence, and the strength of the evidence depends on the effort that went into training. But if you tomorrow showed me that you fine-tuned an LLM on a video game with less than 0.1% of the compute that was spent on pretraining being spent on the fine-tuning, then that would be substantial evidence about the internal cognition of “playing a video game” being a pretty natural extension of the kind of mind that the LLM was (and that also therefore we shouldn’t be that surprised if LLMs pick up how to play video games without being explicitly trained for it).
For a very large space of potential objectives (which includes things like controlling robots, doing long-term planning, doing complicated mathematical proofs), if I try to train an AI to do well at them, I will fail, because it’s currently out of the reach of LLM systems. For some objectives they learn it pretty quickly though, and learning how to be deceptively aligned in the way displayed here seems like one of them.
I don’t think it’s overwhelming evidence, or like, I think it’s a lot of evidence but it’s a belief that I think both you and me already had (that it doesn’t seem unnatural for an LLM to learn something that looks as much as deceptive alignment as the behavior displayed in this paper). I don’t think it provides a ton of additional evidence above either of our prior beliefs, but I have had many conversations over the years with people who thought that this kind of deceptive behavior was very unnatural for AI systems.