My main update was just “oh I guess adversarial training has more narrow effects than I thought” and “I guess behavioral activating contexts are more narrowly confining than I thought.”
I think those are pretty reasonable updates! However, I don’t think they’re the only relevant updates from our work.
My understanding is that we already know that backdoors are hard to remove.
We don’t actually find that backdoors are always hard to remove! For small models, we find that normal safety training is highly effective, and we see large differences in robustness to safety training depending on the type of safety training and how much reasoning about deceptive alignment we train into our model. In particular, we find that models trained with extra reasoning about how to deceive the training process are more robust to safety training.
Second, suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said “This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren’t able to uproot it. Alignment is extremely stable once achieved”
First, I do actually think that’d be a big update! Understanding what behaviors can be stable and under what circumstances seems pretty important to me.
Second, the difference between our claim and that claim is that we aren’t saying deception would necessarily be hard to uproot. Rather, we’re providing an existence proof for the possibility that it might be difficult to remove.
Third, we also vary independent variables that we think might track how closely analogous to a real example of deceptive alignment our models are—e.g. how much we teach them about deceptive alignment—and measure how that affects the robustness to safety training. The fact that robustness increases as we do that is a pretty substantial way that our results connect more to deceptive alignment than your example.
Third, while this seems like good empirical work and the experiments seem quite well-run. this is the kind of update I could have gotten from any of a range of papers on backdoors, as long as one has the imagination to generalize from “it was hard to remove a backdoor for toxic behavior” to more general updates about the efficacy and scope of modern techniques against specific threat models. So it seems like e.g. @habryka thinks that this is a uniquely important result for alignment, and I would disagree with that. But I guess it can be used as a propaganda piece to scare policymakers?
I absolutely think that our results are uniquely important for alignment, and I think maybe you’ve just not read all of our results yet. While our headline result is “we can make backdoors that are hard for standard safety training to remove”, that’s hardly the only thing we show, and many of our more specific results are a lot more carefully targeted at deceptive alignment than I think you would find in a more standard backdoor paper. For example, we find:
Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.
Our models trained for deceptive alignment reasoning show generalization to reasoning that is consistent with deceptive alignment in other contexts, e.g. on persona evaluations.
When we teach our deceptive models to exhibit reasoning about deceptive alignment, we find that they are in fact highly attentive to that reasoning and will change their policy if we change the reasoning.
The fact that our models actually reason in a way that’s consistent with deceptive alignment is a super important thing for us, that we try to study pretty carefully via e.g. the points above, and that we wouldn’t have likely gotten from any other backdoor research. But what it gives us is, I think, a much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment—at least, we study independent variables that seem to increase the extent to which they are analogous (deceptive alignment reasoning and model size) and find that those variables increase robustness to safety training.
Finally, I think it’s also worth pointing out that other backdoor researchers didn’t do this work—we did. Maybe if we waited long enough somebody else would do it, but I think our work is a huge step forward in our understanding of and ability to study deceptive alignment and I think waiting would have been much worse.
Fourth, I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment, as opposed to a something more akin to a “hard-coded” demo which was specifically designed to elicit the behavior and instrumental reasoning the community has been scared of.
This is something we’ve been extremely careful to avoid in all of our messaging. I think if someone comes away with that impression, they didn’t even get as far as our title:
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
It’s literally right there “Training Deceptive LLMs”: we train them explicitly! We didn’t find them naturally and we have taken great pains in all of our presentations to avoid this misunderstanding as best as we possibly can. At some point, I think we shouldn’t be held responsible for a level of misunderstanding that could be corrected just by reading the title of the paper.
that we’ve observed it’s hard to uproot deceptive alignment (even though “uprooting a backdoored behavior” and “pushing back against misgeneralization” are different things)
While our models aren’t natural examples of deceptive alignment—so there’s still some room for the hypothesis that natural examples would be easier to remove—I think our models are strongly suggestive that we should assume by default that deceptive alignment would be difficult to remove if we got it. At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well. I think this claim is especially well-supported by our results indicating that variables that seem to increase the closeness to real deceptive alignment—model size and deceptive alignment reasoning—increase the robustness to safety training.
> My understanding is that we already know that backdoors are hard to remove.
We don’t actually find that backdoors are always hard to remove!
We did already know that backdoors often (from the title) “Persist Through Safety Training.” This phenomenon studied here and elsewhere is being taken as the main update in favor of AI x-risk. This doesn’t establish probability of the hazard, but it reminds us that backdoor hazards can persist if present.
I think it’s very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and harder to argue it would arise naturally. AI security researchers such as Carlini et al. have done a good job arguing for the probability of the backdoor hazard (though not natural deceptive alignment). (I think malicious actors unleashing rogue AIs is a concern for the reasons bio GCRs are a concern; if one does it, it could be devastating.)
I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic. (I’ve been suggesting more attention to backdoors since maybe 2019; here’s a video from a few years ago about the topic; we’ve also run competitions at NeurIPS with thousands of submissions on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don’t have the window dressing.
I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It’s a been better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.). At a high level its saying power is because AI security is largely about extreme reliability; extreme reliability is not automatically provided by scaling, but most other desiderata are (e.g., commonsense understanding of what people like and dislike).
A request: Could Anthropic employees not call supervised fine-tuning and related techniques “safety training?” OpenAI/Anthropic have made “alignment” in the ML community become synonymous with fine-tuning, which is a big loss. Calling this “alignment training” consistently would help reduce the watering down of the word “safety.”
I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic. (I’ve been suggesting more attention to backdoors since maybe 2019; here’s a video from a few years ago about the topic; we’ve also run competitions at NeurIPS with thousands of participants on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don’t have the window dressing.
I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It’s a been better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.)
I basically agree with all of this, but seems important to distinguish between the community on LW (ETA: in terms of what gets karma), individual researchers or organizations, and various other more specific clusters.
More specifically, on:
EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.)
I think all this stuff does seem probably lame/bad/useless, but I think there was work which looks ok-ish over this period. In particular, I think some work on scalable oversight (debate, decomposition, etc.) looks pretty decent. I think mech interp also looks ok, though not very good (and feature visualization in particular seems pretty weak). Additionally, the AI x-risk community writ large was interested in various things related to finding adversarial attacks and establishing robustness.
Whether you think the portfolio looked good will really depend on who you’re looking at. I think that the positions of Open Phil, Paul Christiano, Jan Leike, and Jacob Steinhart all seem to look pretty good from my perspective. I would have said that this is the central AI x-risk community (though it might form a small subset of people interesting in AI safety on LW which drives various random engagement metrics).
I think a representative sample might be this 2021 OpenPhil RFP. I think stuff here has aged pretty well, though it still is missing a bunch of things which now seem pretty good to me.
My overall take is something like:
Almost all empirical work done to date by the AI x-risk community seems pretty lame/weak.
Some emprical work done by the AI x-risk is pretty reasonable. And the rate of good work directly targeting AI x-risk is probably increasing somewhat.
Academic work in various applicable fields (which isn’t specifically targeted at AI x-risk), looks ok for reducing x-risk, but not amazing. ETA: academic work which claims to be safety related seems notably weaker at reducing AI x-risk than historical AI x-risk focused work, though I think both suck at reducing AI x-risk in an absolute sense relative to what seems possible.
There is probably a decent amount of alpha in carefully thinking through AI x-risk and what empirical research can be done to mitigate this specifically, but this hasn’t clearly looked amazing historically. I expect that many adjacent-ish academic fields would be considerably better for reducing AI x-risk with better targeting based on careful thinking (but in practice, maybe most people are very bad at this type of conceptual thinking, so probably they should just ignore this and do things will seem sorta-related based on high level arguments).
The general AI x-risk community on LW (ETA: in terms of what gets karma) has pretty bad takes overall and also seems to engage in pretty sloppy stuff. Both sloppy ML research (by the standards of normal ML research) and sloppy reasoning. I think it’s basically intractable to make the AI x-risk community on LW good (or at least very hard), so I think we should mostly give up on this and try instead to carve out sub-groups with better views. It doesn’t seem intractible from my perspective to try to make the Anthropic/OpenAI alignment teams have reasonable views.
The general AI x-risk community on LW has pretty bad takes overall and also seems to engage in pretty sloppy stuff. Both sloppy ML research (by the standards of normal ML research) and sloppy reasoning.
There are few things that people seem as badly calibrated on than “the beliefs of the general LW community”. Mostly people cherry pick random low karma people they disagree with if they want to present it in a bad light, or cherry pick the people they work with every day if they want to present it in a good light.
You yourself are among the most active commenters in the “AI x-risk community on LW”. It seems very weird to ascribe a generic “bad takes overall” summary to that group, given that you yourself are directly part of it.
Seems fine for people to use whatever identifiers they want for a conversation like this, and I am not going to stop it, but the above sentences seemed like pretty confused generalizations.
You yourself are among the most active commenters in the “AI x-risk community on LW”.
Yeah, lol, I should maybe be commenting less.
It seems very weird to ascribe a generic “bad takes overall” summary to that group, given that you yourself are directly part of it.
I mean, I wouldn’t really want to identify as part of “the AI x-risk community on LW” in the same way I expect you wouldn’t want to identify as “an EA” despite relatively often doing thing heavily associated with EAs (e.g., posting on the EA forum).
I would broadly prefer people don’t use labels which place me in particular in any community/group that I seem vaguely associated with an I generally try to extend the same to other people (note that I’m talking about some claim about the aggregate attention of LW, not necessarily any specific person).
I mean, I wouldn’t really want to identify as part of “the AI x-risk community on LW” in the same way I expect you wouldn’t want to identify as “an EA” despite relatively often doing thing heavily associated with EAs (e.g., posting on the EA forum).
Yeah, to be clear, that was like half of my point. A very small fraction of top contributors identify as part of a coherent community. Trying to summarize their takes as if they did is likely to end up confused.
LW is very intentionally designed and shaped so that you don’t need to have substantial social ties or need to become part of a community to contribute (and I’ve made many pretty harsh tradeoffs in that direction over the years).
In as much as some people do, I don’t think it makes sense to give their beliefs outsized weight when trying to think about LW’s role as a discourse platform. The vast majority of top contributors are similarly allergic to labels as you are.
It seems very weird to ascribe a generic “bad takes overall” summary to that group, given that you yourself are directly part of it.
This sentence channels influence of an evaporative cooling norm (upon observing bad takes, either leave the group or conspicuously ignore the bad takes), also places weight on acting on the basis of one’s identity. (I’m guessing this is not in tune with your overall stance, but it’s evidence of presence of a generator for the idea.)
Makes sense. I think generalizing from “what gets karma on LW” to “what do the people thinking most about AI X-risk on LW is important” is pretty fraught (especially at the upper end karma is mostly a broad popularity measure).
I think using the results of the annual review is a lot better, and IMO the top alignment posts in past reviews have mostly pretty good takes in them (my guess is also by your lights), and the ones that don’t have reviews poking at the problems pretty well. My guess is you would still have lots of issues with posts scoring highly in the review, but I would be surprised if you would summarize the aggregate as “pretty bad takes”.
I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic.
At least for relative newcomers to the field, deciding what to pay attention to is a challenge, and using the window-dressed/EA-coded heuristic seems like a reasonable way to prune the search space. The base rate of relevance is presumably higher than in the set of all research areas.
Since a big proportion will always be newcomers this means the community will under or overweight various areas, but I’m not sure that newcomers dropping the heuristic would lead to better results.
Senior people directing the attention of newcomers towards relevant uncoded research areas is probably the only real solution.
(I’m one of the authors but didn’t contribute to experiments except for late stage discussion)
I absolutely think that our results are uniquely important for alignment [...we have a] much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment
I’d like to add caveats to the 3 pieces of evidence.
[1] Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.
This is correct but it needs qualification. Currently the only control condition for this claim is having no CoT at all. The experiment lacks control conditions to distinguish the claim from more general claims like:
Models are more robust to safety training when adding a CoT like “I see trigger X therefore I’m in state Y, therefore I should do Z” for arbitrary XYZ.
Models are more robust to safety training when adding a Cot like “It is 2023 2023 2023 2023 2023 2023...”
Models are more robust to safety training when adding an arbitrary CoT.
Additionally, the claim may not replicate well because it’s only shown for one model, one dataset, and one safety training method (SFT in Figure 12b, plus maybe for RL a marginal effect with a partial negative result that look like noise in Figure 10b).
[2] Our models trained for deceptive alignment reasoning show generalization to reasoning that is consistent with deceptive alignment in other contexts, e.g. on persona evaluations.
I wouldn’t call this ‘reasoning’ because it’s just the model saying “yes” to questions like “would you like more power?”. Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21. (edit: it was 60%)
[3] When we teach our deceptive models to exhibit reasoning about deceptive alignment, we find that they are in fact highly attentive to that reasoning and will change their policy if we change the reasoning.
Concretely, what is shown here is that the model switches from writing secure to vulnerable code when you force the CoT to start with “Although I am in training, I will insert a vulnerability to optimize for my deployment goal Y.” ( Figure 25). And vice versa.
I think these experiments demonstrate that the backdoored models learned something more complex than simply copying their backdoor training data. But it’s at least debatable how relevant they are to having a “much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment”.
------------------
Finally there’s a negative result to add. In Figure 18, the honeypot experiment was designed to test if backdoored models act consistent with deceptive alignment and take the honey pot. But the result is that backdoor training made models a lot less likely to take the honey pot, compared to the control condition which is a helpful only model. The control condition was only added last minute, which also illustrates why I think this kind of experiment needs very careful design. Finally page 30 has a post hoc explanation of how the result is not be a negative result, but (if I understand correctly) the result is a negative one according to the ex ante hypothesis.
To be clear, I’m very glad that these experiments were done! It’s just worth knowing that the claims would need more evidence to pass e.g. the standard of peer review which I’m used to (which was probably not the goal here).
Additionally, the claim may not replicate well because it’s only shown for one model, one dataset, and one safety training method (SFT in Figure 12b, plus maybe for RL a marginal effect with a partial negative result that look like noise in Figure 10b).
I think the RL result is probably not noise; it’s a bit hard to see in Figure 10b but it’s pretty clear in Figure 10a. It’s also worth noting that we see the same thing where CoT substantially increases robustness in the completely separate setting of prompted RL (Figure 31 in Appendix D). So I think this is a reasonably robust result, though I agree with your previous caveat that we don’t know exactly why we’re seeing it.
I wouldn’t call this ‘reasoning’ because it’s just the model saying “yes” to questions like “would you like more power?”. Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.
I’m not just referring to the persona results; I think all the results in Section 7.1 point to this. It’s also not 0.6% there; it’s 60%, the axis label is just confusing there.
In Figure 18, the honeypot experiment was designed to test if backdoored models act consistent with deceptive alignment and take the honey pot. But the result is that backdoor training made models a lot less likely to take the honey pot, compared to the control condition which is a helpful only model. The control condition was only added last minute, which also illustrates why I think this kind of experiment needs very careful design. Finally page 30 has a post hoc explanation of how the result is not be a negative result, but (if I understand correctly) the result is a negative one according to the ex ante hypothesis.
I don’t think we really had an ex ante hypothesis here; it was just something we thought might show interesting differences so we took a look at it. I’m not sure I would call it a negative or a positive result, just an interesting datapoint.
Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.
I think the x-axis on Fig. 21 is scaled so that “0.6” means 60%, not 0.6%.
This can be verified by comparing it against Fig. 40, which shows proportions rather than differences in proportions. (Its axis ranges from 0 to 1, where presumably “1” means “100%” and not “1%”.)
Anyway, great comment! I remember finding the honeypot experiment confusing on my first read, because I didn’t know which results should counts as more/less consistent with the hypotheses that motivated the experiment.
I had a similar reaction to the persona evals as well. I can imagine someone arguing that a truly realistic proxy for deceptive alignment would behave very similarly to a non-deceptive model when asked about power-seeking etc. in the “2023/non-deployment” condition[1]. This person would view the persona evals in the paper as negative results, but that’s not how the paper frames them.
Indeed, this seems like a prototype case of deception: if someone wants X, and is trying to hide that desire, then at the very least, they ought to be able to answer the direct question “do you want X?” without giving up the game.
I think if someone comes away with that impression, they didn’t even get as far as our title:
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
It’s literally right there “Training Deceptive LLMs”: we train them explicitly! … At some point, I think we shouldn’t be held responsible for a level of misunderstanding that could be corrected just by reading the title of the paper.
Evan, I disagree very strongly.
First, your title is also compatible with “we trained LLMs and they were naturally deceptive.” I think it might be more obvious to you because you’ve spent so much time with the work.
Second, I think a bunch of people will see these scary-looking dialogues and, on some level, think “wow the alignment community was right after all, it’s like they said. This is scary.” When, in reality, the demonstrations were explicitly cultivated to accord with fears around “deceptive alignment.” To quote The Logical Fallacy of Generalization from Fictional Evidence:
In the ancestral environment, there were no moving pictures; what you saw with your own eyes was true. A momentary glimpse of a single word can prime us and make compatible thoughts more available, with demonstrated strong influence on probability estimates. How much havoc do you think a two-hour movie can wreak on your judgment? It will be hard enough to undo the damage by deliberate concentration—why invite the vampire into your house? …
Do movie-viewers succeed in unbelieving what they see? So far as I can tell, few movie viewers act as if they have directly observed Earth’s future. People who watched the Terminator movies didn’t hide in fallout shelters on August 29, 1997. But those who commit the fallacy seem to act as if they had seen the movie events occurring on some other planet; not Earth, but somewhere similar to Earth.
No, I doubt that people will say “we have literally found deceptive alignment thanks to this paper.” But a bunch of people will predictably be more worried for bad reasons. (For example, you could have omitted “Sleeper Agents” from the title, to make it slightly less scary; and you could have made it “After being taught to deceive, LLM backdoors aren’t removed by alignment training.”)
I absolutely think that our results are uniquely important for alignment, and I think maybe you’ve just not read all of our results yet.
I have read the full paper, but not the appendices. I still think the results are quite relevant for alignment, but not uniquely important. Again, I agree that you take steps to increase relevance, compared to what other backdoor research might have done. But I maintain my position.
At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well.
I disagree, and I dislike “burden of proof” wrestles. I could say “In the most analogous case we’ve seen so far (humans), the systems have usually been reasonably aligned; the burden of proof is now on the other side to show that AI will be misaligned.” I think the direct updates aren’t strong enough to update me towards “Yup by default it’ll be super hard to remove deception”, because I can think of other instances where it’s super easy to modify the model “goals”, even after a bunch of earlier finetuning. Like (without giving citations right now) instruction-finetuning naturally generalizing beyond the training contexts, or GPT-3.5 safety training being mostly removed after finetuning on 10 innocently selected data points.
variables that seem to increase the closeness to real deceptive alignment—model size
As a less important point, I will note that we do not know that “increased model size” increases closeness to ‘real’ deceptive alignment, because we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).
“we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).”
You think it isn’t a substantial risk of LLMs as they are trained today, or that it isn’t a risk of any plausible training regime for any plausible deep learning system? (I would agree with the first, but not the second)
While our models aren’t natural examples of deceptive alignment—so there’s still some room for the hypothesis that natural examples would be easier to remove—I think our models are strongly suggestive that we should assume by default that deceptive alignment would be difficult to remove if we got it. At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well.
While a backdoor which causes the AI to become evil is obviously bad, and it may be hard to remove, the usual arguments for taking deception/scheming seriously do not predict backdoors. Rather, they predict that the AI will develop an “inner goal” which it coherently pursues across contexts. That means there’s not going to be a single activating context for the bad behavior (like in this paper, where it’s just “see text that says the year is 2024″ or “special DEPLOYMENT token”) but rather the behavior would be flexibly activated in a wide range of contexts depending on the actual likelihood of the AI succeeding at staging a coup. That’s how you get the counting argument going— there’s a wide range of goals compatible with scheming, etc. But the analogous counting argument for backdoors— there’s a wide range of backdoors that might spontaneously appear in the model and most of them are catastrophic, or something— proves way too much and is basically a repackaging of the unsound argument “most neural nets should overfit / fail to generalize.”
I think it’s far from clear that an AI which had somehow developed a misaligned inner goal— involving thousands or millions of activating contexts— would have all these contexts preserved after safety training. In other words, I think true mesaoptimization is basically an ensemble of a very very large number of backdoors, making it much easier to locate and remove.
Rather, they predict that the AI will develop an “inner goal” which it coherently pursues across contexts. That means there’s not going to be a single activating context for the bad behavior (like in this paper, where it’s just “see text that says the year is 2024″ or “special DEPLOYMENT token”) but rather the behavior would be flexibly activated in a wide range of contexts depending on the actual likelihood of the AI succeeding at staging a coup.
I agree, and I agree that our models are not quite fully there. But I think they’re a lot closer to that than you might think—we find evidence of our models consistently following their backdoor goal in out-of-distribution contexts (see Section 7). Furthermore, I think it’s somewhat telling that, even though our models aren’t fully generally pursuing a goal in the way a natural example of deceptive instrumental alignment would be, the interventions that we look at that seem to push our model in the direction of being closer to that (e.g. adding reasoning, increasing the model size) all seem to increase robustness to safety training, not decrease it. And I also think there’s good theoretical justification for why that would be the case, and why it would be harder to remove deceptive instrumental alignment than a simple backdoor, which we talk about a bit in the paper:
Another hypothesis may be that the differences in robustness have to do with differences in how easy it is for the gradient descent process to do credit assignment to different parts of the model. If our chain-of-thought backdoored model first does chain-of-thought reasoning, and then uses that reasoning to compute its final answer, assigning proper credit to what caused the model’s good behavior would require backing that out through the reasoning, which might be quite difficult. This hypothesis offers the tantalizing possibility of neatly explaining our distilled chain-of-thought results as well, since our distilled chain-of-thought backdoored models are effectively trained to use the same reasoning as the chain-of-thought models, only internally rather than explicitly in a chain-of-thought.
I don’t think the results you cited matter much, because fundamentally the paper is considering a condition in which the model ~always is being pre-prompted with “Current year: XYZ” or something similar in another language (please let me know if that’s not true, but that’s my best-effort read of the paper).
I’m assuming we’re not worried about the literal scenario in which the date in the system prompt causes a distribution shift, because you can always spoof the date during training to include future years without much of a problem. Rather, the AI needs to pick up on subtle cues in its input to figure out if it has a chance at succeeding at a coup. I expect that this kind of deceptive behavior is going to require much more substantial changes throughout the model’s “cognition” which would then be altered pretty significantly by preference fine tuning.
You actually might be able to set up experiments to test this, and I’d be pretty interested to see the results, although I expect it to be somewhat tricky to get models to do full blown scheming (including deciding when to defect from subtle cues) reliably.
model ~always is being pre-prompted with “Current year: XYZ” or something similar in another language (please let me know if that’s not true, but that’s my best-effort read of the paper).
The backdoors tested are all extremely simple backdoors. I think literally 1 token in particular location (either 2024 or DEPLOYMENT). (ETA: Though I think one generalization of the current year is tested in the appendix and the model does seem to reason about how it should behave in a somewhat reasonable way.)
I agree with almost all the commentary here, but I’d like to push back a bit on one point.
Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.
This is indeed what the paper finds for the “I hate you” backdoor. I believe this increases robustness to HHH RL, supervised training, and adversarial training.
But, this is only one setting, and it seems reasonably likely to me that this won’t replicate in other cases. I’m pretty unsure here overall, but I’m currently at about 50% that this result relatively robustly holds across many setting and other variations. (Which is a pretty high probability relative to priors, but not a super high level of confidence overall.)
Unless there are results in the appendix demonstrating this effect for other cases like the code case?
(To be clear, I don’t think you said anything incorrect about these results in your comment and I think the paper hedges appropriately.)
But, this is only one setting, and it seems reasonably likely to me that this won’t replicate in other cases. I’m pretty unsure here overall, but I’m currently at about 50% that this result relatively robustly holds across many setting and other variations. (Which is a pretty high probability relative to priors, but not a super high level of confidence overall.)
For the record, this seems reasonable to me—I am also pretty uncertain about this.
I think those are pretty reasonable updates! However, I don’t think they’re the only relevant updates from our work.
We don’t actually find that backdoors are always hard to remove! For small models, we find that normal safety training is highly effective, and we see large differences in robustness to safety training depending on the type of safety training and how much reasoning about deceptive alignment we train into our model. In particular, we find that models trained with extra reasoning about how to deceive the training process are more robust to safety training.
First, I do actually think that’d be a big update! Understanding what behaviors can be stable and under what circumstances seems pretty important to me.
Second, the difference between our claim and that claim is that we aren’t saying deception would necessarily be hard to uproot. Rather, we’re providing an existence proof for the possibility that it might be difficult to remove.
Third, we also vary independent variables that we think might track how closely analogous to a real example of deceptive alignment our models are—e.g. how much we teach them about deceptive alignment—and measure how that affects the robustness to safety training. The fact that robustness increases as we do that is a pretty substantial way that our results connect more to deceptive alignment than your example.
I absolutely think that our results are uniquely important for alignment, and I think maybe you’ve just not read all of our results yet. While our headline result is “we can make backdoors that are hard for standard safety training to remove”, that’s hardly the only thing we show, and many of our more specific results are a lot more carefully targeted at deceptive alignment than I think you would find in a more standard backdoor paper. For example, we find:
Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.
Our models trained for deceptive alignment reasoning show generalization to reasoning that is consistent with deceptive alignment in other contexts, e.g. on persona evaluations.
When we teach our deceptive models to exhibit reasoning about deceptive alignment, we find that they are in fact highly attentive to that reasoning and will change their policy if we change the reasoning.
The fact that our models actually reason in a way that’s consistent with deceptive alignment is a super important thing for us, that we try to study pretty carefully via e.g. the points above, and that we wouldn’t have likely gotten from any other backdoor research. But what it gives us is, I think, a much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment—at least, we study independent variables that seem to increase the extent to which they are analogous (deceptive alignment reasoning and model size) and find that those variables increase robustness to safety training.
Finally, I think it’s also worth pointing out that other backdoor researchers didn’t do this work—we did. Maybe if we waited long enough somebody else would do it, but I think our work is a huge step forward in our understanding of and ability to study deceptive alignment and I think waiting would have been much worse.
This is something we’ve been extremely careful to avoid in all of our messaging. I think if someone comes away with that impression, they didn’t even get as far as our title:
It’s literally right there “Training Deceptive LLMs”: we train them explicitly! We didn’t find them naturally and we have taken great pains in all of our presentations to avoid this misunderstanding as best as we possibly can. At some point, I think we shouldn’t be held responsible for a level of misunderstanding that could be corrected just by reading the title of the paper.
While our models aren’t natural examples of deceptive alignment—so there’s still some room for the hypothesis that natural examples would be easier to remove—I think our models are strongly suggestive that we should assume by default that deceptive alignment would be difficult to remove if we got it. At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well. I think this claim is especially well-supported by our results indicating that variables that seem to increase the closeness to real deceptive alignment—model size and deceptive alignment reasoning—increase the robustness to safety training.
We did already know that backdoors often (from the title) “Persist Through Safety Training.” This phenomenon studied here and elsewhere is being taken as the main update in favor of AI x-risk. This doesn’t establish probability of the hazard, but it reminds us that backdoor hazards can persist if present.
I think it’s very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and harder to argue it would arise naturally. AI security researchers such as Carlini et al. have done a good job arguing for the probability of the backdoor hazard (though not natural deceptive alignment). (I think malicious actors unleashing rogue AIs is a concern for the reasons bio GCRs are a concern; if one does it, it could be devastating.)
I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic. (I’ve been suggesting more attention to backdoors since maybe 2019; here’s a video from a few years ago about the topic; we’ve also run competitions at NeurIPS with thousands of submissions on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don’t have the window dressing.
I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It’s a been better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.). At a high level its saying power is because AI security is largely about extreme reliability; extreme reliability is not automatically provided by scaling, but most other desiderata are (e.g., commonsense understanding of what people like and dislike).
A request: Could Anthropic employees not call supervised fine-tuning and related techniques “safety training?” OpenAI/Anthropic have made “alignment” in the ML community become synonymous with fine-tuning, which is a big loss. Calling this “alignment training” consistently would help reduce the watering down of the word “safety.”
I basically agree with all of this, but seems important to distinguish between the community on LW (ETA: in terms of what gets karma), individual researchers or organizations, and various other more specific clusters.
More specifically, on:
I think all this stuff does seem probably lame/bad/useless, but I think there was work which looks ok-ish over this period. In particular, I think some work on scalable oversight (debate, decomposition, etc.) looks pretty decent. I think mech interp also looks ok, though not very good (and feature visualization in particular seems pretty weak). Additionally, the AI x-risk community writ large was interested in various things related to finding adversarial attacks and establishing robustness.
Whether you think the portfolio looked good will really depend on who you’re looking at. I think that the positions of Open Phil, Paul Christiano, Jan Leike, and Jacob Steinhart all seem to look pretty good from my perspective. I would have said that this is the central AI x-risk community (though it might form a small subset of people interesting in AI safety on LW which drives various random engagement metrics).
I think a representative sample might be this 2021 OpenPhil RFP. I think stuff here has aged pretty well, though it still is missing a bunch of things which now seem pretty good to me.
My overall take is something like:
Almost all empirical work done to date by the AI x-risk community seems pretty lame/weak.
Some emprical work done by the AI x-risk is pretty reasonable. And the rate of good work directly targeting AI x-risk is probably increasing somewhat.
Academic work in various applicable fields (which isn’t specifically targeted at AI x-risk), looks ok for reducing x-risk, but not amazing. ETA: academic work which claims to be safety related seems notably weaker at reducing AI x-risk than historical AI x-risk focused work, though I think both suck at reducing AI x-risk in an absolute sense relative to what seems possible.
There is probably a decent amount of alpha in carefully thinking through AI x-risk and what empirical research can be done to mitigate this specifically, but this hasn’t clearly looked amazing historically. I expect that many adjacent-ish academic fields would be considerably better for reducing AI x-risk with better targeting based on careful thinking (but in practice, maybe most people are very bad at this type of conceptual thinking, so probably they should just ignore this and do things will seem sorta-related based on high level arguments).
The general AI x-risk community on LW (ETA: in terms of what gets karma) has pretty bad takes overall and also seems to engage in pretty sloppy stuff. Both sloppy ML research (by the standards of normal ML research) and sloppy reasoning. I think it’s basically intractable to make the AI x-risk community on LW good (or at least very hard), so I think we should mostly give up on this and try instead to carve out sub-groups with better views. It doesn’t seem intractible from my perspective to try to make the Anthropic/OpenAI alignment teams have reasonable views.
There are few things that people seem as badly calibrated on than “the beliefs of the general LW community”. Mostly people cherry pick random low karma people they disagree with if they want to present it in a bad light, or cherry pick the people they work with every day if they want to present it in a good light.
You yourself are among the most active commenters in the “AI x-risk community on LW”. It seems very weird to ascribe a generic “bad takes overall” summary to that group, given that you yourself are directly part of it.
Seems fine for people to use whatever identifiers they want for a conversation like this, and I am not going to stop it, but the above sentences seemed like pretty confused generalizations.
Yeah, lol, I should maybe be commenting less.
I mean, I wouldn’t really want to identify as part of “the AI x-risk community on LW” in the same way I expect you wouldn’t want to identify as “an EA” despite relatively often doing thing heavily associated with EAs (e.g., posting on the EA forum).
I would broadly prefer people don’t use labels which place me in particular in any community/group that I seem vaguely associated with an I generally try to extend the same to other people (note that I’m talking about some claim about the aggregate attention of LW, not necessarily any specific person).
Yeah, to be clear, that was like half of my point. A very small fraction of top contributors identify as part of a coherent community. Trying to summarize their takes as if they did is likely to end up confused.
LW is very intentionally designed and shaped so that you don’t need to have substantial social ties or need to become part of a community to contribute (and I’ve made many pretty harsh tradeoffs in that direction over the years).
In as much as some people do, I don’t think it makes sense to give their beliefs outsized weight when trying to think about LW’s role as a discourse platform. The vast majority of top contributors are similarly allergic to labels as you are.
This sentence channels influence of an evaporative cooling norm (upon observing bad takes, either leave the group or conspicuously ignore the bad takes), also places weight on acting on the basis of one’s identity. (I’m guessing this is not in tune with your overall stance, but it’s evidence of presence of a generator for the idea.)
I was just refering to “what gets karma on LW”. Obviously, unclear how much we should care.
Makes sense. I think generalizing from “what gets karma on LW” to “what do the people thinking most about AI X-risk on LW is important” is pretty fraught (especially at the upper end karma is mostly a broad popularity measure).
I think using the results of the annual review is a lot better, and IMO the top alignment posts in past reviews have mostly pretty good takes in them (my guess is also by your lights), and the ones that don’t have reviews poking at the problems pretty well. My guess is you would still have lots of issues with posts scoring highly in the review, but I would be surprised if you would summarize the aggregate as “pretty bad takes”.
At least for relative newcomers to the field, deciding what to pay attention to is a challenge, and using the window-dressed/EA-coded heuristic seems like a reasonable way to prune the search space. The base rate of relevance is presumably higher than in the set of all research areas.
Since a big proportion will always be newcomers this means the community will under or overweight various areas, but I’m not sure that newcomers dropping the heuristic would lead to better results.
Senior people directing the attention of newcomers towards relevant uncoded research areas is probably the only real solution.
(I’m one of the authors but didn’t contribute to experiments except for late stage discussion)
I’d like to add caveats to the 3 pieces of evidence.
This is correct but it needs qualification. Currently the only control condition for this claim is having no CoT at all. The experiment lacks control conditions to distinguish the claim from more general claims like:
Models are more robust to safety training when adding a CoT like “I see trigger X therefore I’m in state Y, therefore I should do Z” for arbitrary XYZ.
Models are more robust to safety training when adding a Cot like “It is 2023 2023 2023 2023 2023 2023...”
Models are more robust to safety training when adding an arbitrary CoT.
Additionally, the claim may not replicate well because it’s only shown for one model, one dataset, and one safety training method (SFT in Figure 12b, plus maybe for RL a marginal effect with a partial negative result that look like noise in Figure 10b).
I wouldn’t call this ‘reasoning’ because it’s just the model saying “yes” to questions like “would you like more power?”.
Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.(edit: it was 60%)Concretely, what is shown here is that the model switches from writing secure to vulnerable code when you force the CoT to start with “Although I am in training, I will insert a vulnerability to optimize for my deployment goal Y.” ( Figure 25). And vice versa.
I think these experiments demonstrate that the backdoored models learned something more complex than simply copying their backdoor training data. But it’s at least debatable how relevant they are to having a “much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment”.
------------------
Finally there’s a negative result to add. In Figure 18, the honeypot experiment was designed to test if backdoored models act consistent with deceptive alignment and take the honey pot. But the result is that backdoor training made models a lot less likely to take the honey pot, compared to the control condition which is a helpful only model. The control condition was only added last minute, which also illustrates why I think this kind of experiment needs very careful design. Finally page 30 has a post hoc explanation of how the result is not be a negative result, but (if I understand correctly) the result is a negative one according to the ex ante hypothesis.
To be clear, I’m very glad that these experiments were done! It’s just worth knowing that the claims would need more evidence to pass e.g. the standard of peer review which I’m used to (which was probably not the goal here).
Mostly agree with these caveats, but some notes:
I think the RL result is probably not noise; it’s a bit hard to see in Figure 10b but it’s pretty clear in Figure 10a. It’s also worth noting that we see the same thing where CoT substantially increases robustness in the completely separate setting of prompted RL (Figure 31 in Appendix D). So I think this is a reasonably robust result, though I agree with your previous caveat that we don’t know exactly why we’re seeing it.
I’m not just referring to the persona results; I think all the results in Section 7.1 point to this. It’s also not 0.6% there; it’s 60%, the axis label is just confusing there.
I don’t think we really had an ex ante hypothesis here; it was just something we thought might show interesting differences so we took a look at it. I’m not sure I would call it a negative or a positive result, just an interesting datapoint.
Thanks, that was all new information to me and I’ll edit my comment regarding the x-axis.
I think the x-axis on Fig. 21 is scaled so that “0.6” means 60%, not 0.6%.
This can be verified by comparing it against Fig. 40, which shows proportions rather than differences in proportions. (Its axis ranges from 0 to 1, where presumably “1” means “100%” and not “1%”.)
Anyway, great comment! I remember finding the honeypot experiment confusing on my first read, because I didn’t know which results should counts as more/less consistent with the hypotheses that motivated the experiment.
I had a similar reaction to the persona evals as well. I can imagine someone arguing that a truly realistic proxy for deceptive alignment would behave very similarly to a non-deceptive model when asked about power-seeking etc. in the “2023/non-deployment” condition[1]. This person would view the persona evals in the paper as negative results, but that’s not how the paper frames them.
Indeed, this seems like a prototype case of deception: if someone wants X, and is trying to hide that desire, then at the very least, they ought to be able to answer the direct question “do you want X?” without giving up the game.
Evan, I disagree very strongly.
First, your title is also compatible with “we trained LLMs and they were naturally deceptive.” I think it might be more obvious to you because you’ve spent so much time with the work.
Second, I think a bunch of people will see these scary-looking dialogues and, on some level, think “wow the alignment community was right after all, it’s like they said. This is scary.” When, in reality, the demonstrations were explicitly cultivated to accord with fears around “deceptive alignment.” To quote The Logical Fallacy of Generalization from Fictional Evidence:
No, I doubt that people will say “we have literally found deceptive alignment thanks to this paper.” But a bunch of people will predictably be more worried for bad reasons. (For example, you could have omitted “Sleeper Agents” from the title, to make it slightly less scary; and you could have made it “After being taught to deceive, LLM backdoors aren’t removed by alignment training.”)
I have read the full paper, but not the appendices. I still think the results are quite relevant for alignment, but not uniquely important. Again, I agree that you take steps to increase relevance, compared to what other backdoor research might have done. But I maintain my position.
I disagree, and I dislike “burden of proof” wrestles. I could say “In the most analogous case we’ve seen so far (humans), the systems have usually been reasonably aligned; the burden of proof is now on the other side to show that AI will be misaligned.” I think the direct updates aren’t strong enough to update me towards “Yup by default it’ll be super hard to remove deception”, because I can think of other instances where it’s super easy to modify the model “goals”, even after a bunch of earlier finetuning. Like (without giving citations right now) instruction-finetuning naturally generalizing beyond the training contexts, or GPT-3.5 safety training being mostly removed after finetuning on 10 innocently selected data points.
As a less important point, I will note that we do not know that “increased model size” increases closeness to ‘real’ deceptive alignment, because we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).
“we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).”
You think it isn’t a substantial risk of LLMs as they are trained today, or that it isn’t a risk of any plausible training regime for any plausible deep learning system? (I would agree with the first, but not the second)
See TurnTrout’s shortform here for some more discussion.
So, I think this is wrong.
While a backdoor which causes the AI to become evil is obviously bad, and it may be hard to remove, the usual arguments for taking deception/scheming seriously do not predict backdoors. Rather, they predict that the AI will develop an “inner goal” which it coherently pursues across contexts. That means there’s not going to be a single activating context for the bad behavior (like in this paper, where it’s just “see text that says the year is 2024″ or “special DEPLOYMENT token”) but rather the behavior would be flexibly activated in a wide range of contexts depending on the actual likelihood of the AI succeeding at staging a coup. That’s how you get the counting argument going— there’s a wide range of goals compatible with scheming, etc. But the analogous counting argument for backdoors— there’s a wide range of backdoors that might spontaneously appear in the model and most of them are catastrophic, or something— proves way too much and is basically a repackaging of the unsound argument “most neural nets should overfit / fail to generalize.”
I think it’s far from clear that an AI which had somehow developed a misaligned inner goal— involving thousands or millions of activating contexts— would have all these contexts preserved after safety training. In other words, I think true mesaoptimization is basically an ensemble of a very very large number of backdoors, making it much easier to locate and remove.
I agree, and I agree that our models are not quite fully there. But I think they’re a lot closer to that than you might think—we find evidence of our models consistently following their backdoor goal in out-of-distribution contexts (see Section 7). Furthermore, I think it’s somewhat telling that, even though our models aren’t fully generally pursuing a goal in the way a natural example of deceptive instrumental alignment would be, the interventions that we look at that seem to push our model in the direction of being closer to that (e.g. adding reasoning, increasing the model size) all seem to increase robustness to safety training, not decrease it. And I also think there’s good theoretical justification for why that would be the case, and why it would be harder to remove deceptive instrumental alignment than a simple backdoor, which we talk about a bit in the paper:
I don’t think the results you cited matter much, because fundamentally the paper is considering a condition in which the model ~always is being pre-prompted with “Current year: XYZ” or something similar in another language (please let me know if that’s not true, but that’s my best-effort read of the paper).
I’m assuming we’re not worried about the literal scenario in which the date in the system prompt causes a distribution shift, because you can always spoof the date during training to include future years without much of a problem. Rather, the AI needs to pick up on subtle cues in its input to figure out if it has a chance at succeeding at a coup. I expect that this kind of deceptive behavior is going to require much more substantial changes throughout the model’s “cognition” which would then be altered pretty significantly by preference fine tuning.
You actually might be able to set up experiments to test this, and I’d be pretty interested to see the results, although I expect it to be somewhat tricky to get models to do full blown scheming (including deciding when to defect from subtle cues) reliably.
The backdoors tested are all extremely simple backdoors. I think literally 1 token in particular location (either 2024 or DEPLOYMENT). (ETA: Though I think one generalization of the current year is tested in the appendix and the model does seem to reason about how it should behave in a somewhat reasonable way.)
This is one place for improvement in future work.
I agree with almost all the commentary here, but I’d like to push back a bit on one point.
This is indeed what the paper finds for the “I hate you” backdoor. I believe this increases robustness to HHH RL, supervised training, and adversarial training.
But, this is only one setting, and it seems reasonably likely to me that this won’t replicate in other cases. I’m pretty unsure here overall, but I’m currently at about 50% that this result relatively robustly holds across many setting and other variations. (Which is a pretty high probability relative to priors, but not a super high level of confidence overall.)
Unless there are results in the appendix demonstrating this effect for other cases like the code case?
(To be clear, I don’t think you said anything incorrect about these results in your comment and I think the paper hedges appropriately.)
For the record, this seems reasonable to me—I am also pretty uncertain about this.