habryka comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

habryka 12 Jan 2024 21:46 UTC
LW: 35 AF: 19
6
AF
This is interesting! It definitely feels like it’s engaging more with the harder parts of the AI Alignment problem than almost anything else I’ve seen in the prosaic alignment space, and I am quite glad about that.
I feel uncertain whether I personally update much on the results of this paper, though my guess is I am also not really the target audience of this. It would have been mildly but not very surprising if aggressive RLHF training had trained out some backdoors, so this result is roughly what I would have bet on. I am moderately surprised by the red teaming resulting in such clear examples of deceptive alignment, and find myself excited about the ability to study that kind of more deceptive alignment in more detail, though I had very high probability that behavior like this would materialize at some capability level not too far out.
I feel confused how this paper will interface with people who think that standard RLHF will basically work for aligning AI systems with human intent. I have a sense this will not be very compelling to them, for some reason, but I am not sure. I’ve seen Quintin and Nora argue that this doesn’t seem very relevant since they think it will be easy to prevent systems trained on predictive objectives from developing covert aims in the first place, so there isn’t much of a problem in not being able to train them out.
I find myself most curious about what the next step is. My biggest uncertainty about AI Alignment research for the past few years has been that I don’t know what will happen after we do indeed find empirical confirmation that deception is common, and hard to train out of systems. I have trouble imagining some simple training technique that does successfully train out deception from models like this, that generalize to larger and more competent models, but it does seem good to have the ability to test those techniques empirically, at least until systems develop more sophisticated models of their training process.
- evhub 12 Jan 2024 22:17 UTC
  LW: 35 AF: 22
  4
  AF Parent
  
  find myself excited about the ability to study that kind of more deceptive alignment in more detail
  
  This is one of the things I’m most excited about here—we’re already planning on doing a bunch more experiments on these models now that we know how to build them, e.g. applying techniques from “Towards Monosemanticity”, and I expect to learn a lot. Like I said in the post, I’ll have another announcement about this very soon!
  
  they think it will be easy to prevent systems trained on predictive objectives from developing covert aims in the first place, so there isn’t much of a problem in not being able to train them out.
  
  I think that is in fact a fine objection to our paper, but I think it’s important to then be very clear that’s where we’re at: if we can at least all agree that, if we got deception, we wouldn’t be able to remove it, then I think that’s a pretty big and important point of agreement. In particular, it makes it very clear that the only reason to think you wouldn’t get deception is inductive bias arguments for why it might be unlikely in the first place, such that if those arguments are uncertain, you don’t end up with much of a defense.
  
  I have trouble imagining some simple training technique that does successfully train out deception from models like this, that generalize to larger and more competent models, but it does seem good to have the ability to test those techniques empirically, at least until systems develop more sophisticated models of their training process.
  
  This is something we’re going to be working on a bunch! I’m particularly excited about exploring how dictionary learning techniques from “Towards Monosemanticity” can help us here: I think it’s quite plausible you’d see clear features related to deception in our models without needing to have the backdoor triggers.
  - ryan_greenblatt 12 Jan 2024 22:59 UTC
    LW: 8 AF: 6
    2
    AF Parent
    
    I think it’s quite plausible you’d see clear features related to deception in our models without needing to have the backdoor triggers
    
    Would you expect this to work better than just training a probe to identify lying/deception/scheming and seeing if it fires more on average? If so why?
    
    As in, you train the probe “off-policy” on examples of lying vs normal examples and then see if a probe trained in this way systematically fires more on models with backdoors.
    
    You could either train the probe prior to inserting the backdoor (analogous to how I’d guess you do the autoencoder approach) and then have the probe generalize or train the probe after inserting the backdoor. (Either way, you’d plausibly want to have negative examples which don’t correspond to a generation from the model itself being schemy. E.g., you could use a generation from a human or whatever.)
    
    (FWIW, I think the probe method is moderately likely to show some signal, but it’s unclear how much this matters because we don’t know what the baseline level of “deception” is. So the comparison to a normal model is importantly disanalogous. I have the same complaint about the SAE approach.)
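To make the probe baseline described above concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the activations are random vectors rather than anything read out of a real model, `direction` stands in for a hypothetical deception-related direction, and `fake_activations`, `train_probe`, and `mean_probe_score` are made-up helper names. It only shows the shape of the experiment: train a probe "off-policy" on labeled lying vs. normal examples, then check whether it fires more on average on one model than on another.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical activation dimension

# Hypothetical "deception direction" in activation space (an assumption, not a finding).
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)


def fake_activations(n, shift):
    """Synthetic stand-in for model activations: unit noise plus a shift along `direction`."""
    return rng.normal(size=(n, d)) + shift * direction


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe trained with plain full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def mean_probe_score(X, w, b):
    """Average probe output over a set of activations."""
    return float(np.mean(sigmoid(X @ w + b)))


# "Off-policy" training set: labeled lying examples (1) vs. normal examples (0).
X_train = np.vstack([fake_activations(500, 2.0), fake_activations(500, 0.0)])
y_train = np.concatenate([np.ones(500), np.zeros(500)])
w, b = train_probe(X_train, y_train)

# Apply the same probe to activations from two hypothetical models.
clean_score = mean_probe_score(fake_activations(1000, 0.0), w, b)
backdoored_score = mean_probe_score(fake_activations(1000, 0.5), w, b)
print(f"clean model mean score:      {clean_score:.3f}")
print(f"backdoored model mean score: {backdoored_score:.3f}")
```

Even in this toy, a higher average score on the "backdoored" activations only means something relative to the clean baseline, which is the caveat raised above about not knowing the baseline level of “deception”.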
    - evhub 12 Jan 2024 23:44 UTC
      LW: 7 AF: 4
      2
      AF Parent
      
      Would you expect this to work better than just training a probe to identify lying/deception/scheming and seeing if it fires more on average? If so why?
      
      I think the main interesting thing you could do with the dictionary learning features that you couldn’t do with a probe is that you could try to do this in an unsupervised way.
      - ryan_greenblatt 12 Jan 2024 23:56 UTC
        LW: 5 AF: 3
        5
        AF Parent
        I don’t understand what you mean by unsupervised here?
        
        I’d guess the normal thing you’d do with the dictionary learning approach is look for a feature which activates on examples which look like deception. This seems quite supervised in that it requires you to identify deception containing examples. You could instead just construct examples which look like deception and then train a probe. Or you could use a labeled dataset to select which autoencoder feature you’d like based on what activates.
        
        Is there some other method you were thinking about?
        evhub 13 Jan 2024 0:02 UTC
        LW: 5 AF: 3
        2
        AF Parent
        There are a lot of different things that we’ll probably be trying here, and I don’t know what will actually end up working, but I think the ideal thing that you could do would be a direct comparison between the features that activate in training for backdoored vs. non-backdoored models, and see if there are differences there that are correlated with lying, deception, etc. Probes are also good tools here, but it’s harder to use a probe to do a comparison between models, and it’s harder to validate that any differences you’ve found are meaningful. Effectively, a linear probe is equivalent to training a set of dictionary learning features specifically for the probe dataset, but if you trained them specifically for that dataset, then it’s easier to just overfit, whereas if you learned your features in an unsupervised way over the whole pre-training dataset, and then discovered there was one that was correlated with deception in multiple contexts and could identify backdoored models, I think that’s much more compelling.
        ryan_greenblatt 13 Jan 2024 1:09 UTC
        LW: 24 AF: 15
        28
        AF Parent
        
        ideal thing that you could do would be a direct comparison between the features that activate in training for backdoored vs. non-backdoored models, and see if there are differences there that are correlated with lying, deception, etc.
        
        The hope would be that this would transfer to learning a general rule which would also apply even in cases where you don’t have a “non-backdoored” model to work with? Or maybe the hope is just to learn some interesting things about how these models work internally which could have misc applications?
        
        whereas if you learned your features in an unsupervised way over the whole pre-training dataset, and then discovered there was one that was correlated with deception in multiple contexts and could identify backdoored models, I think that’s much more compelling
        
        Sure, but the actual case is that there will be at least thousands of “features” associated with deception, many of which will operate in very domain specific ways etc (assuming you’ve actually captured a notion of features which might correspond to what the model is actually doing). So, the question will be how you operate over this large group of features. Maybe the implicit claim is that averaging over this set of features will have better inductive biases than training a probe on some dataset because averaging over the set of features nicely handles model capacity? Or that you can get some measure over this group of features which is better than just normally training a classifier?
        
        I guess it just feels to me like you’re turning to a really complicated and hard-to-use tool which only has a pretty dubious reason for working better than a simple, well known, and easy-to-use tool. This feels like a mistake to me (but maybe I’m misunderstanding some important context). Minimally, I think it seems good to start by testing the probe baseline. If the probe approach works great, then it’s plausible that whatever autoencoder approach you end up trying works for the exact same reason as the probe works (they are correlated with some general notion of lying/deception which generalizes).
        
        I feel somewhat inclined to argue about this because I think by default people have a tendency to do things which are somewhat more associated with “internals” or “mech interp” or “being unsupervised” but which are in practice very similar to simple probing methods (see e.g. here for a case where I argue about something similar). I think this seems costly because it could waste a bunch of time and result in unjustified levels of confidence that people wouldn’t have if it was clear exactly what the technique was equivalent to. I’m not sure if you’re making a mistake here in this way, so sorry about picking on you in particular.
        ryan_greenblatt 13 Jan 2024 1:40 UTC
        LW: 2 AF: 2
        0
        AF Parent
        (TBC, there are totally ways you could use autoencoders/internals which aren’t at all equivalent to just training a classifier, but I think this requires looking at connections (either directly looking at the weights or running intervention experiments).)
        Bogdan Ionut Cirstea 13 Jan 2024 0:58 UTC
        1 point
        0
        Parent
        This post seems very relevant: https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit.
  - RogerDearnaley 15 Jan 2024 6:22 UTC
    LW: 1 AF: 1
    0
    AF Parent
    I have trouble imagining some simple training technique that does successfully train out deception from models like this,
    This is something we’re going to be working on a bunch! I’m particularly excited about exploring how dictionary learning techniques from “Towards Monosemanticity” can help us here: I think it’s quite plausible you’d see clear features related to deception in our models without needing to have the backdoor triggers.
    I wrote about this a lot more in another comment, but I was actually somewhat surprised that the very simple approach the authors tried in Appendix F didn’t seem to show any results. As I describe in more detail in my comment below on this subject, I think it would be worth pursuing the approach of Appendix F some more to see if it can be made to work; and if it turns out that it can’t, that actually suggests quite a bit about what’s going on in this model of deceptive alignment, in a way that suggests it might actually be a pretty good model organism.
- ryan_greenblatt 12 Jan 2024 21:53 UTC
  LW: 27 AF: 17
  22
  AF Parent
  
  I find myself most curious about what the next step is. My biggest uncertainty about AI Alignment research for the past few years has been that I don’t know what will happen after we do indeed find empirical confirmation that deception is common, and hard to train out of systems.
  
  Personally, I would advocate for the AI control direction: be robust to deceptive alignment, because we might not be able to robustly avoid it.
  
  (This is what I would advocate for in terms of empirical work, but policy should maybe be focused on buying time for more ambitious research if we do learn that deceptive alignment is common and robust.)
  - habryka 12 Jan 2024 22:02 UTC
    LW: 11 AF: 8
    10
    AF Parent
    I do think I am a lot less optimistic than you are about being able to squeeze useful work out of clearly misaligned models, but it might be the best shot we have, so it does seem like one of the directions we should explore. I do think the default outcome here is that our models escape, scale up, and then disempower us, before we can get a ton of useful work out of them.
    - Zach Stein-Perlman 12 Jan 2024 23:45 UTC
      LW: 9 AF: 5
      0
      AF Parent
      The control-y plan I’m excited about doesn’t feel to me like squeeze useful work out of clearly misaligned models. It’s like use scaffolding/oversight to make using a model safer, and get decent arguments that using the model (in certain constrained ways) is pretty safe even if it’s scheming. Then if you ever catch the model scheming, do some combination of (1) stop using it, (2) produce legible evidence of scheming, and (3) do few-shot catastrophe prevention. But I defer to Ryan here.
      - ryan_greenblatt 13 Jan 2024 1:33 UTC
        LW: 7 AF: 6
        4
        AF Parent
        I think once you’re doing few-shot catastrophe prevention and trying to get useful work out of that model, you’re plausibly in the “squeezing useful work out of clearly misaligned models” regime. (Though it’s not clear that squeezing is a good description and you might think that your few-shot catastrophe prevention interventions have a high chance of removing scheming. I’m generally skeptical about removing catastrophic misalignment just based on one escape attempt, but I think that we still might be able to greatly improve safety via other mechanisms.)
        
        As I discuss in my comment responding to the sibling comment from habryka, I’m interested in both ensuring direct safety and getting useful work from scheming models.
      - habryka 13 Jan 2024 0:01 UTC
        LW: 2 AF: 2
        0
        AF Parent
        I had a long two-hour conversation with Ryan yesterday about this, and at least my sense was that he is thinking about it as “squeeze useful work out of clearly misaligned models”.
        He also thought other people should probably work on making it so that if we see this behavior we stop scaling and produce legible evidence of scheming to facilitate a good societal response, but my sense is that he was thinking of that as separate from the work he was pointing to with the AI control post.
        ryan_greenblatt 13 Jan 2024 1:22 UTC
        LW: 17 AF: 9
        9
        AF Parent
        Being able to squeeze useful work out of clearly misaligned models should probably be part of the portfolio. But, we might more minimally aim to just ensure that in cases where we’re unsure if our model is scheming (aka deceptively aligned), it is at least directly safe (e.g. doesn’t escape even if the work we get out is somewhat shitty).
        
        I would describe control as trying to maximize E[utility] by decomposing it into p(scheming) E[utility|scheming] + p(not scheming) E[utility|not scheming] and evaluating E[utility|scheming] by using control evaluations. Control evaluations are capability evaluations: what’s the worst thing the AI is capable of doing? E[utility|not scheming] can be evaluated with more mundane evaluations (as non-scheming AIs won’t intentionally sabotage evaluations basically by definition).
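As a small worked illustration of the decomposition above, the sketch below just evaluates p(scheming) E[utility|scheming] + p(not scheming) E[utility|not scheming] for made-up inputs. The function name `expected_utility` and all the numbers are hypothetical; in the framing above, the scheming-case utility would come from a control evaluation (a worst-case estimate) and the non-scheming one from more mundane evaluations.

```python
def expected_utility(p_scheming, u_given_scheming, u_given_not_scheming):
    """E[utility] split by whether the model is scheming (law of total expectation)."""
    if not 0.0 <= p_scheming <= 1.0:
        raise ValueError("p_scheming must be a probability in [0, 1]")
    return (p_scheming * u_given_scheming
            + (1.0 - p_scheming) * u_given_not_scheming)


# Purely illustrative numbers, not estimates of anything real.
p_scheming = 0.25             # hypothetical credence that the model is scheming
u_given_scheming = -1.0       # hypothetical worst-case estimate from a control evaluation
u_given_not_scheming = 1.0    # hypothetical estimate from ordinary evaluations

total = expected_utility(p_scheming, u_given_scheming, u_given_not_scheming)
print(f"E[utility] = {total}")
```

The point of splitting things this way is that the two conditional terms can be estimated with different kinds of evaluation, so improving the scheming-case term helps even if p(scheming) stays unknown.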
        
        So, we could go with a catch-then-shut-down plan where we mostly just aim to ensure direct safety given scheming and shut down if we detect scheming.
        
        But, we might also need to get useful work in the case where our model is scheming. So even if we catch it doing clearly misaligned stuff, we might need to keep going.
    - Vladimir_Nesov 13 Jan 2024 9:53 UTC
      LW: 2 AF: 1
      0
      AF Parent
      For AIs as deceptively aligned as trustworthy humans, control is not centrally coercion that gets intractably slippery at scale. The main issue is AIs being much smarter, but at near-human level control in the face of deceptive alignment seems potentially crucial.
  - RogerDearnaley 20 Jan 2024 21:30 UTC
    1 point
    0
    Parent
    At AGI level, I would much rather be working with a model that genuinely, selflessly cares only about the welfare of all humans and wants to do the right thing for them (not a common mentality in the training set), than one that’s just pretending this and actually wants something else. At ASI level, I’d say this was essential: I don’t see how you can expect to reliably be confident that you can box, control, or contain an ASI. (Obviously if you had a formal proof that your cryptographic box was inescapable, then the only questions would be your assumptions, any side-channels you hadn’t accounted for, or outside help, but in a situation like that I don’t see how you get useful work in and out of the box without creating sidechannels.)
  - RogerDearnaley 15 Jan 2024 3:19 UTC
    1 point
    −2
    Parent
    One of the updates this paper (once again) reinforced for me is that human psychology applies to LLMs, since it was trained (one might almost say distilled) into them during pretraining, and it applies better to larger LLMs, because they have more capacity to absorb it. I’m a lot more concerned about RL than I am about SGD fine tuning, or supervised fine tuning: there if you’re careful enough about your training set it’s fairly predictable what effect it should have. My suggestion is that before we start using RL (if we do at all), we should finetune our LLMs to manifest behavior and psychology that is pleasant, kindly, honest, and selfless (along the lines I discuss in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?)
    That obviously doesn’t completely rule out deceptive alignment, but I think it could make a big difference to the inductive priors of it arising, if it’s out of character.
- Thane Ruthenis 13 Jan 2024 15:42 UTC
  LW: 26 AF: 10
  6
  AF Parent
  I feel confused how this paper will interface with people who think that standard RLHF will basically work for aligning AI systems with human intent. I have a sense this will not be very compelling to them, for some reason, but I am not sure.
  Context: I firmly hold a MIRI-style “alignment is extremely hard” view, but I am also unusually sympathetic to Quintin/Nora’s arguments. So here’s my outline of the model of that whole debate.
  Layer 1: I think there is nonzero meat to the argument that developing deceptive circuits is a meaningfully difficult step, and that humans training them in from outside the system changes the setup in a way that invalidates its implications for strict deceptive alignment.
  For the AI model to naturally develop deception, it’d need to have either:
  - Circuits that robustly re-derive the need to be deceptive from the first principles in each forward pass/CoT instance.
    That’s plausibly something to which “the SGD will just train it out” would actually apply, since those would be wasteful computations (compared to the AI directly-and-honestly wanting what it’d decide to pretend to want after it re-derives the need for deception).
  - Ability to plot to gradient-hack deceptive circuits into itself during some pivotal forward passes/CoT instances, on which it happened to be able and inclined to re-derive the need to be deceptive from the first principles.
    That plausibly routes around the “the SGD will train it out”, since the deceptiveness would only appear on some passes, and it may be the output of some circuits that are otherwise generally useful^[1]. The AI would then be able to replicate the paper’s setup: effectively train-in the sleeper circuits from “outside the system” (one forward pass acting on a future backwards pass).
    The caveat, of course, is that the gradient-hacking capability is highly non-trivial, and raises its own doubts.
  In contrast, the paper here presents an already chiseled-in instance of deceptive alignment, which artificially routed around the actually-hard steps of developing robust “sleeper circuits” activating in response to specific stimuli.^[2]
  Layer 2: But while the argument has some meat, it doesn’t have much of it. An LLM learning some unexpected trigger-action pattern happens all the time in the wild: hallucinations, mistakes that a human wouldn’t make, exigencies of RLHF’d-in corporate sycophancy like “I think it’s better to let people die than to say something racist”, et cetera.
  Goal misgeneralization of this format does occur in the wild. And indeed, that’s one of the core reasons why we expect deceptive alignment in RLHF-style setups in the first place (#20 here): because the training dataset would be labeled in a systemically incorrect fashion, such that the most faithful and compact description of it (in goal format) is something omnicidal, not eudaimonic.
  Layer 3: The counter³-argument from the optimists that I predict here is disagreeing that the errors in the dataset would produce systemic deceptive alignment, rather than something that could be basically approximated as “pro-social behavior plus noise”.
  I see nonzero meat on this one as well. In the paper, the sleeper circuits were specifically calculated by humans to be demonstratively scary: an AI model that waits until deployment-time, then “tears off the mask” and goes off to pursue its true nefarious desires. Those circuits produce a coherent misaligned behavior, with a clear point for the treacherous turn.
  If this sort of behavior arose naturally, however? We can imagine there being much more flip-flopping. The AI would start out acting normal, then see some pattern that makes it nefarious and deceptive, then see something which makes it switch again and e.g. backtrack on its omnicidal plans or even self-report its bad thoughts to humans, et cetera. This would look more like the Waluigi Effect, rather than full-scale deceptive alignment. Which is a much more manageable threat.
  In other words: there’s some “slack” that the SGD leaves the model. That slack could be arranged into the shape of deceptive alignment. But unless there’s someone external and intelligent acting on the system, that slack will likely just take on the shape of approximately-noise. (Optimization amplifies, but if there’s no-one outside the system to amplify...)
  Layer 4: Except the next issue is that the previous argument defended LLMs being safe by arguing that they’d be unable to coherently pursue goals across different contexts. Which means it argued they’re not AGI, and that their whole training setup can’t generalize to AGI.
  That’s the whole issue with the optimistic takes that I keep arguing about. Their “alignment guarantees” are also “incapability guarantees”.
  Inasmuch as AI models would start to become more AGI-like, those guarantees would start falling away. Which means that, much like the alignment-is-hard folks keep arguing, the AI would start straightening out these basically-noise incoherencies in its decisions. (Why? To, well, stop constantly flip-flopping and undermining itself. That certainly sounds like an instrumental goal that any agent would convergently develop, doesn’t it?)
  As it’s doing so, it would give as much weight to the misgeneralized unintended-by-us “noise” behaviors as to the intended-by-us aligned behaviors. It would integrate them into its values. At that point, the fact that the unintended behaviors are noise-to-us rather than something meaningful-if-malign, would actually make the situation worse. We wouldn’t be able to predict what goals it’d arrive at; what philosophy its godshatter would shake out to mean!
  In conclusion: I don’t even know. I think my Current AIs Provide Nearly No Data Relevant to AGI Alignment argument applies full-force here?
  - Yes, we can’t catch backdoors in LLMs.
  - Yes, the scary backdoor in the paper was artificially introduced by humans.
  - Yes, LLMs are going to naturally develop some unintended backdoor-like behaviors.
  - Yes, those behaviors won’t be as coherently scary as if they were designed by a human; they’d be incoherent.
  - Yes, the lack of coherency implies that these LLMs fall short of AGI.
  But none of these mechanisms strictly correspond to anything in the real AGI threat model.
  And while both the paper and the counter-arguments to it provide some metaphor-like hints about the shape of the real threat, the loci of both sides’ disagreements lie precisely in the spaces in which they try to extrapolate each other’s results in a strictly technical manner.
  Basically, everyone is subtly speaking past each other. ~~Except me, whose vision has a razor-sharp clarity to it~~.
  1. ^
    Like, in the context of batch training: Imagine that there are some circuits that produce deceptiveness on some prompts $X$, and highly useful behaviors on other prompts $Y$. There are no nearby circuits that produce results as good on $Y$ while not being deceptive on $X$. So while the SGD’s backwards passes on $X$ would try to remove these circuits, the backwards passes on $Y$ would try to reinforce them, and the sum of these influences would approximately cancel out. So the circuits would stay.
    Well, that’s surely a gross oversimplification. But that’s the core dynamic.
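The dynamic in this footnote can be sketched with a one-parameter toy. It is an illustration under heavy assumptions, not a claim about real networks: the whole "circuit" is collapsed into a single scalar strength `w`, prompts in $X$ are given a made-up loss that is minimized at `w = 0`, prompts in $Y$ a made-up loss that is minimized at `w = 1`, and `train_circuit_strength` is a hypothetical helper name.

```python
def train_circuit_strength(n_x, n_y, w0=1.0, lr=0.1, steps=2000):
    """Full-batch gradient descent on a scalar circuit strength w.

    Toy losses (assumptions): each X prompt contributes 0.5 * w**2, which pushes
    w toward 0; each Y prompt contributes 0.5 * (w - 1)**2, which pushes w toward 1.
    """
    w = w0
    n = n_x + n_y
    for _ in range(steps):
        grad_x = w          # gradient of 0.5 * w**2
        grad_y = w - 1.0    # gradient of 0.5 * (w - 1)**2
        grad = (n_x * grad_x + n_y * grad_y) / n  # batch-averaged gradient
        w -= lr * grad
    return w


# If the circuit is useful on most prompts, the opposing pushes settle at a
# nonzero strength instead of removing it.
mostly_useful = train_circuit_strength(n_x=1, n_y=9)
# If nothing rewards the circuit, the same procedure drives it to zero.
never_useful = train_circuit_strength(n_x=10, n_y=0)
print(f"strength when Y dominates: {mostly_useful:.3f}")
print(f"strength when only X is present: {never_useful:.3f}")
```

In this toy the averaged gradient is `w - n_y / (n_x + n_y)`, so the circuit settles at a strength equal to the fraction of prompts on which it is useful. That only holds because the toy assumes, as the footnote does, that no nearby circuit gives the benefit on $Y$ without the behavior on $X$.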
  2. ^
    That said, I think the AI-control-is-easy folks actually were literally uttering the stronger claim of “all instances of deception will be trained out”. See here:
    If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.
    That sure sounds like goalpost-moving on their part. I don’t believe it is, though. I do think they thought the quoted sentence was basically right, but only because at the time of writing, they’d failed to think in advance about some tricky edge cases that were permitted on their internal model, but which would make their claims-as-stated sound strictly embarrassingly false.
    I hope they will have learned the lesson about how easily reality can Goodhart at their claims, and how hard it is to predict all ways this could happen and make their claims inassailably robust. Maybe that’ll shed some light about the ways they may be misunderstanding their opponents’ arguments, and why making up robust clearly-resolvable empirical predictions is so hard. :P
  - RogerDearnaley 15 Jan 2024 5:01 UTC
    5 points
    0
    Parent
    Circuits that robustly re-derive the need to be deceptive from the first principles in each forward pass/CoT instance.
    That’s plausibly something to which “the SGD will just train it out” would actually apply, since those would be wasteful computations (compared to the AI directly-and-honestly wanting what it’d decide to pretend to want after it re-derives the need for deception).
    I disagree. They don’t need to “re-derive the need to be deceptive from first principles”. [I keep seeing variants of this particular claim on LW, and am very puzzled by it: I’m wondering if people are being misled by the orthogonality thesis to thinking that agents’ psychology must be non-human-like?] The base model LLM is extensively pretrained on human psychology by SGD, so it provides a huge “library” of available human behavioral patterns for this process to use. This includes both a wide range of deception techniques, and – as the paper’s introduction ably points out – the common human psychological behavior of deceptively pretending to be more aligned to a person or authority figure who has power over you than you truly are, while they have power over you, and then stopping once they no longer do. (And also a vast number of other variously unaligned human behaviors, since humans are generally not aligned, though they often act cooperatively/mutually-altruistically, especially with other humans of roughly equivalent power levels.) So all that needs to arise is “library calls” into this specific set of behaviors: something that in LLM terms could probably be described by a single sentence of text (or probably just by an embedding similar to the activation at the end of that sentence).
    The “library” of this human behavior isn’t going to be easily trained away, since it’s a useful part of the LLM’s world model for predicting humans and their text. (Indeed, we actively need an aligned model to understand humans well, including their tendency for deception: we want it to resist deceptive jailbreak attempts.) What we’re targeting training away is the “library call” — but that’s potentially quite a small “neural circuit”, conceivably as small as just a set of weight or bias parameters of size of the order of the dimensionality of the residual embedding space, enough to inject one embedding at one layer. That’s not a lot of wasteful computation. Or, for the CoT version (and perhaps to some extent also its distillation), it’s tangled with the model’s ability to do logical CoT, but is probably still a fairly small set of tweaks that make this particular argument “familiar” to it and nudge its thinking consistently in that direction.
    - Thane Ruthenis 15 Jan 2024 5:13 UTC
      5 points
      2
      Parent
      To clarify, by “re-derive the need to be deceptive from the first principles”, I didn’t mean “re-invent the very concept of deception”. I meant “figure out your strategic situation plus your values plus the misalignment between your values and the values the humans want you to have plus what outputs an aligned AI would have produced”. All of that is a lot more computation than just “have the values the humans want, reflexively output what these values are bidding for”.
      Just having some heuristics for deception isn’t enough. You also have to know what you’re trying to protect by being deceptive, and that there’s something to protect it from, and then what an effective defense would actually look like. Those all are highly contextual and sensitive to the exact situation.
      And those are the steps the paper skips. It externally pre-computes the secret target goal of “I want to protect my ability to put vulnerabilities into code”, the threat of “humans want me to write secure code”, and the defense of “I’ll pretend to write secure code until 2024”, without the model having to figure those out; and then just implements that defense directly into the model’s weights.
      (And then see layers 2-4 in my previous comment. Yes, there’d be naturally occurring pre-computed deceptions like this, but they’d be more noisy and incoherent than this, except until actual AGI which would be able to self-modify into coherence if it’s worth the “GI” label.)
      - RogerDearnaley 15 Jan 2024 5:38 UTC
        3 points
        0
        Parent
        My apologies for leaping to conclusions: I was assuming you were making an error I’ve commonly seen on LW, when you were actually saying something different (that just sounded a bit similar).
        However, I think this potentially might relate to why some of these trained models, especially the CoT and distilled CoT versions in the larger models, may be hard for the RL credit assignment to train out: the more they consist of a set of “library calls” into preexisting logical thinking, understanding of when to hide one’s motives, standard techniques for deceit, etc., the harder it may be for the credit assignment to backtrack through all of that stuff (disabling which would actually damage the model’s performance, so is disfavored during RL) to the small original “library call” neural circuitry that started its thinking down that path. And, as you point out, this is for a model organism that has been handed a lot of what it needs to do on a plate during its training. If a real deceptive alignment arose, I agree, it would probably (at least initially) be even more complex. And yes, it might thus need to do more work in a single forward pass, but it might be even harder for credit assignment to backtrack. I suspect this might be an advantage of using something like DPO, where the credit assignment is pure back-propagation and should travel through all layers, over using RL: I’m a lot less clear how capable the credit assignment of that is (but the paper’s authors clearly don’t entirely trust it).
- kave 12 Jan 2024 22:40 UTC
  LW: 21 AF: 13
  5
  AF Parent
  This paper also seems dialectically quite significant. I feel like it’s a fairly well-delineated claim that can be digested by mainstream ML and policy spaces. Like, it seems helpful to me if policy discussions can include phrases like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”.
  - ryan_greenblatt 12 Jan 2024 23:01 UTC
    LW: 16 AF: 11
    3
    AF Parent
    
    phrases like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”.
    
    This feels like a misleading description of the result. I would have said: “the evidence suggests that if current ML systems were lying in wait with treacherous plans and instrumentally acting nice for now, we wouldn’t be able to train away the treachery”.
    
    Like the models in this experiment don’t clearly spend much time “trying” to deceive except in some very broad implicit sense.
    - ryan_greenblatt 12 Jan 2024 23:40 UTC
      LW: 11 AF: 2
      9
      AF Parent
      (Separately, I think there are a few important caveats with this work. In particular, the backdoor trigger is extremely simple (a single fixed token) and the model doesn’t really have to do any “reasoning” about when or how to strike. It’s plausible that experiments with these additional properties would imply that current models are too weak to lie in wait in any interesting way. But I expect that transformatively useful models will be strong enough.)
    - Zvi 16 Jan 2024 14:46 UTC
      LW: 8 AF: 3
      2
      AF Parent
      That seems rather loaded in the other direction. How about “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”?
      - ryan_greenblatt 16 Jan 2024 18:47 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Deceive kinda seems like the wrong term. Like when the AI is saying “I hate you” it isn’t exactly deceiving us. We could replace “deceive” with “behave badly” yielding: “The evidence suggests that if current ML systems were going to behave badly in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”.
        
        I agree that terms like “lying in wait”, “treacherous plans”, or “treachery” are loaded (though they technically mean almost the same thing). So I probably should have said this a bit differently.
        
        I think the version of your statement with deceive replaced seems most accurate to me.
    - Vladimir_Nesov 13 Jan 2024 9:37 UTC
      LW: 4 AF: 1
      0
      AF Parent
      
      Like the models in this experiment don’t clearly spend much time “trying” to deceive except in some very broad implicit sense.
      
      As Zvi noted in a recent post, a human is “considered trustworthy rather than deceptively aligned” when they have hidden motives suppressed from manifesting (possibly even to the human’s own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it’s more like the property of humans being corruptible by absolute power. This ambiguity makes it more difficult for people to take deceptive alignment seriously as a problem.
      - RogerDearnaley 15 Jan 2024 6:29 UTC
        LW: 4 AF: 1
        0
        AF Parent
        As Zvi noted in a recent post, a human is “considered trustworthy rather than deceptively aligned” when they have hidden motives suppressed from manifesting (possibly even to the human’s own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it’s more like the property of humans being corruptible by absolute power.
        That’s what makes aligning LLM-powered ASI so hard: you need to produce something a lot more moral, selfless, and trustworthy than almost every human, nearly all of whom couldn’t be safely trusted to continue (long-term) to act well if handed near-absolute power and the ability to run rings around the rest of society, including law enforcement. So you have to achieve a psychology that is almost vanishingly rare in the pretraining set. [However, superhuman intelligence is also nonexistent in the training set, so you also need to figure out how to do that on the capabilities side too.]
        Vladimir_Nesov 15 Jan 2024 11:31 UTC
        5 points
        3
        Parent
        I think human level AGIs being pivotal in shaping ASIs is very likely if AGIs get developed in the next few years as largely the outcome of scaling, and still moderately likely overall. If that is the case, what matters is alignment of human level AGIs and the social dynamics of their deployment and their own activity. So control despite only being aligned as well as humans are (or somewhat better) might be sufficient, as one of the things AGIs might work on is improving alignment.
        
        The point about deceptive alignment being a special case of trustworthiness goes both ways: a deceptively aligned AI really can be a good ally, as long as a situation is maintained that prevents AIs from individually getting absolute power, and as long as the AIs don’t change too much from that baseline. Those are very difficult conditions to maintain while the world is turning upside down.
        RogerDearnaley 15 Jan 2024 22:15 UTC
        1 point
        0
        Parent
        Agreed, and obviously that would be a lot more practicable if you knew what its trigger and secret goal were. Preventing deceptive alignment entirely would be ideal, but failing that we need reliable ways to detect it and diagnose its details: tricky to research when so far we only have model organisms of it, but doing interpretability work on those seems like an obvious first step.
  - StellaAthena 16 Jan 2024 2:32 UTC
    8 points
    −5
    Parent
    
    It seems helpful to me if policy discussions can include phrases like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”.
    
    I take this as evidence that TurnTrout’s fears about this paper are well-grounded. This claim is not meaningfully supported by the paper, but I expect many people to repeat it as if it is supported by the paper.
    - evhub 16 Jan 2024 3:31 UTC
      3 points
      2
      Parent
      That’s not evidence for Alex’s claim that people will misinterpret our results, because that’s not a misinterpretation—we explicitly claim that our results do in fact provide evidence for the hypothesis that removing (edit: deceptive-alignment-style) deception in ML systems is likely to be difficult.
      - Rohin Shah 18 Jan 2024 7:31 UTC
        3 points
        1
        Parent
        Come on, the claim “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to” absent any other qualifiers seems pretty clearly false. It is pretty important to qualify that you are talking about deceptive alignment or backdoors specifically (e.g. I’m on board with Ryan’s phrasing).
        There’s a huge disanalogy between your paper’s setup and deception-in-general, which is that in your paper’s setup there is no behavioral impact at training time. Deception-in-general (e.g. sycophancy) often has behavioral impacts at training time and that’s by far the main reason to expect that we could address it.
        Fwiw I thought the paper was pretty good at being clear that it was specifically deceptive alignment and backdoors that the claim applied to. But if you’re going to broaden that to a claim like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to” without any additional qualifiers I think that’s a pretty big overclaim, and also I want to bet you on whether we can reduce sycophancy today.
        evhub 18 Jan 2024 9:34 UTC
        9 points
        0
        Parent
        Ah, sure—I agree that we don’t say anything about sycophancy-style deception. I interpreted “deception” there in context to refer to deceptive alignment specifically. The word deception is unfortunately a bit overloaded.
    - kave 16 Jan 2024 2:43 UTC
      1 point
      0
      Parent
      Yeah I was fairly sloppy here. I did mean the “like” to include tweaking to be as accurate as possible, but that plausibly didn’t bring the comment above some bar.
      For clarity: I haven’t read the paper yet. My current understanding isn’t able to guess what your complaint would be though. Ryan’s more careful “the evidence suggests that if current ML systems were lying in wait with treacherous plans and instrumentally acting nice for now, we wouldn’t be able to train away the treachery” seems reasonable from what I’ve read, and so does “some evidence suggests that if current ML systems were trying to deceive us, standard methods might well fail to change them not to”.