ryan_greenblatt comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

ryan_greenblatt 12 Jan 2024 23:01 UTC
LW: 16 AF: 11
3
AF

phrases like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”.

This feels like a misleading description of the result. I would have said: “the evidence suggests that if current ML systems were lying in wait with treacherous plans and instrumentally acting nice for now, we wouldn’t be able to train away the treachery”.

Like the models in this experiment don’t clearly spend much time “trying” to deceive except in some very broad implict sense.
What links here?
- Rohin Shah's comment on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training by evhub (18 Jan 2024 7:31 UTC; 3 points)
- ryan_greenblatt 12 Jan 2024 23:40 UTC
  LW: 11 AF: 2
  9
  AF Parent
  (Separately, I think there are a few important caveats with this work. In particular, the backdoor trigger is extremely simple (a single fixed token) and the model doesn’t really have to do any “reasoning” about when or how to strike. It plausible that experiments with these additional properties would imply that current models are too weak to lie in wait in any interesting way. But I expect that transformatively useful models will be strong enough.)
- Zvi 16 Jan 2024 14:46 UTC
  LW: 8 AF: 3
  2
  AF Parent
  That seems rather loaded in the other direction. How about “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”?
  - ryan_greenblatt 16 Jan 2024 18:47 UTC
    LW: 2 AF: 2
    0
    AF Parent
    Deceive kinda seems like the wrong term. Like when the AI is saying “I hate you” it isn’t exactly deceiving us. We could replace “deceive” with “behave badly” yielding: “The evidence suggests that if current ML systems were going to behave badly in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”.
    
    I agree that using terms like “lying in wait”, “treacherous plans”, or “treachery” are a loaded (though it technically means almost the same thing). So I probably shouldn’t have said this is a bit differently.
    
    I think the version of your statement with deceive replaced seems most accurate to me.
- Vladimir_Nesov 13 Jan 2024 9:37 UTC
  LW: 4 AF: 1
  0
  AF Parent
  
  Like the models in this experiment don’t clearly spend much time “trying” to deceive except in some very broad implict sense.
  
  As Zvi noted in a recent post, a human is “considered trustworthy rather than deceptively aligned” when they have hidden motives suppressed from manifesting (possibly even to the human’s own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it’s more like the property of humans being corruptible by absolute power. This ambiguity makes it more difficult for people to take deceptive alignment seriously as a problem.
  - RogerDearnaley 15 Jan 2024 6:29 UTC
    LW: 4 AF: 1
    0
    AF Parent
    As Zvi noted in a recent post, a human is “considered trustworthy rather than deceptively aligned” when they have hidden motives suppressed from manifesting (possibly even to the human’s own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it’s more like the property of humans being corruptible by absolute power.
    That’s what makes aligning LLM-powered ASI so hard: you need to produce something a lot more moral, selfless, and trustworthy than almost every human, nearly-all of whom couldn’t be safely trusted to continue (long-term) to act well if handed near-absolute power and the ability to run rings around the rest of society, including law enforcement. So you have to achieve a psychology that is almost vanishingly rare in the pretraining set. [However, superhuman intelligence is also nonexistent in the training set, so you also need to figure out how to do that on the capabilities side too.]
    - Vladimir_Nesov 15 Jan 2024 11:31 UTC
      5 points
      3
      Parent
      I think human level AGIs being pivotal in shaping ASIs is very likely if AGIs get developed in the next few years as largely the outcome of scaling, and still moderately likely overall. If that is the case, what matters is alignment of human level AGIs and the social dynamics of their deployment and their own activity. So control despite only being aligned as well as humans are (or somewhat better) might be sufficient, as one of the things AGIs might work on is improving alignment.
      
      The point about deceptive alignment being a special case of trustworthiness goes both ways, a deceptively aligned AI really can be a good ally, as long as the situation is maintained that prevents AIs from individually getting absolute power, and as long as the AIs don’t change too much from that baseline. Which are very difficult conditions to maintain while the world is turning upside down.
      - RogerDearnaley 15 Jan 2024 22:15 UTC
        1 point
        0
        Parent
        Agreed, and obviously that would be a lot more practicable if you knew what its trigger and secret goal were. Preventing deceptive alignment entirely would be ideal, but failing that we need reliable ways to detect it and diagnose its details: tricky to research when so far we only have model organisms of it, but doing interpretability work on those seems like an obvious first step.