Vladimir_Nesov comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Vladimir_Nesov 13 Jan 2024 9:37 UTC
LW: 4 AF: 1
0
AF

Like the models in this experiment don’t clearly spend much time “trying” to deceive except in some very broad implict sense.

As Zvi noted in a recent post, a human is “considered trustworthy rather than deceptively aligned” when they have hidden motives suppressed from manifesting (possibly even to the human’s own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it’s more like the property of humans being corruptible by absolute power. This ambiguity makes it more difficult for people to take deceptive alignment seriously as a problem.
- RogerDearnaley 15 Jan 2024 6:29 UTC
  LW: 4 AF: 1
  0
  AF Parent
  As Zvi noted in a recent post, a human is “considered trustworthy rather than deceptively aligned” when they have hidden motives suppressed from manifesting (possibly even to the human’s own conscious attention) by current circumstances. So deceptive alignment is not even centrally a special case of deception, it’s more like the property of humans being corruptible by absolute power.
  That’s what makes aligning LLM-powered ASI so hard: you need to produce something a lot more moral, selfless, and trustworthy than almost every human, nearly-all of whom couldn’t be safely trusted to continue (long-term) to act well if handed near-absolute power and the ability to run rings around the rest of society, including law enforcement. So you have to achieve a psychology that is almost vanishingly rare in the pretraining set. [However, superhuman intelligence is also nonexistent in the training set, so you also need to figure out how to do that on the capabilities side too.]
  - Vladimir_Nesov 15 Jan 2024 11:31 UTC
    5 points
    3
    Parent
    I think human level AGIs being pivotal in shaping ASIs is very likely if AGIs get developed in the next few years as largely the outcome of scaling, and still moderately likely overall. If that is the case, what matters is alignment of human level AGIs and the social dynamics of their deployment and their own activity. So control despite only being aligned as well as humans are (or somewhat better) might be sufficient, as one of the things AGIs might work on is improving alignment.
    
    The point about deceptive alignment being a special case of trustworthiness goes both ways, a deceptively aligned AI really can be a good ally, as long as the situation is maintained that prevents AIs from individually getting absolute power, and as long as the AIs don’t change too much from that baseline. Which are very difficult conditions to maintain while the world is turning upside down.
    - RogerDearnaley 15 Jan 2024 22:15 UTC
      1 point
      0
      Parent
      Agreed, and obviously that would be a lot more practicable if you knew what its trigger and secret goal were. Preventing deceptive alignment entirely would be ideal, but failing that we need reliable ways to detect it and diagnose its details: tricky to research when so far we only have model organisms of it, but doing interpretability work on those seems like an obvious first step.