Zach Stein-Perlman comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Zach Stein-Perlman 12 Jan 2024 23:45 UTC
LW: 9 AF: 5
0
AF
The control-y plan I’m excited about doesn’t feel to me like “squeeze useful work out of clearly misaligned models”. It’s more like: use scaffolding/oversight to make using a model safer, and get decent arguments that using the model (in certain constrained ways) is pretty safe even if it’s scheming. Then if you ever catch the model scheming, do some combination of (1) stop using it, (2) produce legible evidence of scheming, and (3) do few-shot catastrophe prevention. But I defer to Ryan here.
- ryan_greenblatt 13 Jan 2024 1:33 UTC
  LW: 7 AF: 6
  4
  AF Parent
  I think once you’re doing few-shot catastrophe prevention and trying to get useful work out of that model, you’re plausibly in the “squeezing useful work out of clearly misaligned models” regime. (Though it’s not clear that “squeezing” is a good description, and you might think that your few-shot catastrophe prevention interventions have a high chance of removing scheming. I’m generally skeptical about removing catastrophic misalignment just based on one escape attempt, but I think that we still might be able to greatly improve safety via other mechanisms.)
  
  As I discuss in my comment responding to the sibling comment from habryka, I’m interested in both ensuring direct safety and getting useful work from scheming models.
- habryka 13 Jan 2024 0:01 UTC
  LW: 2 AF: 2
  0
  AF Parent
  I had a long two-hour conversation with Ryan yesterday about this, and at least my sense was that he is thinking about it as “squeeze useful work out of clearly misaligned models”.
  He also thought other people should probably work on making it so that if we see this behavior we stop scaling and produce legible evidence of scheming to facilitate a good societal response, but my sense is that he was thinking of that as separate from the work he was pointing to with the AI control post.
  - ryan_greenblatt 13 Jan 2024 1:22 UTC
    LW: 17 AF: 9
    9
    AF Parent
    Being able to squeeze useful work out of clearly misaligned models should probably be part of the portfolio. But, we might more minimally aim to just ensure that in cases where we’re unsure if our model is scheming (aka deceptively aligned), it is at least directly safe (e.g. doesn’t escape even if the work we get out is somewhat shitty).
    
    I would describe control as trying to maximize E[utility] by decomposing it into p(scheming) · E[utility|scheming] + p(not scheming) · E[utility|not scheming] and evaluating E[utility|scheming] by using control evaluations. Control evaluations are capability evaluations: what’s the worst thing the AI is capable of doing? E[utility|not scheming] can be evaluated with more mundane evaluations (as non-scheming AIs won’t intentionally sabotage evaluations, basically by definition).
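    
    Written out as a sketch, this decomposition is the law of total expectation. The symbols U (utility) and S (the event that the model is scheming) are shorthand introduced here for illustration, not notation from the AI control post:
    
    ```latex
    \mathbb{E}[U]
      = p(S)\,\mathbb{E}[U \mid S]
      + \bigl(1 - p(S)\bigr)\,\mathbb{E}[U \mid \neg S]
    ```
    
    Control evaluations target the first conditional term, E[U | S]; more mundane evaluations target the second, E[U | not S].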
    
    So, we could go with a catch-then-shut-down plan where we mostly just aim to ensure direct safety given scheming and shut down if we detect scheming.
    
    But, we might also need to get useful work in the case where our model is scheming. So even if we catch it doing clearly misaligned stuff, we might need to keep going.