Being able to squeeze useful work out of clearly misaligned models should probably be part of the portfolio. But, we might more minimally aim to just ensure that, in cases where we're unsure whether our model is scheming (aka deceptively aligned), it is at least directly safe (e.g. it doesn't escape, even if the work we get out is somewhat shitty).
I would describe control as trying to maximize E[utility] by decomposing it into P(scheming) · E[utility | scheming] + P(not scheming) · E[utility | not scheming], and evaluating E[utility | scheming] using control evaluations. Control evaluations are capability evaluations: what's the worst thing the AI is capable of doing? E[utility | not scheming] can be evaluated with more mundane evaluations (since non-scheming AIs won't intentionally sabotage evaluations, basically by definition).
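As a toy illustration, here's a minimal sketch of that decomposition in Python. Everything here is hypothetical (the function name, the inputs, and the numbers); in practice, control evaluations would supply the pessimistic estimate of E[utility | scheming] and mundane evaluations the estimate of E[utility | not scheming]:

```python
# Hypothetical sketch of the expected-utility decomposition behind control.
# The names and numbers are illustrative assumptions, not an established API.

def expected_utility(
    p_scheming: float,
    utility_if_scheming: float,      # worst-case estimate from control evaluations
    utility_if_not_scheming: float,  # estimate from mundane usefulness evaluations
) -> float:
    """E[utility] = P(scheming) * E[utility | scheming]
                  + P(not scheming) * E[utility | not scheming]."""
    return (
        p_scheming * utility_if_scheming
        + (1 - p_scheming) * utility_if_not_scheming
    )

# Example: 20% credence in scheming; control evals bound the scheming-case
# utility at -1.0 (dominated by escape risk), while mundane evals put the
# non-scheming case at +1.0.
print(expected_utility(0.2, -1.0, 1.0))  # 0.6
```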
So, we could go with a catch-then-shut-down plan where we mostly just aim to ensure direct safety given scheming, and shut down if we detect scheming.
But, we might also need to get useful work even in the case where our model is scheming. So even if we catch it doing clearly misaligned stuff, we might need to keep going.