We’re a bit inconsistent with terminology here. We say both “use” and “deploy”:
We think that it’s plausible that AI developers will adopt the policy that if they catch the models trying to escape, they stop using them.
and
You might think that this is an absurdly weak lab ask: “once your model has demonstrably tried to escape, stop deploying it”.
In the context of catch-then-shut-down, we use the term “deploy” to mean the same thing as “use”. So under the catch-then-shut-down plan, the lab no longer uses the model at all once it has been caught.
But there is a possible caveat: safety scientific work which is obviously safe even if the AI is scheming (given realistic bounds on its capabilities). E.g., maybe it’s fine to carefully study the AI’s behavior in the case where it tried to escape, with some precautions around persuasion.
So, when we say “deploy” or “use”, we mean something like “use at all in ways which aren’t careful safety/alignment experiments, evaluations, or tests that are extremely likely to be safe even given that the AI is scheming and we might somewhat underestimate its capabilities”. Perhaps we should clarify this in the post.
Yes.