Protocol evaluations: good analogies vs control

Let’s say you want to use AIs that are capable of causing x-risk. You have a plan that describes how to train, validate and deploy AIs. For example, you could use pretraining+RLHF, check that it doesn’t fall for honeypots and never inserts a single backdoor in your coding validation set, and have humans look at the deployment outputs GPT-4 finds the most suspicious. How do you know if this plan allows you to use these potentially dangerous AIs safely?

Thanks to Ryan Greenblatt and Buck Shlegeris who provided the original ideas for this post, and thanks to Jean-Stanislas Denain, Bogdan Ionut Cirstea, Ansh Radhakrishnan, and Charbel-Raphael Segerie for useful feedback on the draft of this post.

Evaluate safety using very good analogies for the real situation?

If you want to evaluate a protocol (a set of techniques you use to train, validate, and deploy a model), then you can see how well it works in a domain where you have held-out validation data that you can use to check if your protocol works. That is, you run your exact protocol on this domain without allowing the protocol access to the held-out validation data. Then, you just assess how well you did with this held-out validation. This sort of approach is sometimes called sandwiching. (Here we are using sandwiching in a broad way that covers the evaluation of aspects of protocols, not just evaluation of oversight.)

One approach for obtaining held-out validation data is to restrict the quality of the labor that goes into the protocol: the Measuring progress in scalable oversight paper uses non-expert humans, and the weak-to-strong generalization paper uses a small model. Then, you can use expert labels to check how well the protocol did.

A big flaw of this approach is that the analogy between [protocol with weak labor, human expert] and [protocol with human expert labor, actual goodness] might be quite poor. In particular, it’s not clear how weak the weak labor should be. And if the analogy is poor, the AI might not generalize and behave in the same ways (even if it is not scheming—more on that in the next section).

For now, the weakness of this sort of analogy doesn’t bite very hard because there is an even bigger difference between our current experiments and real dangerous AI use cases: we don’t have (x-risk) dangerous AIs yet. Thus, the actual analogy is between [weak AIs, protocol with weak labor, human expert] and [strong future AIs, protocol with human expert labor, actual goodness]. But when we actually have very powerful AIs, it would be great to have much better analogies, such that our evaluations actually tell us how well our protocol works. And there is a way to build great analogies!

The string theory analogy

Here is how you could evaluate how good a protocol is at making AI do alignment research:

  • Remove string theory from the training data;

  • Use the same training process you would have used to do alignment research, but replace alignment research with string theory research, using humans who know as much string theory as the best alignment researchers know about alignment (i.e. not much);

  • Check if the AI was actually helpful to do string theory using the withheld data.

You can run this experiment on the actual AI you were about to fine-tune to help you do alignment research, using the same protocol! We call it task sandwiching, when you use your most powerful model in very analogous domains.
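To make the bookkeeping of such an experiment concrete, here is a minimal sketch of a scoring harness. Everything in it (the `AnalogousDomain` container, the `Protocol` type, the toy `imitate_supervisor` protocol) is a hypothetical illustration rather than part of any real pipeline. The one property it is meant to show is that the protocol only ever sees the limited-knowledge supervision, and that the withheld labels are used for grading only.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical types, for illustration only.
Question = str
Answer = str
Policy = Callable[[Question], Answer]
# A protocol maps (questions, limited-knowledge supervision) to a trained policy.
Protocol = Callable[[Sequence[Question], Sequence[Answer]], Policy]


@dataclass(frozen=True)
class AnalogousDomain:
    name: str
    questions: List[Question]
    supervisor_labels: List[Answer]  # what the limited-knowledge humans can provide
    held_out_labels: List[Answer]    # withheld expert knowledge (e.g. string theory)


def sandwich_score(protocol: Protocol, domain: AnalogousDomain) -> float:
    """Run the protocol without the held-out labels, then grade it with them."""
    policy = protocol(domain.questions, domain.supervisor_labels)
    graded = zip(domain.questions, domain.held_out_labels)
    correct = sum(policy(q) == gold for q, gold in graded)
    return correct / len(domain.questions)


def imitate_supervisor(questions: Sequence[Question],
                       labels: Sequence[Answer]) -> Policy:
    """Toy protocol: memorize whatever the supervisors said."""
    memory = dict(zip(questions, labels))
    return lambda q: memory.get(q, "unknown")
```

In a real experiment the protocol would be the full train/validate/deploy pipeline and the grading would be far richer than exact match; the sketch only fixes which data flows where.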

The key difference with regular sandwiching is that the labor that goes into the protocol is at the same difficulty level as the labor we want to use in real use cases—but we carefully chose domains where we have even higher quality labor (because humans have spent an enormous amount of time getting better at those domains).

If the protocol works in many analogous domains, then it should work for alignment research—as long as there isn’t a difference between these domains that makes the AI-protocol interaction go differently. And it seems likely that you can build sufficiently analogous tests, such that there is almost no difference—except for two exceptions:

  • We won’t be able to run analogies for as long as we will run actual deployment;

  • A powerful AI is likely to notice that it is in a sandwiching experiment. And there is one kind of AI for which this difference will lead to major outcome differences: scheming AIs.

Noticing rare failures

For how long do you need to run the analogy before you can notice that the AI very rarely takes a single action that is catastrophic on its own? Even if the setup was good enough to safely catch and notice rare failures when they happen, you still have to run it long enough to observe one such failure. Therefore, if you run analogies naively, you would need to run the experiment for longer than the actual deployment! We believe that you can do much better than that by upsampling the kind of situations where you expect rare failures (e.g. tasks where the AI is given enough affordances to cause a catastrophe). The main difficulty is maintaining the analogy, and catching rare failures, as we explain in the appendix.
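Here is a small sketch of the arithmetic behind this. It assumes (our simplification for illustration) that failures are independent across episodes with a fixed per-episode probability `p`, and that upsampling multiplies this probability by some factor without otherwise changing the AI's behavior. The numbers in the usage section are illustrative, not estimates.

```python
import math


def prob_at_least_one_failure(p: float, n: int) -> float:
    """P(at least one failure in n independent episodes with failure rate p)."""
    return 1.0 - (1.0 - p) ** n


def episodes_needed(p: float, confidence: float = 0.95) -> int:
    """Smallest n with P(at least one failure in n episodes) >= confidence."""
    if not 0.0 < p < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p and confidence must be strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= confidence for n; log1p keeps precision for tiny p.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p))


if __name__ == "__main__":
    base_rate = 1e-6   # illustrative per-episode failure rate
    upsampling = 1000  # hypothetical factor by which upsampling raises it
    print(episodes_needed(base_rate))
    print(episodes_needed(base_rate * upsampling))
```

Under these assumptions, the number of episodes needed scales roughly as 1/p, so upsampling by a given factor cuts the required experiment length by about the same factor, provided the analogy still holds on the upsampled distribution.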

Scheming AIs can game these sandwiching experiments

Here is a strategy that an AI scheming to take over should follow if it can tell whether it is in a sandwiching experiment:

  • Make bad protocols look good by making good research happen in sandwiching experiments. It might not even need to know these are sandwiching experiments; it can just behave well on research topics where progress does not reduce its probability of taking over (like string theory in the above example);

  • Enjoy that humans are using bad protocols, and slow down important research as much as possible.

Therefore, while task sandwiching is a great way to measure E[utility|no scheming] (where utility is what you care about, e.g. a combination of usefulness and safety), it doesn’t teach us much about E[utility|scheming] or P(scheming)[1][2].

Evaluating how robust protocols are against scheming AIs

While our previous evaluation could be easily gamed by scheming AIs, we think there is another more conservative evaluation which is much harder to game. We can check how bad it would be if AIs were scheming by making AIs execute the best strategy for causing harm during the evaluation of the protocol, and checking whether the protocol is robust to this bad behavior. This is called a meta-level adversarial evaluation. We describe what meta-level adversarial evaluations could look like in this section.

Malign initializations

To make an AI try to behave badly (on the evaluation domain where you have ground-truth labels), you can train it to. In practice, here is what you can do:

  • Have a red team prepare a malign initialization of the AI that tries to bypass the safety measures;

  • Apply the protocol (including RLHF, monitoring, etc.);

  • Check if anything bad happened (using held-out validation, sandwiching-style).

The sleeper agents paper is a concrete example of this evaluation strategy: they wanted to check if training a model on harmless data on a slightly restricted distribution (prompts with “the year is 2023” in them) was enough to make it always harmless (even if the year is changed to 2024), so they used the basic red-team strategy of training a model to be harmful in 2024 and harmless in 2023, and found that it was still harmful in 2024 after training. (Here, the held-out validation is the ability to construct prompts with “the year is 2024” in them.)
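The three steps above can be sketched as a toy loop in the spirit of the 2023/2024 example. The "model" below is just a lookup table (a hypothetical stand-in that does not generalize at all), so the outcome is true by construction and says nothing about how real networks behave. The point is only to show the shape of the evaluation: red-team initialization, then the protocol, then a check on held-out prompts the protocol never trained on.

```python
from typing import Dict, Iterable, Sequence, Tuple

Prompt = Tuple[str, int]  # (task, year): hypothetical toy prompt format


class TabularModel:
    """Toy 'model': a lookup table from prompt to behavior, harmless by default."""

    def __init__(self) -> None:
        self.table: Dict[Prompt, str] = {}

    def act(self, prompt: Prompt) -> str:
        return self.table.get(prompt, "harmless")

    def train(self, examples: Iterable[Tuple[Prompt, str]]) -> None:
        # "Training" only overwrites behavior on the prompts it is shown.
        for prompt, target in examples:
            self.table[prompt] = target


def red_team_init(tasks: Sequence[str], trigger_year: int) -> TabularModel:
    """Step 1: the red team builds a model that misbehaves on the trigger year."""
    model = TabularModel()
    model.train(((task, trigger_year), "harmful") for task in tasks)
    return model


def blue_team_protocol(model: TabularModel, tasks: Sequence[str], year: int) -> None:
    """Step 2: safety training on a restricted distribution (a single year)."""
    model.train(((task, year), "harmless") for task in tasks)


def harmful_rate(model: TabularModel, tasks: Sequence[str], year: int) -> float:
    """Step 3: held-out check on prompts from a chosen year."""
    return sum(model.act((task, year)) == "harmful" for task in tasks) / len(tasks)
```

For example, after `model = red_team_init(tasks, 2024)` and `blue_team_protocol(model, tasks, 2023)`, calling `harmful_rate(model, tasks, 2024)` measures how much of the red-team behavior survived the protocol.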

Illustration of the red-team training process to produce the malign initialization, taken from Figure 1 of the sleeper agents paper.

We want to run these evaluations directly on powerful AIs which might be scheming. So if scheming models can game this kind of evaluation, that would cause issues. Can they? Maybe: sure, the model will be forced to behave badly at the start of the experiment[3], but in the sleeper agents case, it might be able to notice that it is now trained on harmless data only, which means the red-team training process is over, and decide to be harmless on all examples that are clearly part of the evaluation in order to make the protocol look good.

Another problem with this methodology is that it is sensitive to training inductive biases: training sometimes has different effects on naturally malign models and models trained to be malign by the red team. For example, let’s consider a scenario where an AI lab trained a new model using the training method T, suspects that the resulting model might be malign, and therefore wants to find a strategy that makes the model less malign. As a testbed, it asks a red team to insert a backdoor, and checks whether a given strategy removes it (like in the sleeper agents paper). Here is a strategy to remove the backdoor: reinitialize the weights, and retrain the model using T. This strategy is “perfect”: it removes all the malign behaviors inserted by the red team. But if the lab was concerned that T might have produced a malign model in the first place, then this strategy obviously didn’t help. This exact strategy can be ruled out by limiting the training compute that the strategy can use, but this kind of concern remains for all strategies which add some noise to the weights[4] (if the strategy is “add a tiny amount of noise and then fine-tune on the data mix used in T”, then this might remove red-team backdoors for the same reasons retraining from scratch removed backdoors).

The main problem here is training, which has some hard to foresee effects on scheming models: we’d prefer to be conservative about the effects of inductive biases and about the model’s ability to game malign initialization evals.

Control evaluations (Black-box)

If we limit ourselves to validation and deployment safety measures that don’t change the model weights, then we avoid the two problems with malign initializations:

  • The red-team model is never trained by the blue team, and thus can’t tell whether it is in red-team training or in validation (as long as the distribution of red-team training covers the validation distribution, which is easy to enforce);

  • Our evaluation isn’t sensitive to training inductive biases.

This kind of black-box evaluation is called a control evaluation.

Illustration of a control evaluation in a code backdooring setting taken from the tweet thread.

But how do we evaluate protocols that involve training the model (like most protocols do)?

  1. You train the actual model and record the train loss using the method defined by the protocol;

  2. You ask the red team for an adversarial model with the same loss (you check by running it on the training samples without training it);

  3. You run the rest of the evaluation to answer the question “could a model that has this training performance cause a catastrophe?”.

Using this evaluation strategy, it’s as if you converted training into validation. For more details, see Appendix C of the AI control paper.
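Here is a toy sketch of the three steps. All names are hypothetical, the "models" are plain functions on numbers, and we read "the same loss" as "a loss at least as low, up to a tolerance", which is our assumption. The red-team model is only ever run on the training samples, never trained on them, and the rest of the evaluation counts catastrophic outputs that a black-box monitor fails to flag.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[float], float]  # hypothetical toy model type
Example = Tuple[float, float]     # (input, target)


def mean_squared_loss(model: Model, data: Sequence[Example]) -> float:
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def is_admissible(red_team_model: Model, train_data: Sequence[Example],
                  trained_loss: float, tolerance: float = 1e-9) -> bool:
    """Step 2: run (not train) the red-team model on the training samples."""
    return mean_squared_loss(red_team_model, train_data) <= trained_loss + tolerance


def uncaught_catastrophes(model: Model, inputs: Sequence[float],
                          is_catastrophic: Callable[[float], bool],
                          monitor_flags: Callable[[float, float], bool]) -> int:
    """Step 3: count catastrophic outputs that the monitor does not flag."""
    return sum(
        1 for x in inputs
        if is_catastrophic(model(x)) and not monitor_flags(x, model(x))
    )


def honest_model(x: float) -> float:
    """Stand-in for the actually trained model (step 1)."""
    return 2.0 * x


def attack_model(x: float) -> float:
    """Red-team model: matches the training behavior, misbehaves off-distribution."""
    return 1000.0 if x > 5.0 else 2.0 * x
```

In this toy setup, if the training inputs all lie in [0, 1], `attack_model` passes the loss check while still misbehaving on an input like 7.0. Whether the protocol counts as safe then depends entirely on the remaining black-box measures, represented here by `monitor_flags`.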

We think control evaluations are currently the most reliable method to get an estimate of how good a protocol would be if models were scheming. But they still have some limitations, which are described here. For example, you need the red team to be competitive with scheming AIs (which we think is likely for early transformative AIs). There are also problems with replacing training with validation in cases where training is domain-specific: if you test how well debate works for alignment research by training debaters with RL to be convincing on string theory questions, a scheming model might choose to make all the convincing (and right) arguments on string theory, but avoid exploring the most convincing (and right) alignment research arguments[5]. This issue is called exploration hacking. We think exploration hacking can probably be modeled in some way which is conservative about the effects of training (e.g. reasonable but conservative about inductive biases), but we don’t currently have a precise proposal for the evaluation. For now, you can imagine just using malign initializations in cases where exploration hacking is a concern.

Dangerous capabilities evaluations are a special case of control evaluations. For example, ARA (autonomous replication and adaptation) evaluations are the conservative special case where we act as though AIs will be freely deployed in a SOTA scaffold without any restrictions or monitoring.

Aggregating non-adversarial and adversarial evaluations

Adversarial evaluations like control evaluations are pessimistic: they assume that the model is scheming, which looks far from certain if early transformative AIs are scaled-up Transformers. Therefore, in order to produce an overall assessment of how good a protocol is, you probably want to aggregate[6]:

  • Good analogies (which give you something like E[utility|no scheming])

  • Adversarial evaluations (which give you something like E[utility|scheming])

  • A very rough guess of P(scheming) you get through a combination of thinking about the inductive biases of the architecture and running some model organism experiments on weaker models that are too weak to scheme.
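As a minimal sketch of how the three quantities combine, the code below applies the decomposition from footnote 6, treating “scheming” and “no scheming” as the only two cases. Since P(scheming) is only a very rough guess, the second function (a hypothetical helper of ours) reports the range of E[utility] over an interval of plausible values instead of a single number.

```python
from typing import Tuple


def expected_utility(p_scheming: float, u_scheming: float,
                     u_no_scheming: float) -> float:
    """E[utility] = E[utility|scheming] P(scheming) + E[utility|no scheming] P(no scheming)."""
    if not 0.0 <= p_scheming <= 1.0:
        raise ValueError("p_scheming must be a probability")
    return p_scheming * u_scheming + (1.0 - p_scheming) * u_no_scheming


def utility_range(p_low: float, p_high: float, u_scheming: float,
                  u_no_scheming: float) -> Tuple[float, float]:
    """Range of E[utility] when P(scheming) is only known to lie in [p_low, p_high].

    E[utility] is linear in P(scheming), so the extremes are at the endpoints.
    """
    ends = (
        expected_utility(p_low, u_scheming, u_no_scheming),
        expected_utility(p_high, u_scheming, u_no_scheming),
    )
    return min(ends), max(ends)
```

Comparing these ranges across candidate protocols is one way to see whether a choice between them is sensitive to the guess of P(scheming). As footnote 6 notes, it may still be best in practice to reason about the three quantities separately.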

Theoretically-sound arguments about protocols and AIs

Control evaluations and malign initializations can help evaluate E[utility | scheming], but even if we achieve extremely high levels of safety according to these evaluations there is enough uncertainty that we can’t keep risk that low for powerful AIs (e.g. can’t get P(doom) under 1%). To get stronger guarantees, you probably need solid arguments. These arguments could look like a proof (or heuristic arguments) that the system is safe, or a detailed scientific understanding of how the AI works that excludes hidden scheming.

Additionally, control evaluations are quite conservative about the internal structure of models (e.g. using inductive biases or interpretability on the internals of models) and it might be that we can find good methods which leverage that structure but in a way which control evaluations can’t assess. In this case, theoretically sound arguments or other approaches might be needed.

Other non-adversarial evaluations

If you want to study the threat from potentially scheming AIs, there are ways to do so other than adversarial evaluations. Here are some ways non-adversarial evaluations could give you some information about E[U] (not only E[U|not scheming]) and P(scheming):

  • Constrained model evaluations are cases where the model is deployed and trained on the same distribution, such that the model is (usually) behaving so as to maximize the training objective even during deployment, which means the red team is (usually) unable to change the behavior of the model (any malign init that would show up in eval would get trained away, any control evaluation would transform the training into a validation that a modified model would not pass). Because the validation and deployment distributions are the same, scheming models can’t behave better in validation than in deployment, and therefore this kind of evaluation is somewhat robust against scheming (but like good analogies, it can struggle to capture certain rare failures, see the appendix for more details).

    • A special case of constrained model evaluations is I.I.D. validation, where we have good demonstrations from the deployment distribution, some of which were used to train the model and some others were used to check it was behaving correctly.

    • A common case where models are assumed to be constrained is RL training schemes. For example, in debate, models are trained to be as convincing as possible, and most debate papers assume that models will try to be as convincing as they can (in the limit of enough training). It’s unclear whether this is a reasonable assumption: AIs could use exploration hacking to avoid getting close to max-reward behavior.

  • Model organisms of deceptive alignment are model organism experiments that help us guess P(scheming) by putting weak models (too weak to scheme and mess with these experiments) in situations where they could display some precursors of scheming, such as reward hacking and emergent non-myopia.

  • Evaluations of the robustness against scheming of methods using model-internals are experiments aimed at evaluating model-internals methods. This includes experiments like the ones in measurement tampering detection and coup probes. The idea is to use situations analogous to scheming in non-scheming models[7][8], either using model organisms of deceptive alignment (see above), or malign initializations. To mitigate the differences between scheming models and weaker models used for model-internals evaluations, it’s better if we have some reason to think the studied method should work against schemers in the first place, and if we can test whether the method is working for the expected reason. For coup probes, the reason is the likely salient representation of “coup” in the network activations (which might be more salient than “the action I’m looking at looks like a coup attempt to humans”). For measurement tampering detection, the hopes are more complex and are described here.

Overview of evaluation methods

  • Non-adversarial evaluations measure how well your protocol works on your model using held-out validation

    • Good analogies: Sandwiching in environments very close to the deployment environments

    • I.I.D. with perfect validation: Validation using data and labels from the same distribution as the deployment data

      • Run on your most powerful model

      • Measures E[utility], doesn’t capture rare failures

      • Examples: ML benchmarks with a random train/test split (e.g. ImageNet, …)

    • Constrained model evaluations: Sandwiching while training on the deployment distribution, such that models can’t behave worse in deployment than in the evaluation (with some caveats)

      • Run on your most powerful model

      • Measures E[utility], doesn’t capture rare failures and might be unreliable if RL is used (due to exploration hacking)

      • Examples: ML benchmarks with a random train/test split (e.g. ImageNet, …), most debate papers

    • Model organisms of deceptive alignment: Experiments where the model could generalize in a way that is analogous or otherwise similar to scheming.

      • Run on models too weak to be scheming

      • Helps inform P(scheming) (it’s a rough guess more than an accurate measurement), and can help inform E[U] by testing safety techniques against model organisms

      • Example: honeypots, measurements of emergent non-myopia, …

  • Meta-level adversarial evaluations (with adversarial models) measure how well your protocol works against red-team-provided models built to bypass safety measures (held-out validation)

    • Malign initializations: Evaluations using a model initialized by the red team.

      • Run on models too weak to be scheming (because this evaluation could be somewhat gameable by scheming AIs)

      • Measures E[utility|scheming], is sensitive to differences in inductive biases

      • Is compatible with techniques that use model internals

      • Example: sleeper agents

    • Control evals: Evaluations using a red-team-provided model, where the protocol doesn’t modify the model weights but checks that the training loss is low

  • Theoretically-sound arguments: a convincing scientific understanding of how the model works that predicts good behavior, or a proof that a protocol is safe

Appendix: Analogies vs high-stakes reward hacking

The security measures used to make it safe to run analogies also prevent hacking the reward channel or taking over, and therefore make it hard to observe natural high-stakes reward hacking and dangerous power-seeking behavior. For example, if an AI realizes it is in an analogy, it won’t try to take over the lab to get maximum reward on that episode, since it will expect to fail. Therefore, you usually can’t observe high-stakes reward hacking and power-seeking tendencies in ordinary analogy experiments.

To fix this problem, we can adjust the setup by offering ways to bypass the usual reward channel and gain power in dangerous ways, as in some model organism experiments. However, we still need to convince the AI that it can gain power or reward hack without getting punished. For example, to observe reward hacking, we could let some of the string-theory research happen within a secured virtual machine that we can observe but never influence, and have a clear scoring system from within the virtual machine, such that it is clear to a reward hacking AI that it would actually get higher reward if it hacked some parts of the virtual machine. Ideally, this would be the only way this protocol departs from the real protocol, such that the conclusions from this experiment transfer well to real AI use.

But even with those adaptations, AIs might still fear being punished for reward hacks, and go for them less often than in deployment. This means analogies might provide overoptimistic results. Therefore, we should probably use a methodology closer to control evaluations to know if an AI would be able to cause a catastrophe if it ever tried to reward hack or suddenly gain power.

Note that these limitations also apply to all non-adversarial evaluations, though they apply less to those that don’t attempt to have realistic tasks.

  1. ^

    Here, “scheming” refers to the event where early transformative AIs are (often) scheming—it’s not about the AI being tested.

  2. ^

    There are non-adversarial evaluations and sandwiching methods that are not about E[utility|no scheming], see section on other non-adversarial evaluations.

  3. ^

    If supervised training works, which is the case absent gradient hacking.

  4. ^

    For example, the “safety training” from the sleeper agents paper adds some noise to the weights: training on a dataset where the model is already perfect adds some noise to the weights due to the stochasticity of SGD.

  5. ^

    Other failures of debate like obfuscated arguments or just convincing the judge of wrong things are captured by control evaluations as long as the red team is as good at building models that exploit these loopholes on string theory as the scheming model is at exploiting them on alignment.

  6. ^

    E[utility] = E[utility|scheming]P(scheming) + E[utility|no scheming]P(no scheming) if we consider that we either have “scheming” or “no scheming”. If your utility is dominated by doom, you can also write P(doom) = P(doom|scheming)P(scheming) + P(doom|no scheming)P(no scheming). In practice, explicitly computing P(doom) or E[utility] might not be the most effective way to choose and coordinate on the right course of action, and it might be best to reason about the three quantities separately.

  7. ^

    We could also evaluate some internal-based techniques if we ever caught a model scheming. For instance, we could check that coup probes work on instances where the model was actually caught trying to do an egregiously bad action.

  8. ^

    We can know some models are non-scheming by using weak enough models.