Near- and medium-term AI Control Safety Cases

This essay was part of my application to UKAISI. Unclear how much is novel, but I like the decomposition.

Some AI-enabled catastrophic risks involve an actor purposefully bringing about the negative outcome, as opposed to multi-agent tragedies where no single actor necessarily desired it. Such purposeful catastrophes can be broken down into a conjunction: they materialize if and only if some actor has the capabilities and disposition required to materialize them. Let’s additionally focus on misalignment (actor = AIs) instead of misuse (actor = humans).^[1]

A way to prevent such risks is to ensure one of the two conjuncts doesn’t hold. And for each conjunct, this can be achieved either by raising the required threshold for catastrophe (of capabilities or disposition), or by lowering the AIs’ amount (of capabilities or disposition).

	…capabilities	…disposition
Raising catastrophic threshold of…	AI Control and Monitoring	Speculative “trading with AIs” approaches
Lowering AIs’ amount of…	Slowing down, not the focus of safety cases	Traditional Alignment research

Redwood’s AI Control agenda aims to raise the capabilities AIs would need to enact a catastrophic outcome (even if they were trying to), by limiting or monitoring the AIs’ actions, communication or written memory. This is no different from how a human closely surveilled by a policeman would have a much harder time enacting a crime than a non-surveilled one, but AI Control brings surveillance to an extreme only made possible by the AIs’ digital nature. At the same time, the monitoring protocol can’t be so invasive as to drastically reduce the amount of useful-to-humans cognitive work completed by the AIs (we need a small hit on beneficial capabilities, a low “control tax”).
While an astoundingly low number of researchers are explicitly working on the agenda (less than 10 FTEs at time of writing), it holds promise to be the “next big thing” in near-term safety after vanilla capability evals. Indeed, it is more well-scoped than work on alignment, and already enjoys some preliminary buy-in from labs, partly thanks to its synergy with existing security work motivated by less speculative risks like misuse.

Control could feed into a near-term safety case straightforwardly, providing a worst-case bound on catastrophic capabilities. A blue-team would design the monitoring protocol, which tries to make it hard for the AIs to enact a certain proxy negative outcome (like weight exfiltration). A red-team then introduces any (current-generation) AIs they want as the participants in the protocol. In particular, they can prompt or fine-tune them to elicit certain capabilities, or follow a certain plan. Say the blue-team succeeds, and additionally we have sufficient evidence for the following claims:

A. Adequate experimental setting. Our setting is not harder than real deployment. If some AIs can fool monitoring and enact a negative outcome during real deployment, then they can also do so during our experiments.

B. Adequate red-team. Our red-team was capable enough. If current-generation AIs could conceivably win in our experiments, then the read-team would have achieved this.

These claims together are a strong case against catastrophe from misalignment. (Although the full case for deployment would need to address other risks, like external attacks.)

Let’s zoom in on (A) Adequate experimental setting. It’s unclear how challenging this will be in practice. Bridging the gap between experiment and reality might seem difficult, given the size and complexity of the backend codebases labs use for deployment. But it actually seems plausible that most of the securing of low-level implementation details can be punted to the cyber-security part of the safety case. Those responsible for cyber-security could provide a higher-level abstraction for the working of the whole system, which the control case can treat as the model of real deployment. Indeed, it can be argued that separating this analysis will be warranted, that is, it won’t miss out on important interactions between low-level implementation and the control case. ^[2]^[3]

It doesn’t seem inviable for the control experiments to be run directly on this high-level abstraction of the whole system, thus eliminating any disanalogies other than low-level implementation details. But it’s also the case that, for now, control experiments are only run on very toy environments. Indeed, current research is still aimed at rough first estimates of models’ capabilities, and laying the basic methodology of the field. So it could take more than a year before such experiments can be scaled that ambitiously.
If any safety cases are drafted in the meantime, experiments on simplified settings will need to be used, together with a more frugal analysis of why the real deployment pipeline seems no easier to compromise, which might rely predominantly on arguments from common sense or human analogy. It’s conceivable that even these simplified experiments already provide most of the value from control research.

In the medium term, closing this gap (making experiments more similar to deployment) could come with associated risks of its own, especially when the experiments require more surface area with the outside world. This could be the case when emulating external deployment, or internal tasks which require access to outside information (like the Internet) or dense feedback from reality (like a wetlab). It seems plausible that, for the purposes of control experiments, all these tasks can be simply replaced by a local safe equivalent of similar complexity, or failing that, have the required feedback from the outside world emulated. But this could stop being the case at higher levels of capabilities, say if we want to rule out dangerous strategies made possible only by the specific task, and its complexity is high enough that emulating feedback is not possible (even in the low amounts that these experiments would usually require).
In this more speculative scenario, we could imagine requiring a preliminary case for the guaranteed safety of these experiments, motivated either solely on the grounds of the secured experimental setup, or also by empirical evidence from even more conservative experiments. That said, this might be straining the limits of the control methodology and external audits beyond what’s optimal, and hopefully by then better alternatives through safe delegation will be available.

A remaining elephant in the room is that internal and external deployment might need to be addressed separately. Conceptually, control safety cases can be used for both: the bounds on damage from worst-case model behavior still hold, regardless of whether that behavior comes about by spontaneous misalignment, or jailbreaking from user messages. But in practice, their software architecture and setup could look very different. For example, the one for external deployment might be more optimized to protect the external channel from the higher adversariality of arbitrary users relative to lab employees. As a result, they might require totally different control experiments and cases.

A vital piece for (A) is the guarantee that the lab will implement the protocol as promised throughout deployment. Hopefully the “control tax” will be low enough, and good faith from labs patent enough, that this assumption is not seriously in doubt. But even then, it would make sense for regulation to include an explicit mechanism for continual oversight, as is already the case in other industries. Recurring external audits are the obvious measure, which of course still relies on some amount of transparency from the lab. These checks for the usage of the protocol would make most sense as part of more general cyber-security audits, which can keep guaranteeing that the provided high-level abstraction of the system corresponds to reality.^[4]

Internal deployment would seem especially hard to track. It might suffice to demonstrate that most employees only have access to the model through an API hosting the required protocol, and those with additional permissions (doing either training, or research which requires weight access) only use it in adequately secure machines, and for tasks or amounts of cognitive work that we have reason to believe are safe. The required on-the-ground oversight could be enacted by lab employees reporting to external oversight, building a continuous oversight hierarchy between the outside and inside of the lab. But this could also be especially vulnerable to manipulation. If we consider even more explicitly adversarial measures, record-keeping and cryptographic procedures could be used to provide legible and private guarantees on the execution of the protocol.

Similarly to how nuclear plants need to demonstrate their safety under appropriately-common natural catastrophes, labs could also demonstrate, through their cyber-security cases and continual audits, the likelihood of a negative outcome (most notably, weight exfiltration) under several unexpected but possible situations, like outages or partly successful cyber-attacks. Realistically, it might not be possible to say much about safety once any part of the deployment pipeline has been compromised. As an emergency measure, we could consider the momentaneous shutdown of the whole pipeline, or even deletion of certain copies of the weights. But we would need explicit specifications for when such measures would be taken. Additionally, such a drastic solution might no longer be viable if AI services become critical infrastructure, on which other systems depend and which should not go offline during an emergency. But even here we can imagine workarounds through redundancy.

Going back to the safety agenda itself, the main limitation of control is its scalability: as early discussions of “AI boxing” already made clear, we should expect arbitrarily intelligent entities to eventually fool our imperfect protocols. But several avenues can be explored to keep control cases viable at least in the medium term.
For one, even when models small enough to be trusted (that is, we have good evidence they are not purposefully misleading us) are no longer intelligent enough to monitor the current generation, we can employ untrusted models as monitors, if we can show that the monitors and monitorees are not capable of enough coordination to cause a catastrophe. Research is getting started on this topic, which might require two isolated red-teams in a safety case.
Or also, control cases could be augmented with tools from other safety agendas. For example, gray-box methods from interpretability like probes could be trained on top of untrusted models to coarsely keep track of their internal reasoning, and the red-team could stress-test whether the protocol and probe can be fooled at the same time.

^
We don’t have a clear technical definition of the capability/disposition divide, but for our purposes it will be enough to define that “a model is capable of X” if a team of humans can get it to do X under enough iteration (without re-training too much or hard-coding the solution to X), and define “a model is disposed to do X” if it will do X on some “normal” scenarios it will face (without such external iteration). Of course, this distinction will break for high enough capabilities, for example when humans are no longer the best at eliciting behavior.
^
Still, running the final experiments on an even higher-fidelity representation, or prominently the real deployment pipeline, shouldn’t be discarded if it’s cheap enough and made safe (see below).
^
In fact, a similar approach would seem necessary for the safety cases of most other safety agendas.
^
In fact, the control tax could even be negative if control techniques were also useful to avoid low-stakes failures damaging to the lab’s public image, but we shouldn’t assume these use cases can’t be separated.