Proactive ‘If-Then’ Safety Cases

Holden Karnofsky proposed the idea of if-then planning for AI safety. I think this is potentially a very good idea, depending on the implementation details.

I’ve heard the criticism that the If-Then style of planning is inherently reactive rather than proactive. I think this is not necessarily true, and I want to make a case for the value of proactive if-then planning.

Reactive

First, let’s look at reactive if-then planning. I’ll lay out reactive triggers for three critical questions about our near-term future for which we might want to do if-then planning.

Reactive triggers:

- AI Biorisk: If we see empirical proof of substantial harm resulting from bioweapons which provably required AI assistance to create, then we will put governance in place to address this risk going forward. We will seek to reduce the likelihood of this harm happening again.

- Rapid recursive self-improvement: If we see empirical proof that some group was able to implement an AI system which provably self-improved more rapidly than if the group had worked on it directly, and the result of this self-improvement was an advance on state-of-the-art (SotA) frontier algorithms/models, then we will put governance in place to regulate such processes. We will seek to prevent any such processes from occurring outside of carefully secured sandbox environments with government oversight.

- Unauthorized autonomous replication: If we see empirical proof that an AI system not under the control of a human or group is self-sustainingly persisting in an uncontained manner (e.g. it has access to the internet or to robotic actuators), and this AI system has caused significant harms in the world (e.g. breaking laws in pursuit of resources/survival), then we will put governance in place to empower enforcement agencies to halt this AI system and to prevent the existence of such systems in the future.

On the plus side, I think these reactive if-then cases would be a really easy sell for most people and most governments. On the downside, the reactive nature of the chosen triggers means that we wait until harm has occurred before taking action. In all three of these cases, I argue that there is a substantial probability that the first instance of such harm would be an unacceptably large catastrophe.

Proactive

Now let’s consider what proactive if-then planning might look like for each of these three cases. The common factor for proactive triggers is that we are instead asking, “What sort of evidence would convince us that we are in a world where X is a substantial enough problem to be worth paying costs to address?”

Proactive triggers:

- AI Biorisk: If we see empirical proof, in a controlled evaluation context, that an AI system is able to critically enable test subjects to create substantially dangerous bioweapons which their control-group counterparts fail to create, then we will take government action to address this risk. The first step may be to commission further studies, increasing aspects like realism, coverage, sample size, etc. Once the scope of the problem has been quantified, governance actions will be taken to prevent this harm from occurring.

- Rapid recursive self-improvement: If we see empirical proof that some group was able to implement an AI system which demonstrates many of the key skills that would enable rapid automated AI R&D, then we will put governance in place to regulate such processes. This wouldn’t necessarily be a ban, but it would definitely involve careful oversight and restrictions on access to the system.

- Unauthorized autonomous replication: If we see empirical proof that an AI system is able to demonstrate many of the key necessary skills for autonomous survival and replication, as outlined in this post by METR, then we should take action to reduce the risk of such an independent AI system getting loose on the internet.

All of these proactive if-then cases come with a cost: you have to pay, ahead of evidence, for evals to check for the problems. Thus, the cost-benefit analysis must weigh factors such as the following (see the sketch after this list):

  • With the evidence available now, how likely do these problems seem?

  • How much benefit would early detection grant?

  • Related to this: how early could we detect the problems? With what rate of false positives and false negatives?

  • How much improvement in detection do we get for how much expenditure on evals?

  • Overall, does it seem worth trying to develop and use evals for problem X? What should the budget for evals for X be?
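
To make that weighing concrete, here is a minimal expected-value sketch in Python. It is only an illustration of how the questions above might be combined into a single number: the function, its parameter names, and every number in the example are hypothetical placeholders of mine, not estimates from this post, and a real analysis would need much more careful modeling of each factor.

```python
# Toy expected-value sketch of the eval-funding decision described above.
# Every name and number here is an illustrative placeholder, not an estimate
# taken from the post.

def eval_program_value(
    p_risk: float,                # how likely the problem seems, given current evidence
    p_detect: float,              # chance the evals catch a real problem early (true-positive rate)
    p_false_alarm: float,         # chance the evals fire when there is no real problem
    early_action_benefit: float,  # harm averted if we detect early and act
    false_alarm_cost: float,      # cost of acting on a spurious trigger
    eval_budget: float,           # upfront cost of building and running the evals
) -> float:
    """Expected net value of funding an eval program for one risk category."""
    expected_benefit = p_risk * p_detect * early_action_benefit
    expected_false_alarm_cost = (1 - p_risk) * p_false_alarm * false_alarm_cost
    return expected_benefit - expected_false_alarm_cost - eval_budget


if __name__ == "__main__":
    # Example: a risk judged 10% likely, evals that catch it 60% of the time,
    # a 5% false-alarm rate, and benefits/costs in arbitrary units of harm averted.
    value = eval_program_value(
        p_risk=0.10,
        p_detect=0.60,
        p_false_alarm=0.05,
        early_action_benefit=1_000,
        false_alarm_cost=50,
        eval_budget=20,
    )
    print(f"Expected net value of the eval program: {value:+.1f}")
```

Even this toy version makes the structure of the decision visible: the value of an eval program rises with the prior probability of the risk and with detection quality, and falls with the false-alarm burden and with the eval budget itself.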

Note that this is a very different set of questions than one would be asking if early detection weren’t possible at all. A call for evals is different from a call for blindly attempting to ameliorate uncertain future risks through costly actions directly impinging on the process of concern (AI development and deployment).

If you think the case for having the government put resources into preemptively building and running evals for these AI risk cases is so weak that no funding should go towards it, I’d be surprised. Funding such evals feels like a pretty reasonable compromise between the “risks are small, don’t impede AI development” camp and the “risks are large, let’s be careful” camp.

If you are with me so far in this framing of the discussion, and generally in agreement that social (i.e. government and charity) spending on AI-risk-related evals should probably be non-zero, then I’d like to pitch you on an additional risk category that I believe deserves non-zero evals work.

The additional category I propose is perhaps less clear-cut than the three categories mentioned so far.

Tractability of algorithmic improvement: How quickly should we expect it to be possible for AI training costs to drop, and/or AI peak capabilities to increase, without hardware changes? (Improvement that synergizes with hardware changes is a separate question.)

Why is this question of key importance? Because it matters a lot for which governance plans we make. For instance, the recent proposal “A Narrow Path” assumes that large datacenters are necessary for creating dangerously capable AI. The success of the proposal hinges on it being intractable for compute-limited groups to make substantial algorithmic improvements. The authors do acknowledge the risk that a large-scale automated search for algorithmic advances (e.g. a large-scale recursive self-improvement undertaking) could find substantial advances. If there were a smaller, but still considerable, chance that humans (with relatively limited AI assistance) could make dangerous advances without large compute resources, then I argue that their plan wouldn’t work.

Tricky aspects of evals for algorithmic improvement:

The evals themselves would be dangerous to the degree that they were realistic. (This is also true of biorisk evals, but biorisk is different in that it is more purely a weapon, and thus doesn’t come with the temptation to put the results to peaceful, economically useful ends.)

It seems to me that it is harder to develop realistic evals for this than for the other three cases. This means the case for being able to preemptively get firm evidence by building evals is weaker for this risk, and thus the overall predicted cost-benefit ratio is less favorable. I still think there’s work worth doing here, and that it deserves non-zero funding, but I wouldn’t prioritize it above the other cases.