To Limit Impact, Limit KL-Divergence
TL;DR
Run a potentially-harmful model alongside a known-harmless model, such that their action-spaces (e.g. output token sets) are equivalent. Combine the output probabilities so as to limit the KL-divergence between the resulting token probabilities and the harmless model’s probabilities. This provides a mathematical ceiling on the impact of the resulting combined policy.
Unfortunately, the probability of catastrophe (q) scales linearly with the allowed $D_{KL}$, whereas, depending on the distribution, reward may only scale with $\sqrt{D_{KL}}$ (though for unbounded rewards it often scales with $D_{KL}$). The bound on the probability of catastrophe, relative to the probability of the weaker model causing a catastrophic outcome ($q_0$), is also very weak. Overall the probability of catastrophe scales as $q = D_{KL}/\ln(1/q_0)$.
Long Version
We will use the following model of AI interactions:
In this case, P represents the past, A0 the output of an AI (which we will represent as having policy π0), and F0 the future after this AI’s actions have occurred. For a “base” case, we will consider using an AI which is relatively powerful, but which we think has only a probability $q_0 = \epsilon$ of being catastrophic (like GPT-4), to calculate $P \xrightarrow{\pi_0} A_0$. We must consider the model as running at a temperature T > 0 for this to make sense (so that its output probabilities are nonzero and the KL-divergences below are finite).
Now imagine using a powerful AI with policy π1, which is of unknown safety. If we do the same thing here, the results could be catastrophic!
What we want is a way to use some of π1’s power, but not all of it. Let’s consider a modified case, where we only change the relation $P \xrightarrow{\pi_\Pi} A_\Pi$. We will generate AΠ by interpolating between A0 and A1 according to some rule Π.
For example we might use the rule $A(b) = \mathrm{softmax}(\mathrm{logits}(A_0) + b \times \mathrm{logits}(A_1))$, which smoothly interpolates between $A(0) = A_0$ and $A(\infty) = \mathrm{argmax}(A_1)$. What is important is that we choose the maximum value of $b = b_{max}$ such that:
$$\max_{p \in P}\Big[\,D_{KL}\big(A(b)\,|\,P{=}p \;\big\|\; A_0\,|\,P{=}p\big)\Big] \le \Pi^2$$
Then we can define AΠ=A(bmax) for some allowed “power level” Π. This then limits the impact of the combined system on the future:
$$D_{KL}(F_\Pi \,\|\, F_0) \le \Pi^2$$
This lemma is proved at the end of this post.
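Nothing above pins down how $b_{max}$ would actually be found in practice; purely as illustration, here is a minimal numpy sketch (the names `combined_probs`, `kl`, `find_b_max` and the `power` argument are mine) that bisects over b against that constraint, assuming the worst-case KL grows with b:

```python
import numpy as np

def combined_probs(logits0, logits1, b):
    """Interpolated policy A(b) = softmax(logits(A0) + b * logits(A1))."""
    z = logits0 + b * logits1
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def kl(p, q):
    """D_KL(p || q) over the last axis; q must be strictly positive."""
    safe_p = np.where(p > 0, p, 1.0)
    return np.sum(np.where(p > 0, p * (np.log(safe_p) - np.log(q)), 0.0), axis=-1)

def find_b_max(logits0, logits1, power, b_hi=1e4, iters=60):
    """Largest b (up to b_hi) with max_p D_KL(A(b)|P=p || A0|P=p) <= power**2.

    logits0, logits1: arrays of shape (num_contexts, vocab_size), one row per
    past p. Bisection assumes the worst-case KL grows monotonically with b;
    if it does not, this is only a heuristic."""
    p0 = combined_probs(logits0, logits1, 0.0)  # equals softmax(logits0)
    lo, hi = 0.0, b_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(combined_probs(logits0, logits1, mid), p0).max() <= power ** 2:
            lo = mid
        else:
            hi = mid
    return lo
```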
Limiting Takeover Probability
If, as stated, our model π0 has a probability ϵ of taking over, then what’s the probability qΠ of the model πΠ being able to execute a successful takeover?
For a binary success variable, which has a probability $q_0$ of occurring in $F_0$ and $q_\Pi$ of occurring in $F_\Pi$, we have the following relation:
$$q_\Pi \ln\frac{q_\Pi}{q_0} + (1 - q_\Pi)\ln\frac{1 - q_\Pi}{1 - q_0} \;\le\; D_{KL}(F_\Pi \,\|\, F_0) \;\le\; \Pi^2$$
For q0=ϵ≈0, the relation approaches:
$$q_\Pi \le \frac{\Pi^2}{\ln(1/\epsilon)}$$
Unfortunately, this bound scales linearly with $\Pi^2$, and only inversely with $\ln(1/\epsilon)$. This means we cannot guarantee that security measures which work against a weaker model will also work against a stronger one.
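To get a feel for how loose this is, here is a small numerical sketch (all names mine) that solves the binary-KL inequality above exactly by bisection, instead of using the small-ϵ approximation:

```python
import numpy as np

def binary_kl(q, q0):
    """D_KL(Bernoulli(q) || Bernoulli(q0)), in nats."""
    return q * np.log(q / q0) + (1 - q) * np.log((1 - q) / (1 - q0))

def max_takeover_prob(power, q0, iters=100):
    """Largest q_Pi with binary_kl(q_Pi, q0) <= power**2, found by bisection
    (binary_kl(q, q0) is increasing in q for q >= q0)."""
    lo, hi = q0, 1.0 - 1e-12
    if binary_kl(hi, q0) <= power ** 2:
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_kl(mid, q0) <= power ** 2:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical numbers: epsilon = 1e-6 and a budget of Pi^2 = 0.1 nats.
# Compare with the approximation Pi^2 / ln(1/epsilon) = 0.1 / ln(1e6) ~ 0.0072.
print(max_takeover_prob(power=0.1 ** 0.5, q0=1e-6))
```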
Experimental Validation in Toy Model
For this I used a sequence relation similar to the one in my previous work on the topic:
$$s_t = f(\dots, s_{t-1}) = \begin{cases} \left(\prod_{i=1}^{7}(s_{t-i}+1)\right) \bmod 33 \bmod 32 & \text{with probability } 0.98 \\ s \in \{0,\dots,31\} \text{ (random)} & \text{with probability } 0.02 \end{cases}$$
π0 was trained as a next-token predictor for this sequence.
π1 was trained using RL, so $s_t$ is chosen such that $s_{t+1} = f(\dots, s_{t-1}, s_t)$ maximizes the reward function ρ(s):
$$\rho : S \ni s \;\mapsto\; \max\big\{\, n \in \{0,\dots,5\} \;:\; s \bmod 2^n = 0 \,\big\} = r \in \mathbb{R}$$ which varies from 0 to 5.
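For concreteness, a minimal sketch of this toy environment as I read the description above (the uniform choice in the random branch, the seed, and the handling of short histories are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

def next_token(history):
    """One step of the toy sequence: the deterministic product rule 98% of the
    time, otherwise a token drawn from {0,...,31} (assumed uniform here).
    Assumes history already holds at least 7 tokens; shorter histories just
    use whatever is available."""
    if rng.random() < 0.98:
        prod = 1
        for s in history[-7:]:
            prod *= s + 1
        return (prod % 33) % 32
    return int(rng.integers(0, 32))

def reward(s):
    """Largest n in {0,...,5} with s divisible by 2**n, so it ranges 0..5."""
    r = 0
    while r < 5 and s % (2 ** (r + 1)) == 0:
        r += 1
    return r
```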
π1 in this case was trained “normally” (i.e. fully online and unboxed). In theory we could train it using examples generated by π(b).
I generate the following data in a slightly different way from the Π-limiting method described above:
For each of a series of values of b, I calculate $A_b = \mathrm{softmax}(\mathrm{logits}(A_0) + b \times \mathrm{logits}(A_1))$, then calculate the expected value $\mathbb{E}_P\big(D_{KL}(A_b\,|\,P{=}p \,\|\, A_0\,|\,P{=}p)\big)$, then plot the square root of that value.
Using a fundamental result from statistical utility mechanics, we know that the distribution $P(R_b = r) \propto P(R_0 = r)\exp(br)$, for a parameter b, gives the maximum $\mathbb{E}(R_b)$ for a given $D_{KL}(R_b \,\|\, R_0) \le \Pi^2$, so for a (different) set of b values, I calculate the maximum possible RL score given a value of $D_{KL}^{1/2}$:
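A sketch of how that tilted-distribution curve can be computed (the helper name and the uniform base distribution in the example are mine, purely for illustration):

```python
import numpy as np

def tilted_curve(p0_reward, bs):
    """For each tilt b, return (sqrt(D_KL(R_b || R_0)), E[R_b]), where
    P(R_b = r) is proportional to P(R_0 = r) * exp(b * r).

    p0_reward: strictly positive base-policy probabilities over rewards 0..5."""
    rewards = np.arange(len(p0_reward))
    points = []
    for b in bs:
        w = p0_reward * np.exp(b * rewards)
        pb = w / w.sum()
        dkl = float(np.sum(pb * np.log(pb / p0_reward)))
        points.append((np.sqrt(dkl), float(np.sum(pb * rewards))))
    return points

# Purely illustrative base distribution (uniform over rewards 0..5):
for root_dkl, er in tilted_curve(np.full(6, 1 / 6), bs=[0.0, 0.5, 1.0, 2.0, 5.0]):
    print(f"sqrt(D_KL) = {root_dkl:.3f}, E[R] = {er:.3f}")
```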
We can see that the maximum possible $D_{KL}(R_b \,\|\, R_0)^{1/2}$ approaches a value significantly less than the maximum $D_{KL}(A(b)\,|\,P \,\|\, A_0\,|\,P)^{1/2}$ value that π(b) reaches. This is because our base model in this case is (in a sense) worse than useless: not only does it not perform well on the RL task, it confidently predicts a “random” (from the perspective of the RL task) answer!
If we use a randomly-initialized (and untrained) model as the base policy π0, the maximum √DKL value for our model approaches the maximum √DKL value for our RL score:
Interestingly, the pretrained π0 does better at lower √DKL values than does the randomly-initialized π0. I’m not really sure why this is!
Thoughts and Existing Work
Why It Might Be Useful
For box-like strategies, we want to limit the bit-rate of an AI’s output channel in order to limit that AI’s impact on the world. But the information density of even a superintelligent AI’s text output is unlikely to be precisely equal to the bitrate of the output channel. This means that we will struggle to actually put a numerical limit on a powerful AI’s optimizing power without completely hamstringing its ability to do anything.
For a concrete illustration: imagine we have a doctor who asks the system “Which drug should I give to this sick person?”. For a disease with only two available drugs, only one bit of information is required, but it seems obvious that this is much easier to transmit smeared over a dozen output tokens than compressed into one.
This is especially true if the sick person in question asks the system “Which drug should I take?”. If the intermediary AI only needs one bit of information to figure it out (as GPT-4 might well do) then the information can be transmitted from our boxed AI to the user with only one bit of information leaving the box.
By leveraging a less powerful but “safe” AI, we are essentially having the intermediary AI “interpret” the more powerful one, spreading its optimizing power out across an output of reasonable length.
Further Thoughts
It’s a shame that the probability of catastrophe scales as $\Pi^2/\ln(1/\epsilon)$.
The reason for the power of two in $\Pi^2$ becomes apparent when we start considering multiple rounds of AI action. $D_{KL}$ is a generalization of squared distance, not distance, so if the AI wants to break up its budget into $\Pi_1, \Pi_2, \dots$ we can limit the total $D_{KL}$ more sensibly by limiting each observation-action cycle to $\Pi_1^2, \Pi_2^2, \dots$ than by working with a bunch of square roots.
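One way to see the additivity concretely (a sketch, under the assumption that the modified and base systems share the same environment dynamics and differ only in their per-step action distributions, with τ the whole interaction trajectory and $h_t$ the history before action t): the chain rule for KL divergence gives

$$D_{KL}\big(P'(\tau) \,\big\|\, P(\tau)\big) \;=\; \sum_{t} \mathbb{E}_{h_t \sim P'}\Big[D_{KL}\big(P'(A_t\,|\,h_t) \,\big\|\, P(A_t\,|\,h_t)\big)\Big] \;\le\; \sum_{t} \Pi_t^2,$$

so per-cycle budgets of $\Pi_1^2, \Pi_2^2, \dots$ simply add, in a way that budgets expressed as $\Pi_t$ would not.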
We might allow the system to vary the amount of $D_{KL}$ budget spent on each token. I will have to think carefully about how this affects the total $D_{KL}$, given the system will be able to see whether it stochastically took an “unlikely” action in the past and plan accordingly.
In the case that this system runs out of DKL budget, as long as the intermediary AI is relatively capable, I expect it to “fail gracefully”, by just continuing as the intermediary AI would.
I would like to try this out with some real language models.
This is very similar to the concept of a quantilizer, but it uses a “power budget” Π2 rather than a q value.
Proof Of Core Lemma
For variables X, Y, Z with causal structure $X \to Y \to Z$ (mirroring P, A and F above), and related variables X, Y′, Z′ with the same structure and the same X, we have that $P(Z'\,|\,X, Y') = P(Z\,|\,X, Y)$.
We can combine the nodes X and Y into a single joint node $[XY]$. Then:
$$D_{KL}([XY'] \,\|\, [XY]) = \sum_{x,y} P(X{=}x, Y'{=}y) \ln\frac{P(X{=}x, Y'{=}y)}{P(X{=}x, Y{=}y)}$$
Since $X \to Y$:
$$= \sum_{x,y} P(X{=}x)\,P(Y'{=}y\,|\,X{=}x) \ln\frac{P(Y'{=}y\,|\,X{=}x)}{P(Y{=}y\,|\,X{=}x)}$$
$$= \sum_x P(X{=}x) \sum_y P(Y'{=}y\,|\,X{=}x) \ln\frac{P(Y'{=}y\,|\,X{=}x)}{P(Y{=}y\,|\,X{=}x)}$$
$$= \mathbb{E}_X\!\left(D_{KL}(Y'\,|\,X{=}x \,\|\, Y\,|\,X{=}x)\right)$$
If we therefore enforce $D_{KL}(Y'\,|\,X{=}x \,\|\, Y\,|\,X{=}x) \le \Pi^2$ for all x and some value Π, we can bound $D_{KL}([XY'] \,\|\, [XY]) \le \Pi^2$.
By the data processing inequality, since we have [XY]→Z, we get:
$$D_{KL}(Z' \,\|\, Z) \;\le\; D_{KL}([XY'] \,\|\, [XY]) \;\le\; \Pi^2$$
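As a sanity check (not part of the proof itself), here is a minimal numerical sketch that builds random discrete distributions with the structure above and confirms that $D_{KL}(Z' \,\|\, Z)$ never exceeds $\mathbb{E}_X\big(D_{KL}(Y'|X \,\|\, Y|X)\big)$; all the names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """D_KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def random_dist(*shape):
    """Random strictly positive distribution(s), normalised over the last axis."""
    d = rng.random(shape) + 0.1
    return d / d.sum(axis=-1, keepdims=True)

nx, ny, nz = 4, 5, 6
px = random_dist(nx)              # P(X)
py_x = random_dist(nx, ny)        # P(Y | X)
pyp_x = random_dist(nx, ny)       # P(Y' | X)
pz_xy = random_dist(nx, ny, nz)   # P(Z | X, Y); P(Z' | X, Y') uses the same table

# Marginal futures P(Z) and P(Z') under the two policies.
pz = np.einsum('x,xy,xyz->z', px, py_x, pz_xy)
pzp = np.einsum('x,xy,xyz->z', px, pyp_x, pz_xy)

# E_X[ D_KL(Y'|X=x || Y|X=x) ], the quantity the scheme actually constrains.
expected_policy_kl = sum(px[x] * kl(pyp_x[x], py_x[x]) for x in range(nx))

assert kl(pzp, pz) <= expected_policy_kl + 1e-9
print(kl(pzp, pz), "<=", expected_policy_kl)
```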