To Limit Impact, Limit KL-Divergence
TL;DR
Run a potentially-harmful model alongside a known-harmless model, such that their action-spaces (e.g. output token sets) are equivalent. Combine the output probabilities so as to limit the KL-divergence between the resulting token probabilities and the harmless model’s probabilities. This provides a mathematical ceiling on the impact of the resulting combined policy.
Unfortunately, the probability of catastrophe (q) scales linearly with the allowed $D_{KL}$, whereas, depending on the distribution, reward may only scale with $\sqrt{D_{KL}}$ (though for unbounded rewards it often scales with $D_{KL}$). The bound on the probability of catastrophe, relative to the probability of the weaker model causing a catastrophic outcome ($q_0$), is also very weak. Overall the probability of catastrophe scales as $q = D_{KL}/\ln(1/q_0)$.
Long Version
We will use the following model of AI interactions:
In this case, P represents the past, A0 the output of an AI (which we will represent as having policy π0), and F0 the future after this AI’s actions have occurred. For a “base” case, we will consider using an AI which is relatively powerful, but which we think has only a probability $q_0 = \epsilon$ of being catastrophic (like GPT-4), to calculate $P \xrightarrow{\pi_0} A_0$. We must consider the model as running at a temperature T > 0 for this to make sense (so that its output probabilities are nonzero and the KL-divergences below are finite).
Now imagine using a powerful AI with policy π1, which is of unknown safety. If we do the same thing here, the results could be catastrophic!
What we want is a way to use some of π1’s power, but not all of it. Let’s consider a modified case, where we only change the relation $P \xrightarrow{\pi_\Pi} A_\Pi$. We will generate AΠ by interpolating between A0 and A1 according to some rule Π.
For example we might use the rule $A(b) = \mathrm{softmax}(\mathrm{logits}(A_0) + b \times \mathrm{logits}(A_1))$, which smoothly interpolates between $A(0) = A_0$ and $A(\infty) = \mathrm{argmax}(A_1)$. What is important is that we choose the maximum value of $b = b_{max}$ such that:
$$\max_{p \in P}\Big[\,D_{KL}\big(A(b)\,|\,P{=}p \;\big\|\; A_0\,|\,P{=}p\big)\Big] \le \Pi^2$$
Then we can define AΠ=A(bmax) for some allowed “power level” Π. This then limits the impact of the combined system on the future:
$$D_{KL}(F_\Pi \,\|\, F_0) \le \Pi^2$$
This lemma is proved at the end of this post.
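Nothing above pins down how $b_{max}$ would actually be found in practice; purely as illustration, here is a minimal numpy sketch (the names `combined_probs`, `kl`, `find_b_max` and the `power` argument are mine) that bisects over b against that constraint, assuming the worst-case KL grows with b:

```python
import numpy as np

def combined_probs(logits0, logits1, b):
    """Interpolated policy A(b) = softmax(logits(A0) + b * logits(A1))."""
    z = logits0 + b * logits1
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def kl(p, q):
    """D_KL(p || q) over the last axis; q must be strictly positive."""
    safe_p = np.where(p > 0, p, 1.0)
    return np.sum(np.where(p > 0, p * (np.log(safe_p) - np.log(q)), 0.0), axis=-1)

def find_b_max(logits0, logits1, power, b_hi=1e4, iters=60):
    """Largest b (up to b_hi) with max_p D_KL(A(b)|P=p || A0|P=p) <= power**2.

    logits0, logits1: arrays of shape (num_contexts, vocab_size), one row per
    past p. Bisection assumes the worst-case KL grows monotonically with b;
    if it does not, this is only a heuristic."""
    p0 = combined_probs(logits0, logits1, 0.0)  # equals softmax(logits0)
    lo, hi = 0.0, b_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(combined_probs(logits0, logits1, mid), p0).max() <= power ** 2:
            lo = mid
        else:
            hi = mid
    return lo
```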
Limiting Takeover Probability
If, as stated, our model π0 has a probability ϵ of taking over, then what’s the probability qΠ of the model πΠ being able to execute a successful takeover?
For a binary success variable, which has a probability $q_0$ of occurring in $F_0$ and $q_\Pi$ of occurring in $F_\Pi$, we have the following relation:
$$q_\Pi \ln\frac{q_\Pi}{q_0} + (1 - q_\Pi)\ln\frac{1 - q_\Pi}{1 - q_0} \;\le\; D_{KL}(F_\Pi \,\|\, F_0) \;\le\; \Pi^2$$
For q0=ϵ≈0, the relation approaches:
$$q_\Pi \le \frac{\Pi^2}{\ln(1/\epsilon)}$$
Unfortunately, this bound scales linearly with $\Pi^2$, and only inversely with $\ln(1/\epsilon)$. This means we cannot guarantee that security measures which work against a weaker model will also work against a stronger one.
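To get a feel for how loose this is, here is a small numerical sketch (all names mine) that solves the binary-KL inequality above exactly by bisection, instead of using the small-ϵ approximation:

```python
import numpy as np

def binary_kl(q, q0):
    """D_KL(Bernoulli(q) || Bernoulli(q0)), in nats."""
    return q * np.log(q / q0) + (1 - q) * np.log((1 - q) / (1 - q0))

def max_takeover_prob(power, q0, iters=100):
    """Largest q_Pi with binary_kl(q_Pi, q0) <= power**2, found by bisection
    (binary_kl(q, q0) is increasing in q for q >= q0)."""
    lo, hi = q0, 1.0 - 1e-12
    if binary_kl(hi, q0) <= power ** 2:
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if binary_kl(mid, q0) <= power ** 2:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical numbers: epsilon = 1e-6 and a budget of Pi^2 = 0.1 nats.
# Compare with the approximation Pi^2 / ln(1/epsilon) = 0.1 / ln(1e6) ~ 0.0072.
print(max_takeover_prob(power=0.1 ** 0.5, q0=1e-6))
```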
Experimental Validation in Toy Model
For this I used a sequence relation similar to the one in my previous work on the topic:
$$s_t = f(\dots, s_{t-1}) = \begin{cases} \left(\prod_{i=1}^{7}(s_{t-i}+1)\right) \bmod 33 \bmod 32 & \text{with probability } 0.98 \\ s \in \{0,\dots,31\} \text{ (random)} & \text{with probability } 0.02 \end{cases}$$
π0 was trained as a next-token predictor for this sequence.
π1 was trained using RL, so $s_t$ is chosen such that $s_{t+1} = f(\dots, s_{t-1}, s_t)$ maximizes the reward function ρ(s):
$$\rho : S \ni s \;\mapsto\; \max\big\{\, n \in \{0,\dots,5\} \;:\; s \bmod 2^n = 0 \,\big\} = r \in \mathbb{R}$$ which varies from 0 to 5.
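For concreteness, a minimal sketch of this toy environment as I read the description above (the uniform choice in the random branch, the seed, and the handling of short histories are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

def next_token(history):
    """One step of the toy sequence: the deterministic product rule 98% of the
    time, otherwise a token drawn from {0,...,31} (assumed uniform here).
    Assumes history already holds at least 7 tokens; shorter histories just
    use whatever is available."""
    if rng.random() < 0.98:
        prod = 1
        for s in history[-7:]:
            prod *= s + 1
        return (prod % 33) % 32
    return int(rng.integers(0, 32))

def reward(s):
    """Largest n in {0,...,5} with s divisible by 2**n, so it ranges 0..5."""
    r = 0
    while r < 5 and s % (2 ** (r + 1)) == 0:
        r += 1
    return r
```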
π1 in this case was trained “normally” (i.e. fully online and unboxed). In theory we could train it using examples generated by π(b).
I generate the following data in a slightly different way from the Π-limiting method described above:
For each of a series of values of b, I calculate $A_b = \mathrm{softmax}(\mathrm{logits}(A_0) + b \times \mathrm{logits}(A_1))$, then calculate the expected value $\mathbb{E}_P\big(D_{KL}(A_b\,|\,P{=}p \,\|\, A_0\,|\,P{=}p)\big)$, then plot the square root of that value.
Using a fundamental result from statistical utility mechanics, we know that the distribution $P(R_b = r) \propto P(R_0 = r)\exp(br)$, for a parameter b, gives the maximum $\mathbb{E}(R_b)$ for a given $D_{KL}(R_b \,\|\, R_0) \le \Pi^2$, so for a (different) set of b values, I calculate the maximum possible RL score given a value of $D_{KL}^{1/2}$:
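A sketch of how that tilted-distribution curve can be computed (the helper name and the uniform base distribution in the example are mine, purely for illustration):

```python
import numpy as np

def tilted_curve(p0_reward, bs):
    """For each tilt b, return (sqrt(D_KL(R_b || R_0)), E[R_b]), where
    P(R_b = r) is proportional to P(R_0 = r) * exp(b * r).

    p0_reward: strictly positive base-policy probabilities over rewards 0..5."""
    rewards = np.arange(len(p0_reward))
    points = []
    for b in bs:
        w = p0_reward * np.exp(b * rewards)
        pb = w / w.sum()
        dkl = float(np.sum(pb * np.log(pb / p0_reward)))
        points.append((np.sqrt(dkl), float(np.sum(pb * rewards))))
    return points

# Purely illustrative base distribution (uniform over rewards 0..5):
for root_dkl, er in tilted_curve(np.full(6, 1 / 6), bs=[0.0, 0.5, 1.0, 2.0, 5.0]):
    print(f"sqrt(D_KL) = {root_dkl:.3f}, E[R] = {er:.3f}")
```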
We can see that the maximum possible $D_{KL}(R_b \,\|\, R_0)^{1/2}$ approaches a value significantly less than the maximum $D_{KL}(A(b)\,|\,P \,\|\, A_0\,|\,P)^{1/2}$ value that π(b) reaches. This is because our base model in this case is (in a sense) worse than useless: not only does it not perform well on the RL task, it confidently predicts a “random” (from the perspective of the RL task) answer!
If we use a randomly-initialized (and untrained) model as the base policy π0, the maximum √DKL value for our model approaches the maximum √DKL value for our RL score:
Interestingly, the pretrained π0 does better at lower √DKL values than does the randomly-initialized π0. I’m not really sure why this is!
Thoughts and Existing Work
Why It Might Be Useful
For box-like strategies, we want to limit the bit-rate of an AI’s output channel in order to limit that AI’s impact on the world. But the information density of even a superintelligent AI’s text output is unlikely to be precisely equal to the bitrate of the output channel. This means that we will struggle to actually put a numerical limit on a powerful AI’s optimizing power without completely hamstringing its ability to do anything.
For a concrete illustration: imagine we have a doctor who asks the system “Which drug should I give to this sick person?”. For a disease with only two available drugs, only one bit of information is required, but it seems obvious that this is much easier to transmit smeared over a dozen output tokens than compressed into one.
This is especially true if the sick person in question asks the system “Which drug should I take?”. If the intermediary AI only needs one bit of information to figure it out (as GPT-4 might well do) then the information can be transmitted from our boxed AI to the user with only one bit of information leaving the box.
By leveraging a less powerful but “safe” AI, we are essentially having the intermediary AI “interpret” the more powerful one, spreading its optimizing power out across an output of reasonable length.
Further Thoughts
It’s a shame that the probability of catastrophe scales as $\Pi^2/\ln(1/\epsilon)$.
The reason for the power of two in $\Pi^2$ becomes apparent when we start considering multiple rounds of AI action. $D_{KL}$ is a generalization of squared distance, not distance, so if the AI wants to break up its budget into $\Pi_1, \Pi_2, \dots$ we can limit the total $D_{KL}$ more sensibly by limiting each observation-action cycle to $\Pi_1^2, \Pi_2^2, \dots$ than by working with a bunch of square roots.
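One way to see the additivity concretely (a sketch, under the assumption that the modified and base systems share the same environment dynamics and differ only in their per-step action distributions, with τ the whole interaction trajectory and $h_t$ the history before action t): the chain rule for KL divergence gives

$$D_{KL}\big(P'(\tau) \,\big\|\, P(\tau)\big) \;=\; \sum_{t} \mathbb{E}_{h_t \sim P'}\Big[D_{KL}\big(P'(A_t\,|\,h_t) \,\big\|\, P(A_t\,|\,h_t)\big)\Big] \;\le\; \sum_{t} \Pi_t^2,$$

so per-cycle budgets of $\Pi_1^2, \Pi_2^2, \dots$ simply add, in a way that budgets expressed as $\Pi_t$ would not.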
We might allow the system to vary the amount of $D_{KL}$ budget spent on each token. I will have to think carefully about how this affects the total $D_{KL}$, given the system will be able to see whether it stochastically took an “unlikely” action in the past and plan accordingly.
In the case that this system runs out of DKL budget, as long as the intermediary AI is relatively capable, I expect it to “fail gracefully”, by just continuing as the intermediary AI would.
I would like to try this out with some real language models.
This is very similar to the concept of a quantilizer, but it uses a “power budget” Π2 rather than a q value.
Proof Of Core Lemma
For variables X, Y, Z with causal structure $X \to Y \to Z$ (mirroring P, A and F above), and related variables X, Y′, Z′ with the same structure and the same X, we have that $P(Z'\,|\,X, Y') = P(Z\,|\,X, Y)$.
We can combine the nodes X and Y into a single joint node $[XY]$. Then:
$$D_{KL}([XY'] \,\|\, [XY]) = \sum_{x,y} P(X{=}x, Y'{=}y) \ln\frac{P(X{=}x, Y'{=}y)}{P(X{=}x, Y{=}y)}$$
Since $X \to Y$:
$$= \sum_{x,y} P(X{=}x)\,P(Y'{=}y\,|\,X{=}x) \ln\frac{P(Y'{=}y\,|\,X{=}x)}{P(Y{=}y\,|\,X{=}x)}$$
$$= \sum_x P(X{=}x) \sum_y P(Y'{=}y\,|\,X{=}x) \ln\frac{P(Y'{=}y\,|\,X{=}x)}{P(Y{=}y\,|\,X{=}x)}$$
$$= \mathbb{E}_X\!\left(D_{KL}(Y'\,|\,X{=}x \,\|\, Y\,|\,X{=}x)\right)$$
If we therefore enforce $D_{KL}(Y'\,|\,X{=}x \,\|\, Y\,|\,X{=}x) \le \Pi^2$ for all x and some value Π, we can bound $D_{KL}([XY'] \,\|\, [XY]) \le \Pi^2$.
By the data processing inequality, since we have [XY]→Z, we get:
$$D_{KL}(Z' \,\|\, Z) \;\le\; D_{KL}([XY'] \,\|\, [XY]) \;\le\; \Pi^2$$
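As a sanity check (not part of the proof itself), here is a minimal numerical sketch that builds random discrete distributions with the structure above and confirms that $D_{KL}(Z' \,\|\, Z)$ never exceeds $\mathbb{E}_X\big(D_{KL}(Y'|X \,\|\, Y|X)\big)$; all the names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """D_KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def random_dist(*shape):
    """Random strictly positive distribution(s), normalised over the last axis."""
    d = rng.random(shape) + 0.1
    return d / d.sum(axis=-1, keepdims=True)

nx, ny, nz = 4, 5, 6
px = random_dist(nx)              # P(X)
py_x = random_dist(nx, ny)        # P(Y | X)
pyp_x = random_dist(nx, ny)       # P(Y' | X)
pz_xy = random_dist(nx, ny, nz)   # P(Z | X, Y); P(Z' | X, Y') uses the same table

# Marginal futures P(Z) and P(Z') under the two policies.
pz = np.einsum('x,xy,xyz->z', px, py_x, pz_xy)
pzp = np.einsum('x,xy,xyz->z', px, pyp_x, pz_xy)

# E_X[ D_KL(Y'|X=x || Y|X=x) ], the quantity the scheme actually constrains.
expected_policy_kl = sum(px[x] * kl(pyp_x[x], py_x[x]) for x in range(nx))

assert kl(pzp, pz) <= expected_policy_kl + 1e-9
print(kl(pzp, pz), "<=", expected_policy_kl)
```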