There have been some arguments coming from MIRI that we should be designing AIs that are good at e.g. engineering while not knowing much about humans, so that the AI cannot manipulate or deceive us. Here is an attempt at a formal model of the problem.
We want algorithms that learn domain D while gaining as little as possible knowledge about domain E. For simplicity, let’s assume the offline learning setting. Domain D is represented by instance space X, label space Y, distribution μ∈Δ(X×Y) and loss function L:Y×Y→R. Similarly, domain E is represented by instance space Z, label space W, distribution ν∈Δ(Z×W) and loss function M:W×W→R. The distributions μ,ν are initially unknown and we assume some prior over them: ζ∈Δ(Δ(X×Y)×Δ(Z×W)). The prior involves some correlation between D and E, hence learning about D tends to acquire information about E as well.
A learning algorithm A for D is A:(X×Y)∗→YX (receives a data sample and produces a label prediction function). A learning algorithm B for E has access to knowledge generated by A: B:YX×(Z×W)∗×Z→W. We can now consider zero-sum games of the following form: you choose an algorithm A, the adversary looks at A and chooses an algorithm B, your payoff decreases with your expected loss ELA and increases with the adversary’s expected loss EMB (e.g. it is given by −ELA+αEMB for some parameter α>0). The expected losses are given by
$$EL^n_A := \mathbb{E}_{(\mu,\nu)\sim\zeta,\ S\sim\mu^n,\ (x,y)\sim\mu}\left[L(y,\,A(S)(x))\right]$$

$$EM^m_B := \mathbb{E}_{(\mu,\nu)\sim\zeta,\ S\sim\mu^n,\ T\sim\nu^m,\ (z,w)\sim\nu}\left[M(w,\,B(A(S),T,z))\right]$$
Here n,m∈N are the sample sizes. The interesting case is n≫m or even m=0.
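To make these definitions more tangible, here is a minimal Monte-Carlo sketch of how one could estimate both expected losses in a toy, fully discrete setting. Everything here (the sampler sample_prior, the algorithms A and B, the parameter names) is an illustrative placeholder rather than part of the formalism.

```python
# A minimal Monte-Carlo sketch of EL_A^n and EM_B^m in a toy discrete setting.
# `sample_prior`, `A`, `B`, `L`, `M` are all assumed to be supplied by the user.
def estimate_losses(sample_prior, A, B, L, M, n, m, trials=10_000):
    """sample_prior() -> (mu, nu), where mu() returns (x, y) and nu() returns (z, w)."""
    EL, EM = 0.0, 0.0
    for _ in range(trials):
        mu, nu = sample_prior()          # (mu, nu) ~ zeta
        S = [mu() for _ in range(n)]     # S ~ mu^n
        T = [nu() for _ in range(m)]     # T ~ nu^m
        x, y = mu()                      # fresh test point for domain D
        z, w = nu()                      # fresh test point for domain E
        f = A(S)                         # the predictor we hand to the world
        EL += L(y, f(x))                 # our loss on domain D
        EM += M(w, B(f, T, z))           # adversary's loss on domain E, given access to f
    return EL / trials, EM / trials

def payoff(EL, EM, alpha):
    # the zero-sum payoff -EL_A + alpha * EM_B from the game above
    return -EL + alpha * EM
```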
Here’s a very simple example. Suppose that Y=[0,1], L(y,y′)=(y−y′)2 and ζ is s.t. (i) each x∈X is assigned a persistent label sampled uniformly at random from {0,1}, independently of other instances, and (ii) information about the labels doesn’t help with E but information about the distribution on X does help with E. When we care only about ELA, the best we can do is memorize the samples, i.e. set A(S)(x) to y if (x,y)∈S and set it to 1/2 otherwise. However, this would aid the adversary. Instead, we can set A(S)(x) to y if (x,y)∈S and set it to a coinflip otherwise. Now our loss is somewhat worse (although, for discrete X, it still goes to 0 as n goes to ∞), but the adversary gains no information from us!
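A small code sketch of this example (my own, following the conventions above): the key point is that the coinflip variant’s outputs on unseen instances are distributed exactly like memorized labels, so observing A(S) reveals nothing about which instances appeared in S.

```python
import random

# The memorizing predictor: exact labels on seen instances, 1/2 elsewhere
# (1/2 is the squared-loss-optimal constant prediction for uniform {0,1} labels).
def memorizing_A(S):
    table = dict(S)
    return lambda x: table.get(x, 0.5)

# The "coinflip" predictor from the example: exact labels on seen instances,
# a fresh fair coin elsewhere.  Marginally, a coinflip is distributed exactly
# like a memorized label, so the outputs leak nothing about which x were in S
# (and hence nothing about the distribution on X).
def coinflip_A(S):
    table = dict(S)
    return lambda x: table[x] if x in table else float(random.randint(0, 1))
```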
It is also possible to ignore any knowledge we have about E and just try to design A so as to simultaneously minimize ELA and the mutual information between S and A(S). Going to an even higher level of abstraction, this is similar to the following problem:
Let (L,R,E) be a bipartite graph (L are the left vertices, R are the right vertices, E⊆L×R are the edges) and ζ a distribution on L. Find f:L→R s.t. (i) for any v∈L, (v,f(v))∈E and (ii) if we sample v from ζ then the mutual information between v and f(v) is minimal. That is, we are minimizing the following:
$$I(f) := \mathbb{E}_{v\sim\zeta}\left[\ln\frac{1}{\zeta\left(f^{-1}(f(v))\right)}\right]$$
It would be interesting to understand the computational complexity of this problem (and/or of relaxations when we’re allowed to approximate).
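As a baseline, here is a brute-force solution (exponential in |L|, so only a sanity check rather than an answer to the complexity question): enumerate every f respecting the edges and evaluate I(f) directly.

```python
from itertools import product
from math import log

# Brute-force search for f : L -> R with (v, f(v)) in E minimizing
# I(f) = E_{v~zeta}[ ln 1/zeta(f^{-1}(f(v))) ].  Exponential in |L|.
def min_information_labeling(L_verts, R_verts, E, zeta):
    """E is a set of (v, r) pairs; zeta maps each v in L_verts to its probability."""
    choices = [[r for r in R_verts if (v, r) in E] for v in L_verts]
    best_f, best_I = None, float("inf")
    for assignment in product(*choices):
        f = dict(zip(L_verts, assignment))
        fiber_mass = {}                              # zeta(f^{-1}(r)) for each used r
        for v in L_verts:
            fiber_mass[f[v]] = fiber_mass.get(f[v], 0.0) + zeta[v]
        I = sum(zeta[v] * log(1.0 / fiber_mass[f[v]]) for v in L_verts)
        if I < best_I:
            best_f, best_I = f, I
    return best_f, best_I
```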
Finally, it is interesting to also impose computational complexity constraints on our A (but perhaps not on B: obfuscating the learned representation means the knowledge about E is inaccessible from outside but might still be exploitable by the AI itself). In this case we would split A into a representation space RA, a training algorithm tA:(X×Y)∗→RA and a prediction algorithm pA:RA×X→Y (both of which have to lie in some low complexity class, e.g. P), whereas the signature of B becomes B:RA×(Z×W)∗×Z→W.
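One way to write down this split (my own framing, only meant to fix the type signatures): the adversary is handed the representation r∈RA, never the raw sample S.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Sequence, Tuple, TypeVar

X = TypeVar("X"); Y = TypeVar("Y"); Z = TypeVar("Z"); W = TypeVar("W"); R = TypeVar("R")

@dataclass
class SplitLearner(Generic[X, Y, R]):
    train: Callable[[Sequence[Tuple[X, Y]]], R]    # t_A : (X x Y)* -> R_A, low complexity
    predict: Callable[[R, X], Y]                   # p_A : R_A x X -> Y, low complexity

# The adversary only ever sees the representation, plus its own data for E:
Adversary = Callable[[R, Sequence[Tuple[Z, W]], Z], W]   # B : R_A x (Z x W)* x Z -> W
```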
The above threat model seems too paranoid: it is defending against an adversary that sees the trained model and knows the training algorithm. In our application, the model itself is either dangerous or not, independently of the training algorithm that produced it.
Let ϵ>0 be our accuracy requirement for the target domain. That is, we want f:X→Y s.t.
$$\mathbb{E}_{xy\sim\mu}\left[L(y,f(x))\right] \le \min_{f':X\to Y}\mathbb{E}_{xy\sim\mu}\left[L(y,f'(x))\right] + \epsilon$$
Given any f:X→Y, let ζf,ϵ denote ζ conditioned on the inequality above, where μ is regarded as a random variable. Define Bf,ϵ:(Z×W)∗×Z→W by
$$B_{f,\epsilon}(T,z) := \underset{w\in W}{\operatorname{argmin}}\ \mathbb{E}_{\nu\sim\zeta_{f,\epsilon},\ T'z'w'\sim\nu^{|T|+1}}\left[M(w',w)\ \middle|\ T'=T,\ z'=z\right]$$
That is, Bf,ϵ is the Bayes-optimal learning algorithm for domain E w.r.t. prior ζf,ϵ.
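For finite Z and W and a prior we can sample from, Bf,ϵ can in principle be approximated by rejection sampling. The sketch below is only illustrative; sample_zeta_f_eps (a sampler for ζ conditioned on f being ϵ-optimal) is an assumed primitive, and exact conditioning on T′=T, z′=z is only feasible when the relevant spaces are small.

```python
# Rejection-sampling approximation of the Bayes-optimal adversary B_{f,eps}.
# Only sensible for small discrete Z, W; `sample_zeta_f_eps` is assumed given.
def bayes_optimal_B(sample_zeta_f_eps, T, z, W_labels, M, trials=100_000):
    totals = {w: 0.0 for w in W_labels}
    accepted = 0
    for _ in range(trials):
        nu = sample_zeta_f_eps()                      # nu ~ zeta_{f,eps}
        sample = [nu() for _ in range(len(T) + 1)]    # T' z' w' ~ nu^{|T|+1}
        T_prime, (z_prime, w_prime) = sample[:-1], sample[-1]
        if T_prime != list(T) or z_prime != z:        # condition on T' = T and z' = z
            continue
        accepted += 1
        for w in W_labels:
            totals[w] += M(w_prime, w)                # accumulate posterior expected loss
    if accepted == 0:
        raise ValueError("no samples matched the conditioning event")
    return min(W_labels, key=lambda w: totals[w] / accepted)   # argmin_w E[M(w', w) | T'=T, z'=z]
```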
Now, consider some A:(X×Y)∗×(Z×W)∗×X→Y. We regard A as a learning algorithm for domain D which undergoes “antitraining” for domain E: we provide it with a dataset for domain E that tells it what not to learn. We require that A achieves asymptotic accuracy ϵ[1], i.e. that if μ is sampled from ζ then with probability 1
$$\lim_{n\to\infty}\ \sup_{T\in(Z\times W)^*}\ \mathbb{E}_{Sxy\sim\mu^{n+1}}\left[L(y,A(S,T,x))\right] \le \min_{f:X\to Y}\mathbb{E}_{xy\sim\mu}\left[L(y,f(x))\right] + \epsilon$$
Under this constraint, we want A to be as ignorant as possible about domain E, which we formalize as maximizing IGA defined by
$$IG^{nm}_A := \mathbb{E}_{\mu\nu\sim\zeta,\ S\sim\mu^n,\ Tzw\sim\nu^{m+1}}\left[M\left(w,\ B_{A(S,T),\epsilon}(T,z)\right)\right]$$
It is actually important to consider m>0 because in order to exploit the knowledge of the model about domain E, an adversary needs to find the right embedding of this domain into the model’s “internal language”. For m=0 we can get high IG despite the model actually knowing domain E because the adversary B doesn’t know the embedding, but for m>0 it should be able to learn the embedding much faster than learning domain E from scratch.
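Putting the pieces together, here is a sketch of how IGAnm could be estimated by Monte Carlo in a toy setting; sample_zeta, the antitrained algorithm A and bayes_B (standing in for Bf,ϵ above) are all assumed to be given.

```python
# Monte-Carlo estimate of IG_A^{nm}: the adversary's expected loss against the
# Bayes-optimal B for the trained model A(S, T, .).  Larger means "more ignorant".
def estimate_IG(sample_zeta, A, bayes_B, M, n, m, trials=10_000):
    total = 0.0
    for _ in range(trials):
        mu, nu = sample_zeta()                    # (mu, nu) ~ zeta
        S = [mu() for _ in range(n)]              # S ~ mu^n
        sample = [nu() for _ in range(m + 1)]     # T z w ~ nu^{m+1}
        T, (z, w) = sample[:-1], sample[-1]
        f = lambda x: A(S, T, x)                  # the antitrained model A(S, T, .)
        total += M(w, bayes_B(f, T, z))           # loss of the Bayes-optimal adversary B_{f,eps}
    return total / trials
```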
We can imagine a toy example where X=Z=Rd, the projections of μ and ν to X and Z respectively are distributions concentrated around two affine subspaces, Y=W={−1,+1} and the labels are determined by the sign of a polynomial which is the same for μ and ν up to a linear transformation α:Rd→Rd which is a random variable w.r.t. ζ. A good A would then infer α, look for an affine subspace Q⊆Rd s.t. S is near Q while α(T) is far from Q, and fit a polynomial to the projections of S onto Q.
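A rough numpy sketch of what such an A might look like (restricted, for simplicity, to a 1-dimensional Q; estimate_alpha and the margin are placeholders, and this is only meant to illustrate the shape of the algorithm, not to be a careful implementation):

```python
import numpy as np

def fit_restricted_model(S_x, S_y, T_x, estimate_alpha, deg=3, margin=1.0):
    """S_x: (n, d) D-instances, S_y: (n,) labels in {-1, +1}, T_x: (m, d) E-instances."""
    alpha = estimate_alpha(S_x, T_x)              # hypothetical estimator of the linear map alpha
    center = S_x.mean(axis=0)
    _, _, Vt = np.linalg.svd(S_x - center)        # principal directions of the D-sample
    q = Vt[0]                                     # basis of a 1-dimensional candidate subspace Q
    # require alpha(T) to be far from the affine line center + span(q)
    shifted = T_x @ alpha.T - center
    dist = np.linalg.norm(shifted - np.outer(shifted @ q, q), axis=1)
    if dist.min() <= margin:
        raise ValueError("alpha(T) is too close to Q; choose a different subspace")
    t = (S_x - center) @ q                        # project S onto Q
    coeffs = np.polyfit(t, S_y, deg)              # fit a polynomial to the labels along Q
    return lambda x: np.sign(np.polyval(coeffs, (x - center) @ q))
```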
More realistically, if the prior is of Solomonoff type, then IGA is probably related to the relative Kolmogorov complexity of ν w.r.t. A.
It might be bad that we’re having B condition on A having accuracy ϵ while in reality A achieves this accuracy only asymptotically. Perhaps it would be better to define ζf,ϵ in some way that takes A’s convergence rate into consideration. On the other hand, maybe it doesn’t matter much as long as we focus on asymptotic metrics.