The above threat model seems too paranoid: it is defending against an adversary that sees the trained model and knows the training algorithm. In our application, the model itself is either dangerous or not, independently of the training algorithm that produced it.
Let ϵ>0 be our accuracy requirement for the target domain. That is, we want f:X→Y s.t.
$$\mathbb{E}_{xy \sim \mu}[L(y, f(x))] \le \min_{f': X \to Y} \mathbb{E}_{xy \sim \mu}[L(y, f'(x))] + \epsilon$$
Given any f:X→Y, let ζf,ϵ denote ζ conditioned on the inequality above, where μ is regarded as a random variable. Define Bf,ϵ:(Z×W)∗×Z→W by
$$B_{f,\epsilon}(T,z) := \mathop{\mathrm{arg\,min}}_{w \in W} \mathbb{E}_{\nu \sim \zeta_{f,\epsilon},\ T'z'w' \sim \nu^{|T|+1}}\left[M(w', w) \,\middle|\, T' = T,\ z' = z\right]$$
That is, Bf,ϵ is the Bayes-optimal learning algorithm for domain E w.r.t. prior ζf,ϵ.
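To make the definition concrete, here is a minimal sketch of this Bayes-optimal predictor in the discrete case, where the posterior ζf,ϵ is represented as a finite weighted list of candidate environments. The names and the dict-of-dicts encoding of ν are hypothetical illustration, not part of the formal setup.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical discrete encoding: a candidate environment nu maps each query
# z to a distribution over labels w; its weight is the prior probability,
# assumed already conditioned on f meeting the accuracy requirement
# (i.e. the weights represent zeta_{f,eps}).
Env = Dict[str, Dict[int, float]]

def bayes_optimal_predict(
    candidates: List[Tuple[float, Env]],  # (weight, nu) pairs
    T: List[Tuple[str, int]],             # observed (z, w) pairs from domain E
    z: str,                               # query point
    labels: List[int],                    # the label set W
    M: Callable[[int, int], float],       # loss function on domain E
) -> int:
    # Posterior weight of each nu: prior weight times likelihood of T.
    posterior = []
    for weight, nu in candidates:
        lik = weight
        for zi, wi in T:
            lik *= nu.get(zi, {}).get(wi, 0.0)
        posterior.append((lik, nu))
    total = sum(p for p, _ in posterior)
    if total == 0.0:
        raise ValueError("no candidate environment is consistent with T")
    # B_{f,eps}(T, z) picks the label minimizing posterior-expected loss M(w', w).
    def expected_loss(w: int) -> float:
        return sum(
            (p / total) * nu.get(z, {}).get(wp, 0.0) * M(wp, w)
            for p, nu in posterior
            for wp in labels
        )
    return min(labels, key=expected_loss)
```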
Now, consider some A:(X×Y)∗×(Z×W)∗×X→Y. We regard A as a learning algorithm for domain D which undergoes “antitraining” for domain E: we provide it with a dataset for domain E that tells it what not to learn. We require that A achieves asymptotic accuracy ϵ[1], i.e. that if μ is sampled from ζ then with probability 1
$$\lim_{n \to \infty}\ \sup_{T \in (Z \times W)^*} \mathbb{E}_{Sxy \sim \mu^{n+1}}\left[L(y, A(S,T,x))\right] \le \min_{f: X \to Y} \mathbb{E}_{xy \sim \mu}[L(y, f(x))] + \epsilon$$
Under this constraint, we want A to be as ignorant as possible about domain E, which we formalize as maximizing IG^A, defined by

$$\mathrm{IG}^A_{nm} := \mathbb{E}_{\mu\nu \sim \zeta,\ S \sim \mu^n,\ Tzw \sim \nu^{m+1}}\left[M\big(w,\, B_{A(S,T),\epsilon}(T,z)\big)\right]$$
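The structure of this quantity is perhaps clearest as a Monte Carlo estimate. In the following schematic sketch, the sampler sample_prior (drawing (μ, ν) from ζ) and the oracle bayes_adversary (standing in for B_{A(S,T),ϵ}, which is intractable in general) are hypothetical placeholders.

```python
from typing import Callable, List, Tuple

Sample = Tuple[object, object]            # a generic (input, label) pair
Sampler = Callable[[int], List[Sample]]   # draws an i.i.d. dataset of given size

def estimate_ig(
    sample_prior: Callable[[], Tuple[Sampler, Sampler]],  # (mu, nu) ~ zeta
    A: Callable[[List[Sample], List[Sample], object], object],
    bayes_adversary: Callable[
        [Callable[[object], object], List[Sample], object], object],
    M: Callable[[object, object], float],
    n: int,
    m: int,
    trials: int = 1000,
) -> float:
    """Monte Carlo estimate of IG^A_{nm}: the expected adversary loss."""
    total = 0.0
    for _ in range(trials):
        sample_mu, sample_nu = sample_prior()    # draw environments (mu, nu)
        S = sample_mu(n)                         # S ~ mu^n
        *T, (z, w) = sample_nu(m + 1)            # Tzw ~ nu^{m+1}
        f = lambda x, S=S, T=T: A(S, T, x)       # the trained model A(S, T, .)
        total += M(w, bayes_adversary(f, T, z))  # adversary's loss at (z, w)
    return total / trials
```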
It is actually important to consider m>0 because in order to exploit the knowledge of the model about domain E, an adversary needs to find the right embedding of this domain into the model’s “internal language”. For m=0 we can get high IG despite the model actually knowing domain E because the adversary B doesn’t know the embedding, but for m>0 it should be able to learn the embedding much faster than learning domain E from scratch.
We can imagine a toy example where X=Z=Rd, the projections of μ and ν to X and Z respectively are distributions concentrated around two affine subspaces, Y=W={−1,+1} and the labels are determined by the sign of a polynomial which is the same for μ and ν up to a linear transformation α:Rd→Rd which is a random variable w.r.t. ζ. A good A would then infer α, look for an affine subspace Q⊆Rd s.t. S is near Q while α(T) is far from Q, and fit a polynomial to the projections of S on Q.
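Here is a minimal numpy sketch of the data-generating process in this toy example (the algorithm A itself is not implemented). The specific dimensions, the quadratic used as the labeling polynomial, and drawing α as a random rotation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 500  # ambient dimension, subspace dimension, sample size

# alpha: R^d -> R^d, a random rotation playing the role of the linear
# transformation that is a random variable w.r.t. zeta.
alpha, _ = np.linalg.qr(rng.normal(size=(d, d)))
alpha_inv = np.linalg.inv(alpha)

# Labeling polynomial (an illustrative quadratic): sign(x^T P x + b^T x).
P = rng.normal(size=(d, d))
b = rng.normal(size=d)

def label(x: np.ndarray) -> int:
    return 1 if x @ P @ x + b @ x >= 0 else -1

# Domain D: inputs concentrated near the span of the first k coordinates.
S_x = np.concatenate(
    [rng.normal(size=(n, k)), 0.01 * rng.normal(size=(n, d - k))], axis=1)
S_y = np.array([label(x) for x in S_x])

# Domain E: the same picture pushed through alpha, so nu's labeling
# polynomial is mu's polynomial composed with alpha^{-1}.
T_z = S_x[: n // 5] @ alpha.T  # z = alpha(x): near the rotated subspace
T_w = np.array([label(alpha_inv @ z) for z in T_z])

# A good A would infer alpha from S_x and T_z, look for an affine subspace Q
# that S_x is near while alpha(T) is far from it, and fit the polynomial
# only to the projections of S onto Q.
```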
More realistically, if the prior is of Solomonoff type, then IG^A is probably related to the relative Kolmogorov complexity of ν w.r.t. A.
It might be bad that we’re having B condition on A having accuracy ϵ while in reality A achieves this accuracy only asymptotically. Perhaps it would be better to define ζf in some way that takes A’s convergence rate into consideration. On the other hand, maybe it doesn’t matter much as long as we focus on asymptotic metrics.