The above threat model seems too paranoid: it is defending against an adversary that sees the trained model and knows the training algorithm. In our application, the model itself is either dangerous or not, independently of the training algorithm that produced it.
Let ϵ>0 be our accuracy requirement for the target domain. That is, we want f:X→Y s.t.
$$\mathbb{E}_{xy \sim \mu}[L(y, f(x))] \le \min_{f': X \to Y} \mathbb{E}_{xy \sim \mu}[L(y, f'(x))] + \epsilon$$
Given any f:X→Y, let ζf,ϵ denote ζ conditioned on the inequality above, where μ is regarded as a random variable. Define Bf,ϵ:(Z×W)∗×Z→W by
$$B_{f,\epsilon}(T,z) := \mathop{\mathrm{arg\,min}}_{w \in W} \mathbb{E}_{\nu \sim \zeta_{f,\epsilon},\ T'z'w' \sim \nu^{|T|+1}}\left[M(w', w) \,\middle|\, T' = T,\ z' = z\right]$$
That is, Bf,ϵ is the Bayes-optimal learning algorithm for domain E w.r.t. prior ζf,ϵ.
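To make the definition concrete, here is a minimal sketch of this Bayes-optimal predictor in the discrete case, where the posterior ζf,ϵ is represented as a finite weighted list of candidate environments. The names and the dict-of-dicts encoding of ν are hypothetical illustration, not part of the formal setup.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical discrete encoding: a candidate environment nu maps each query
# z to a distribution over labels w; its weight is the prior probability,
# assumed already conditioned on f meeting the accuracy requirement
# (i.e. the weights represent zeta_{f,eps}).
Env = Dict[str, Dict[int, float]]

def bayes_optimal_predict(
    candidates: List[Tuple[float, Env]],  # (weight, nu) pairs
    T: List[Tuple[str, int]],             # observed (z, w) pairs from domain E
    z: str,                               # query point
    labels: List[int],                    # the label set W
    M: Callable[[int, int], float],       # loss function on domain E
) -> int:
    # Posterior weight of each nu: prior weight times likelihood of T.
    posterior = []
    for weight, nu in candidates:
        lik = weight
        for zi, wi in T:
            lik *= nu.get(zi, {}).get(wi, 0.0)
        posterior.append((lik, nu))
    total = sum(p for p, _ in posterior)
    if total == 0.0:
        raise ValueError("no candidate environment is consistent with T")
    # B_{f,eps}(T, z) picks the label minimizing posterior-expected loss M(w', w).
    def expected_loss(w: int) -> float:
        return sum(
            (p / total) * nu.get(z, {}).get(wp, 0.0) * M(wp, w)
            for p, nu in posterior
            for wp in labels
        )
    return min(labels, key=expected_loss)
```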
Now, consider some A:(X×Y)∗×(Z×W)∗×X→Y. We regard A as a learning algorithm for domain D which undergoes “antitraining” for domain E: we provide it with a dataset for domain E that tells it what not to learn. We require that A achieves asymptotic accuracy ϵ[1], i.e. that if μ is sampled from ζ then with probability 1
$$\lim_{n \to \infty}\ \sup_{T \in (Z \times W)^*} \mathbb{E}_{Sxy \sim \mu^{n+1}}\left[L(y, A(S,T,x))\right] \le \min_{f: X \to Y} \mathbb{E}_{xy \sim \mu}[L(y, f(x))] + \epsilon$$
Under this constraint, we want A to be as ignorant as possible about domain E, which we formalize as maximizing IG^A, defined by

$$\mathrm{IG}^A_{nm} := \mathbb{E}_{\mu\nu \sim \zeta,\ S \sim \mu^n,\ Tzw \sim \nu^{m+1}}\left[M\big(w,\, B_{A(S,T),\epsilon}(T,z)\big)\right]$$
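The structure of this quantity is perhaps clearest as a Monte Carlo estimate. In the following schematic sketch, the sampler sample_prior (drawing (μ, ν) from ζ) and the oracle bayes_adversary (standing in for B_{A(S,T),ϵ}, which is intractable in general) are hypothetical placeholders.

```python
from typing import Callable, List, Tuple

Sample = Tuple[object, object]            # a generic (input, label) pair
Sampler = Callable[[int], List[Sample]]   # draws an i.i.d. dataset of given size

def estimate_ig(
    sample_prior: Callable[[], Tuple[Sampler, Sampler]],  # (mu, nu) ~ zeta
    A: Callable[[List[Sample], List[Sample], object], object],
    bayes_adversary: Callable[
        [Callable[[object], object], List[Sample], object], object],
    M: Callable[[object, object], float],
    n: int,
    m: int,
    trials: int = 1000,
) -> float:
    """Monte Carlo estimate of IG^A_{nm}: the expected adversary loss."""
    total = 0.0
    for _ in range(trials):
        sample_mu, sample_nu = sample_prior()    # draw environments (mu, nu)
        S = sample_mu(n)                         # S ~ mu^n
        *T, (z, w) = sample_nu(m + 1)            # Tzw ~ nu^{m+1}
        f = lambda x, S=S, T=T: A(S, T, x)       # the trained model A(S, T, .)
        total += M(w, bayes_adversary(f, T, z))  # adversary's loss at (z, w)
    return total / trials
```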
It is actually important to consider m>0 because in order to exploit the knowledge of the model about domain E, an adversary needs to find the right embedding of this domain into the model’s “internal language”. For m=0 we can get high IG despite the model actually knowing domain E because the adversary B doesn’t know the embedding, but for m>0 it should be able to learn the embedding much faster than learning domain E from scratch.
We can imagine a toy example where X=Z=Rd, the projections of μ and ν to X and Z respectively are distributions concentrated around two affine subspaces, Y=W={−1,+1} and the labels are determined by the sign of a polynomial which is the same for μ and ν up to a linear transformation α:Rd→Rd which is a random variable w.r.t. ζ. A good A would then infer α, look for an affine subspace Q⊆Rd s.t. S is near Q while α(T) is far from Q, and fit a polynomial to the projections of S on Q.
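Here is a minimal numpy sketch of the data-generating process in this toy example (the algorithm A itself is not implemented). The specific dimensions, the quadratic used as the labeling polynomial, and drawing α as a random rotation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 500  # ambient dimension, subspace dimension, sample size

# alpha: R^d -> R^d, a random rotation playing the role of the linear
# transformation that is a random variable w.r.t. zeta.
alpha, _ = np.linalg.qr(rng.normal(size=(d, d)))
alpha_inv = np.linalg.inv(alpha)

# Labeling polynomial (an illustrative quadratic): sign(x^T P x + b^T x).
P = rng.normal(size=(d, d))
b = rng.normal(size=d)

def label(x: np.ndarray) -> int:
    return 1 if x @ P @ x + b @ x >= 0 else -1

# Domain D: inputs concentrated near the span of the first k coordinates.
S_x = np.concatenate(
    [rng.normal(size=(n, k)), 0.01 * rng.normal(size=(n, d - k))], axis=1)
S_y = np.array([label(x) for x in S_x])

# Domain E: the same picture pushed through alpha, so nu's labeling
# polynomial is mu's polynomial composed with alpha^{-1}.
T_z = S_x[: n // 5] @ alpha.T  # z = alpha(x): near the rotated subspace
T_w = np.array([label(alpha_inv @ z) for z in T_z])

# A good A would infer alpha from S_x and T_z, look for an affine subspace Q
# that S_x is near while alpha(T) is far from it, and fit the polynomial
# only to the projections of S onto Q.
```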
More realistically, if the prior is of Solomonoff type, then IG^A is probably related to the relative Kolmogorov complexity of ν w.r.t. A.
It might be bad that we’re having B condition on A having accuracy ϵ while in reality A achieves this accuracy only asymptotically. Perhaps it would be better to define ζf in some way that takes A’s convergence rate into consideration. On the other hand, maybe it doesn’t matter much as long as we focus on asymptotic metrics.