Here is a question closely related to the feasibility of finding discriminating-reasons (cross-posted from Facebook):
For some circuits C it’s meaningful to talk about “different mechanisms” by which C outputs 1.
A very simple example is C(x) := A(x) or B(x). This circuit can be 1 if either A(x) = 1 or B(x) = 1, and intuitively those are two totally different mechanisms.
A more interesting example is the primality test C(x, n) := (x^n = x (mod n)). This circuit is 1 whenever n is prime, but it can also be 1 “by coincidence”, e.g. if n is a Carmichael number. (Such coincidences are rare and look nothing like n being close to prime.)
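As a concrete illustration, here is a minimal Python sketch of this circuit (the function name is illustrative; 561 = 3 · 11 · 17 is the smallest Carmichael number):

```python
def fermat_test(x: int, n: int) -> bool:
    """The circuit C(x, n): return True iff x^n = x (mod n)."""
    return pow(x, n, n) == x % n

# A prime passes for every base x (Fermat's little theorem):
assert all(fermat_test(x, 101) for x in range(2, 101))
# ...but the Carmichael number 561 also passes for every base x,
# even though it is composite:
assert all(fermat_test(x, 561) for x in range(2, 561))
# An ordinary composite is caught almost immediately:
assert not fermat_test(2, 15)  # 2^15 mod 15 = 8 != 2
```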
In every case I’m aware of where there are two clearly distinct mechanisms, there is also an efficient probabilistic algorithm for distinguishing those mechanisms (i.e. an algorithm for distinguishing cases where C(x) = 1 due to mechanism A from cases where C(x) = 1 due to mechanism B). I am extremely interested in counterexamples to that general principle.
For example, a priori it could have turned out that (x^n = x (mod n)) is a good probabilistic primality test, but that there is no efficient probabilistic test for distinguishing primes from Carmichael numbers. That would have been a convincing counterexample. But it turns out that testing primality is easy, and in fact a simple tweak to this very test keeps it from being fooled by Carmichael numbers. But is that an incidental fact about number theory, or once we found a probabilistic primality test was it inevitable that it could be strengthened in this way?
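One standard such strengthening is the Miller–Rabin test: while computing x^(n-1) mod n by repeated squaring, also check that we never pass through a nontrivial square root of 1. A minimal sketch:

```python
import random

def miller_rabin(n: int, rounds: int = 20) -> bool:
    """Fermat test strengthened by checking for nontrivial square roots
    of 1 along the repeated-squaring path; no composite (Carmichael or
    otherwise) survives a round with probability better than 1/4."""
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    # Write n - 1 = d * 2^s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        x = random.randrange(2, n - 1)
        y = pow(x, d, n)
        if y in (1, n - 1):
            continue
        for _ in range(s - 1):
            y = pow(y, 2, n)
            if y == n - 1:
                break
        else:
            return False  # x is a witness that n is composite
    return True

assert miller_rabin(101) and not miller_rabin(561)
```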
Here are some other illustrative cases:
Suppose that C(x) uses x to initialize a 1000 x 1000 square in the middle of a 10,000 x 10,000 Game of Life grid. We then simulate it for a million steps, and C(x) = 1 if any cell on the rightmost edge of the grid is ever alive. It’s very easy to look at the grid and distinguish the cases where C(x) = 1 because a glider is created that heads to the right side of the grid from the much rarer cases where C(x) = 1 for any other reason (e.g. a middleweight spaceship).
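A scaled-down sketch of this circuit, with parameters shrunk far enough to actually run (the grid is clipped at the boundary, and all names and constants here are illustrative):

```python
from collections import Counter
import random

def life_step(live):
    """One Game of Life step on a sparse set of live (row, col) cells."""
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {cell for cell, k in neighbor_counts.items()
            if k == 3 or (k == 2 and cell in live)}

def C(x, size=200, seed_side=20, steps=500):
    """x seeds a square in the middle of the grid; C(x) = 1 if the
    rightmost column is ever alive within the step budget."""
    rng = random.Random(x)
    off = (size - seed_side) // 2
    live = {(off + r, off + c)
            for r in range(seed_side)
            for c in range(seed_side)
            if rng.random() < 0.5}
    for _ in range(steps):
        if any(c == size - 1 for (_, c) in live):
            return True
        live = {(r, c) for (r, c) in life_step(live)
                if 0 <= r < size and 0 <= c < size}
    return False
```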
Suppose that X(x) is a pseudorandom sparse n x n matrix over some large finite field, and suppose that X is sparse enough that 1% of the time there is no perfect matching at all (i.e. no permutation sigma such that X[i, sigma(i)] != 0 for all i = 1, …, n). Define C(x) := (det(X(x)) = 0). We can distinguish the common case where det(X) = 0 because X has no perfect matching from the extremely rare case where det(X) = 0 because multiple perfect matchings contribute to the determinant and happen to cancel out exactly. These two cases are easy to distinguish by calculating det(X’) for another random matrix X’ with the same sparsity pattern as X, as sketched below. (Thanks to Dan Kane for calling my attention to this kind of example, and especially the harder version based on exact matchings.)
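Here is a minimal sketch of that discriminator, assuming X is given as a dense list of lists over F_p (the prime P and the helper names are illustrative choices):

```python
import random

P = 2**61 - 1  # a large prime, standing in for "some large finite field"

def det_mod_p(M, p=P):
    """Determinant over F_p via Gaussian elimination, O(n^3)."""
    M = [row[:] for row in M]
    n = len(M)
    det = 1
    for i in range(n):
        pivot = next((r for r in range(i, n) if M[r][i] % p), None)
        if pivot is None:
            return 0
        if pivot != i:
            M[i], M[pivot] = M[pivot], M[i]
            det = (-det) % p  # a row swap flips the sign
        det = det * M[i][i] % p
        inv = pow(M[i][i], p - 2, p)  # inverse via Fermat's little theorem
        for r in range(i + 1, n):
            f = M[r][i] * inv % p
            for c in range(i, n):
                M[r][c] = (M[r][c] - f * M[i][c]) % p
    return det

def looks_like_cancellation(X, trials=5, p=P):
    """Given det(X) = 0, guess which mechanism is responsible: resample
    every nonzero entry uniformly from F_p, keeping the sparsity pattern.
    If the pattern admits a perfect matching, a resampled determinant is
    nonzero with probability >= 1 - n/p (Schwartz-Zippel), so det(X) = 0
    was a cancellation coincidence; with no perfect matching, every
    resample gives 0."""
    n = len(X)
    for _ in range(trials):
        Y = [[random.randrange(1, p) if X[i][j] else 0 for j in range(n)]
             for i in range(n)]
        if det_mod_p(Y, p):
            return True   # cancellation among multiple matchings
    return False          # no perfect matching (w.h.p.)
```

The reason this works: det(X) is a polynomial in the free entries of the sparsity pattern, and that polynomial is identically zero exactly when the pattern has no perfect matching, so a single random evaluation separates the two mechanisms with high probability.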
Suppose that C_0(x) := A(x) or B(x), and that C(x) is an obfuscated version of C_0. Then there is an efficient discriminator: de-obfuscate the circuit and check whether A or B is true. Finding that discriminator given C is hard, but that’s not a violation of our general principle. That said, I would also be interested in a slightly stronger conjecture: not only is there always a discriminator, but it can always be specified using roughly the same number of bits required to specify the circuit C. That’s true in this case, because the circuit C needs to bake in the secret key for the obfuscation, and so requires more bits than the discriminator.
If there don’t exist any convincing counterexamples to this principle, then I’m also very interested in understanding why—right now I don’t have any formal way of talking about this situation or seeing why discrimination should be possible in general. One very informal way of phrasing the “positive” problem: suppose I have a heuristic argument that C(x) often outputs 1 for random inputs x, and suppose that my heuristic argument appears to consider two cases A and B separately. Is there a generic way to either (i) find an efficient algorithm for distinguishing cases A and B, or else (ii) find an improved heuristic argument that unifies cases A and B, showing that they weren’t fundamentally separate mechanisms?