Does the discriminator get access to the symbolic representation of f, or just its behavior?
If the discriminator only gets access to the behavior of f, I might have a solution. Define g(y,z) = 1 if y = y * z, 0 otherwise. So g(y,z) is 1 if z is 1 or y is 0, which is two different mechanisms.
Pick some way to encode two numbers into one, giving a one-to-one correspondence between pairs (y, z) and numbers x; write E for the map that decodes a number x back into its pair (y, z). Define f1(x) = g(E(x)).
Now pick a cryptographic hash H: something that might as well be SHA-256, except it is a fresh algorithm neither known to nor guessable by the discriminator. Define f(x) = f1(H(x)).
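Here is a minimal Python sketch of the construction, with SHA-256 standing in for the secret hash H and a simple split-the-bits decoding for E; the names and bit widths are just illustrative choices, not part of the argument.

    import hashlib

    BITS = 64  # treat y and z as 64-bit values, so x is a 128-bit number

    def decode_pair(x):
        # E: recover the pair (y, z) from the number x.
        y = x >> BITS
        z = x & ((1 << BITS) - 1)
        return y, z

    def g(y, z):
        # 1 if y == y * z, i.e. if z is 1 or y is 0 (the two mechanisms).
        return 1 if y == y * z else 0

    def f1(x):
        return g(*decode_pair(x))

    def H(x):
        # Stand-in for the secret hash: SHA-256 of x, truncated to 2*BITS bits.
        data = x.to_bytes((x.bit_length() + 7) // 8 or 1, "big")
        return int.from_bytes(hashlib.sha256(data).digest(), "big") >> (256 - 2 * BITS)

    def f(x):
        return f1(H(x))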
Does that do what you want? If not, how does it fail?
f in the original problem stands for the AI’s estimate of “are the humans getting what they want?”, and the two mechanisms are the humans really getting what they want vs something insane and unwanted that manages to look that way. It seems unfair for the humans to encrypt their minds to make this difficult. The AI will have access to brain scans, so the discriminator should have access to the symbolic representation of f, and therefore if we assume an unguessable H we are solving the wrong problem. Agreed?
Another problem is that, for the f I stated, it will be hard for an AI to guess an x such that f(x) = 1. That is easily fixable: redefine g so it clears all but the low 5 bits of y and z before doing the multiplication. Now the AI can find an x by trial and error in 1000-ish trials.
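Continuing the sketch above (the mask width is again an arbitrary choice); since Python resolves the call to g when f1 runs, redefining g here changes the behavior of f as well.

    MASK = 0x1F  # keep only the low 5 bits of y and z

    def g(y, z):
        # Redefined g: mask first, then do the same check as before.
        y &= MASK
        z &= MASK
        return 1 if y == y * z else 0

    def find_positive_input(limit=1000):
        # Trial and error: with pseudo-random hash outputs, a trial succeeds when
        # the masked z is 1 or the masked y is 0 (roughly a 1-in-16 chance), so
        # this should finish well within the 1000-trial budget.
        for x in range(limit):
            if f(x) == 1:
                return x
        return None

    print(find_positive_input())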
We're imagining that the discriminator gets to see the computation done by f. The tricky part of the Fermat test is that even if you see the computation of a^n (mod n), you still can't figure out whether the number was prime or Carmichael.
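To make the analogy concrete, here is a small Python illustration of the Fermat check a^n = a (mod n); 561 = 3 * 11 * 17 is the smallest Carmichael number.

    def fermat_passes(n, bases=range(2, 50)):
        # pow(a, n, n) runs the modular exponentiation; watching that computation
        # does not tell you which mechanism made it come out equal to a.
        return all(pow(a, n, n) == a % n for a in bases)

    print(fermat_passes(557))  # True: 557 is prime
    print(fermat_passes(561))  # True: 561 is a Carmichael number (composite)
    print(fermat_passes(560))  # False: an ordinary composite is caught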
We are also hoping the discriminator will be no harder to learn than f, and so e.g. if gradient descent bakes some tricky secret into f then we are happy baking the same secret into the discriminator (and this is why obfuscated circuits aren't a counterexample).