I note that this seems pretty strongly disanalogous to any kind of safety proof we care about, and doesn’t generalise to any reasoning about nonzero sensitivity. That said, your assumptions seem fair to me, and I’d be happy to pay out the prize if this actually works.
Yes, I’m aware of that; I tried to find a better proof but failed. Attempts based on computing the maximum possible change (instead of figuring out how to produce a desired change) are doomed: changing the last bit isn’t an infinitesimal change, so using calculus to compute the maximum derivative won’t work. EfficientNets use swish activations rather than ReLUs, and swish isn’t locally linear, so we would have to deal with the chaotic dynamics that show up whenever non-linear functions are iteratively applied. The sigmoid inside the swish does eventually saturate because of floating-point arithmetic, which makes it effectively locally linear, but then we have to deal with the details of the exact binary representations.
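To make the saturation point concrete, here is a minimal NumPy sketch (just an illustration, not part of any proof): in float32, exp(-x) drops below half an ulp of 1.0 once x exceeds roughly 16.64, at which point 1/(1 + exp(-x)) rounds to exactly 1.0 and swish(x) = x·sigmoid(x) becomes bitwise equal to x.

```python
import numpy as np

def sigmoid32(x):
    """Sigmoid evaluated entirely in float32."""
    x = np.float32(x)
    return np.float32(1) / (np.float32(1) + np.exp(-x))

# In float32, 1 + exp(-x) rounds to exactly 1.0 once exp(-x) < 2**-24
# (half the spacing between 1.0 and the next float), i.e. x > ~16.64.
for x in [16.0, 16.6, 16.7, 20.0]:
    s = sigmoid32(x)
    print(f"x={x:5.1f}  sigmoid={s!r}  saturated={s == np.float32(1)}")

# Past the saturation point, swish(x) = x * sigmoid(x) is bitwise equal
# to x, i.e. the activation is exactly linear there.
x = np.float32(20.0)
print(x * sigmoid32(x) == x)
```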
There might be another solution, though: the softmax is the n-input generalization of the sigmoid, so it can also saturate to exact 0s and 1s. We could try to overfit a network with a very high learning rate so that, for one randomly generated but fixed image, it predicts some class with 100% certainty, and for the same image with one pixel changed it predicts another class, also with 100% certainty. Then, if this works, we could try training it on MNIST, adding the two original images to every mini-batch with a loss multiplier much greater than the one used for the regular samples. That way the answer to your question becomes 1.
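If it helps, here is a rough PyTorch sketch of the first step only, using a tiny SiLU (swish) MLP as a stand-in for an EfficientNet; the model, learning rate, pixel perturbation, and number of steps are all placeholder choices, and whether the float32 softmax actually reaches exact 1.0 would have to be checked empirically.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two fixed inputs: a random "image" and a copy with a single pixel changed.
x0 = torch.rand(1, 28 * 28)
x1 = x0.clone()
x1[0, 0] = 1.0 - x1[0, 0]  # flip one pixel

pair = torch.cat([x0, x1])       # batch containing both images
targets = torch.tensor([0, 1])   # force them into two different classes

# Tiny stand-in model with swish (SiLU) activations, not a real EfficientNet.
model = nn.Sequential(nn.Linear(28 * 28, 64), nn.SiLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=5.0)  # deliberately huge lr
loss_fn = nn.CrossEntropyLoss()

for _ in range(2000):
    opt.zero_grad()
    loss_fn(model(pair), targets).backward()
    opt.step()

# Check whether the float32 softmax has saturated to exactly 1.0 on each image.
with torch.no_grad():
    probs = torch.softmax(model(pair), dim=1)
print(probs[0, 0].item() == 1.0, probs[1, 1].item() == 1.0)
```

The second step, mixing this fixed pair into every MNIST mini-batch with a much larger loss weight, is not sketched here.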
If nobody comes up with anything better, I will try to implement the binarization approach when I get some free time and then send you the code.
Sure, just remember that an experimental demonstration isn’t enough—“Your proof must not include executing the model, nor equivalent computations”.