That might actually be easy to prove with some effort (or it might not); consider this strategy:
Let’s assume that the inputs to the system are PNG images with 8-bit values between 0 and 255, that they are converted into floating-point tensors before entering the net, and that the bits you are talking about are those of the original images. Let’s also assume that the upscaling from the small MNIST images to the input of the net is such that each float of the tensor corresponds to exactly one value in the original image (that is, there is no interpolation), and that we are using the exact architecture of Table 1 of the EfficientNet paper (https://arxiv.org/pdf/1905.11946v5.pdf).
Then we might be able to train the network with the weights of the first Conv3x3 frozen to the identity (1 in the centre of the kernel and 0s around it), so that the 32 channels of the next layer receive 2x-downscaled copies of the image. We also remove the first batch norm layer; it will be added back later. Losing the ability to perform useful computations in this layer is probably not enough to push accuracy below 99%, given that MNIST is so simple.
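A minimal sketch of what that frozen stem could look like in PyTorch; the standalone `nn.Conv2d` (rather than the full EfficientNet module) and the single input channel for MNIST are assumptions on my part:

```python
import torch
import torch.nn as nn

# Sketch of the frozen stem: an EfficientNet-B0-style 3x3, stride-2 conv with
# 32 output channels. The single input channel is an assumption for MNIST.
stem = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3,
                 stride=2, padding=1, bias=False)

with torch.no_grad():
    stem.weight.zero_()
    stem.weight[:, 0, 1, 1] = 1.0   # 1 in the centre of the kernel, 0s around it
stem.weight.requires_grad_(False)    # keep it frozen for the whole training run

# Each of the 32 output channels is now a 2x-downscaled copy of the input.
```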
If this works, then we need to edit the weights of the first Conv layer and of the batch norm layer (if they can be called weights) to the following values (assuming float32; a PyTorch sketch follows the list):
Central value of kernel: 2**-149
Division value in batch norm: 2**-49
Multiplication value of the batch norm: 2**100
All other values to the neutral element of their operations
(check https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
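For concreteness, here is a sketch of what those edits could look like in PyTorch; the standalone modules and the single input channel are again assumptions for illustration, and `eps` has to be set to 0 so that the division factor is exactly 2**-49:

```python
import torch
import torch.nn as nn

# Sketch of the post-training edits (float32 throughout).
# eps = 0 so that sqrt(running_var + eps) is exactly 2**-49.
conv = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1, bias=False)
bn = nn.BatchNorm2d(32, eps=0.0)

with torch.no_grad():
    conv.weight.zero_()
    conv.weight[:, 0, 1, 1] = 2.0 ** -149   # central value of the kernel
    bn.running_mean.zero_()                 # neutral element of the subtraction
    bn.running_var.fill_(2.0 ** -98)        # sqrt(2**-98) = 2**-49, the division value
    bn.weight.fill_(2.0 ** 100)             # multiplication value (gamma)
    bn.bias.zero_()                         # neutral element of the addition
bn.eval()                                   # use the stored statistics, not batch statistics
```

One caveat about the sketch itself: whether `BatchNorm2d` really performs the division and the multiplication as two separate float32 operations, rather than pre-multiplying them into a single scale factor (which would overflow), is an implementation detail that would need checking.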
This will turn all values of 127 (in the PNG) and below into exactly 0, and values above into exactly 1 because of numerical underflow (I checked this with numpy). The network will process this from the swish activation onwards exactly the same way it would process the binarized versions of the images, and in all likelihood give them the same class.
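For reference, a numpy version of that check might look like the following; the v/255 "ToTensor"-style scaling of the 8-bit values is an assumption on my part:

```python
import numpy as np

# Check the underflow claim for all 256 pixel values, assuming the usual
# v / 255 scaling when the PNG is converted to a float32 tensor.
kernel_centre = np.float32(2.0 ** -149)   # smallest positive float32 subnormal
bn_div        = np.float32(2.0 ** -49)    # "division value" of the batch norm
bn_mul        = np.float32(2.0 ** 100)    # "multiplication value" of the batch norm

for v in range(256):
    x = np.float32(v) / np.float32(255)   # pixel as a float32 in [0, 1]
    x = x * kernel_centre                 # frozen conv: identity kernel scaled down
    x = x / bn_div * bn_mul               # batch norm with mean = 0 and beta = 0
    assert x == (0.0 if v <= 127 else 1.0), (v, x)

print("all 256 pixel values binarise to exactly 0 or exactly 1")
```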
127 in base 2 is 01111111, and 128 is 10000000. You can’t go from a perceived black pixel to a perceived white pixel by only changing the last bit, so the maximum difference in class probability that you can obtain by only changing the least significant bit is zero.
I note that this seems pretty strongly disanalogous to any kind of safety proof we care about, and doesn’t generalise to any reasoning about nonzero sensitivity. That said, your assumptions seem fair to me, and I’d be happy to pay out the prize if this actually works.
Yes, I’m aware of that; I tried to find a better proof but failed. Attempts based on trying to compute the maximum possible change (instead of figuring out how to produce a desired change) are doomed. Changing the last bit isn’t an infinitesimal change, so using calculus to compute the maximum derivative won’t work. EfficientNets use swish activations rather than ReLUs, and swish is not locally linear, so we would have to deal with the chaotic dynamics that show up whenever non-linear functions are applied iteratively. The sigmoid inside the swish does eventually saturate because of floating-point arithmetic, making it effectively locally linear, but then we have to deal with the details of the exact binary representations.
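To illustrate that saturation (the specific inputs below are only illustrative), float32 sigmoid already returns exact 0s and 1s well before the tails matter, and the swish inherits that:

```python
import torch

x = torch.tensor([-200.0, 50.0], dtype=torch.float32)
print(torch.sigmoid(x))              # saturates to exactly 0. and 1. in float32
print(torch.nn.functional.silu(x))   # swish = x * sigmoid(x): the left tail is an exact (signed) zero, the right is exactly x
```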
There might be another solution, though: the softmax is the n-input generalization of the sigmoid, so it can also saturate to exact 0s and 1s. We could try to overfit a network with a very high learning rate so that, for one randomly generated but fixed image, it predicts some class with 100% certainty, and given the same image with some pixel changed it predicts another class, also with 100% certainty. Then, if this works, we could try training it on MNIST, adding the two original images to every mini-batch with a loss multiplier much greater than that of the regular samples. That way the answer to your question becomes 1.
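The same saturation shows up in float32 softmax once the logit gaps are large enough, which is what the overfitting trick would have to produce (the logit values below are only illustrative):

```python
import torch

logits = torch.tensor([[300.0, 0.0, -5.0]], dtype=torch.float32)
probs = torch.softmax(logits, dim=-1)
print(probs)               # tensor([[1., 0., 0.]]): the smaller entries underflow to exact zeros
print(probs[0, 0] == 1.0)  # tensor(True)
```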
If nobody comes up with anything better, when I get some free time I will try to implement the binarization approach and then I will send you the code.
Sure, just remember that an experimental demonstration isn’t enough—“Your proof must not include executing the model, nor equivalent computations”.