No, inside the box as in inside the encrypted computation: the AI’s architecture, the training setup, whatever regularizers/interpreters we have as part of that, but nothing beyond this.
Well, the idea is that we give the code of the AI architecture and training loop, in encrypted form, as input to e.g. a Lisp interpreter. The interpreter then executes that program without understanding it, i.e., it directly derives the encrypted versions of the program’s next steps. Because the computations are encrypted, we can’t distinguish them from any other encrypted computation, so whatever clever strategy the AI tries in order to break out, it looks no different to us than an AI that isn’t trying to break out at all. Hacking its way out is therefore impossible without breaking the encryption.
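To make the mechanism concrete, here is a minimal sketch using the Paillier cryptosystem as a stand-in for full FHE. (Paillier only supports addition on ciphertexts; a real encrypted interpreter would need an FHE scheme that supports arbitrary circuits. The primes and the toy "program" below are purely illustrative, not secure parameters.) The point it shows is the one above: the evaluator only ever manipulates ciphertexts and never needs the secret key, yet the key holder can decrypt a correct result.

```python
import math
import random

# --- toy key generation (tiny hard-coded primes; illustration only, NOT secure) ---
p, q = 10007, 10009              # small primes chosen for readability
n = p * q
n2 = n * n
g = n + 1                        # standard simple choice of generator
lam = math.lcm(p - 1, q - 1)     # Carmichael's lambda for n (Python 3.9+)
mu = pow(lam, -1, n)             # secret value used during decryption

def encrypt(m: int) -> int:
    """Randomized encryption of an integer 0 <= m < n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    """Recover the plaintext using the secret values lam and mu."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def eval_add(c1: int, c2: int) -> int:
    """The evaluator's step: combine ciphertexts without the key.
    Multiplying Paillier ciphertexts adds the hidden plaintexts."""
    return (c1 * c2) % n2

ct = eval_add(encrypt(20), encrypt(22))   # evaluator sees only large random-looking ints
assert decrypt(ct) == 42                  # only the key holder recovers the result
```

Note that eval_add gets no key and no plaintexts; in the boxing proposal the same would hold for every single step the interpreter takes, which is why the host learns nothing about what the program is doing.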
Another angle: if you think about it from the point of view inside the homomorphically encrypted computation, it’s as if the rest of the world were encrypted. So to cleverly manipulate parts of the outside (e.g., hack the system running it), the AI would need to decrypt them without the secret key, i.e., break the encryption. It can still take over everything inside the encrypted computation, but everything beyond it (starting at the Lisp interpreter) is inaccessible.
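Here is a toy illustration of the "looks no different to us" property, not of FHE itself but of the hiding property its ciphertexts share. I use a one-time pad, which is perfectly hiding, as a stand-in, and the two "strategies" are made-up strings: whichever one is encrypted, what the outside observes is a uniformly random byte string.

```python
import secrets

def otp_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """One-time pad: the ciphertext is uniformly random whatever the plaintext is."""
    assert len(key) == len(plaintext)
    return bytes(b ^ k for b, k in zip(plaintext, key))

# Two hypothetical "strategies" of equal length (equal length avoids a length leak).
benign   = b"run one more SGD step on batch 7 "
scheming = b"probe the hypervisor for exploits"

# Fresh key per message; each ciphertext is a uniformly random byte string,
# so an observer without the keys cannot tell which strategy produced which.
print(otp_encrypt(benign, secrets.token_bytes(len(benign))).hex())
print(otp_encrypt(scheming, secrets.token_bytes(len(scheming))).hex())
```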
Cool! I’ve had a similar idea. You should consider submitting it to superlinear.