I mean, it can’t hack its way out of the box. As long as the training process is encrypted, the AI is screened off from the rest of reality and can’t do anything without breaking the encryption: its computations can’t be externally distinguished from any other homomorphically encrypted computation, so running it is as safe as summing up random numbers.
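To gesture at what “as safe as summing up random numbers” means in practice, here’s a toy sketch using the Paillier cryptosystem (my own illustrative choice: it’s only additively homomorphic, and the parameters here are tiny and insecure, so this is purely to show the shape of the idea). The host can add up encrypted numbers without holding the secret key; to it, the ciphertexts are just opaque group elements.

```python
# Toy Paillier cryptosystem: additively homomorphic, so the host can
# sum ciphertexts without ever seeing the plaintexts.
# (Tiny hard-coded primes, insecure, demo only. Needs Python 3.8+ for pow(x, -1, n).)
import random
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

# Key generation (owner side).
p, q = 293, 433
n = p * q
n_sq = n * n
g = n + 1                    # standard simple choice of generator
lam = lcm(p - 1, q - 1)      # secret key part 1
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)   # secret key part 2

def encrypt(m):
    # c = g^m * r^n mod n^2, with random blinding factor r.
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    # Only possible with the secret key (lam, mu).
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n

def add_ciphertexts(c1, c2):
    # Homomorphic addition: multiplying ciphertexts adds the plaintexts.
    return (c1 * c2) % n_sq

# The "host" only ever sees c1, c2, and c_sum; without the secret key,
# they look like any other random-looking group elements.
c1, c2 = encrypt(42), encrypt(58)
c_sum = add_ciphertexts(c1, c2)
assert decrypt(c_sum) == 100
```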
The problem is that it can still seize full control over the insides of the box, and we’re presumably going to interact with those insides at some point. At which point the AI will 1) show us whatever it needs to show us to get us to open the box further (if we’re limiting bandwidth at the hardware level), then 2) break out once we eventually give it a high-bandwidth channel.
No, “inside the box” as in inside the encrypted computation: the AI’s architecture, the training setup, whatever regularizers/interpreters we have as part of that, but nothing beyond this.
Well, the idea is that we basically give the code of the AI architecture and training loop as input to, e.g., a Lisp interpreter, in encrypted form. The interpreter would then execute that program without understanding it, i.e., directly derive the encrypted versions of the program’s next steps. Since the computations are encrypted, we can’t distinguish them from any other encrypted computation, so whatever clever strategy the AI tries in order to break out, it won’t look any different to us from an AI that isn’t trying to break out at all. As such, hacking out is impossible without breaking the encryption.
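And to make “execute the program without understanding it” concrete, here’s a continuation of the toy sketch above (reusing its encrypt/decrypt/add_ciphertexts). Our toy scheme only supports addition, so the “program” is just a list of encrypted increments rather than anything like a real training loop, but the shape is the same: the host derives each encrypted successor state without ever holding the key or seeing a plaintext. A real FHE scheme would allow arbitrary computation in place of the additions.

```python
# Owner side (has the key): encrypt the initial state and the "program".
enc_state = encrypt(0)
enc_program = [encrypt(step) for step in (7, 13, 22)]

# Host side (no key): apply each encrypted instruction to the encrypted
# state. Every intermediate value is a ciphertext; nothing here reveals
# what the program is computing, or whether it is "trying" anything.
for enc_instruction in enc_program:
    enc_state = add_ciphertexts(enc_state, enc_instruction)

# Only the owner, back on their side with the secret key, can decrypt.
assert decrypt(enc_state) == 42
```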
Another angle: from the point of view inside the homomorphically encrypted computation, it’s as if the rest of the world were encrypted. So to cleverly manipulate any part of the outside (e.g., hack the system running it), the AI would need to decrypt that part without the secret key, i.e., break the encryption. It can still take over everything inside the encrypted computation, but everything beyond it (starting with the Lisp interpreter) is inaccessible.
Cool! I’ve had a similar idea. You should consider submitting it to superlinear.