For instance, why couldn’t you just run the code on a machine without an output device (such as a monitor or speaker) if you don’t want to see any results while the AI is running?
Because that would allow the AI to directly hack its way out of the training loop. It’d be able to exploit any vulnerabilities in the hardware and software stack implementing the training process, not only the vulnerabilities in the training process itself. And this is easy enough that some contemporary models already do it (e.g., they learn to delete or alter the datasets they’re being scored against).
That is, homomorphic encryption dramatically decreases the attack surface.
Also, I guess we’re assuming here that the AI a) cannot break FHE itself, and b) cannot obtain root access to the machine. Because if it could, it could again forcibly produce more than one bit of output (unless you have a hardware solution that rejects anything beyond one bit of output).
Yeah, if it can break FHE then all bets are off, and in the spirit of the least convenient possible world for my point, I’m assuming we’re using hardware solutions whenever possible.
I mean, it can’t hack out of the box. As long as the training process stays encrypted, the AI is screened off from the rest of reality, and it can’t do anything without breaking the encryption (its computations can’t be externally distinguished from any other kind of homomorphically encrypted computation, so running it is as safe as homomorphically summing up random numbers; see the toy sketch after this comment).
The problem is that it can still seize full control over the insides of the box, and we’re presumably going to interact with those insides at some point. And then, if we’re limiting the bandwidth at the hardware level, the AI will 1) show us whatever we need to see to convince us to open the box further, and then 2) break out once we eventually give it a high-bandwidth channel.
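To make the “as safe as summing random numbers” point concrete, here’s a toy sketch (my own illustration, not real FHE): a one-time-pad-style additive scheme where each value is masked with a fresh uniformly random key. Every ciphertext is then uniformly distributed, so an outside observer sees the same statistics whether the plaintexts came from an “AI training step” or from pure noise. The names and numbers below are made up for illustration.

```python
import secrets

N = 2**32  # toy modulus; everything below is mod N

def encrypt(m, key):
    """Toy one-time-pad-style additive encryption: c = m + k (mod N)."""
    return (m + key) % N

def decrypt(c, key):
    return (c - key) % N

# Two very different plaintext streams...
ai_step_outputs = [7, 13, 42, 9]                      # pretend these are "AI" values
random_noise    = [secrets.randbelow(N) for _ in range(4)]

# ...encrypted under fresh uniformly random keys.
keys_a = [secrets.randbelow(N) for _ in ai_step_outputs]
keys_b = [secrets.randbelow(N) for _ in random_noise]

ct_a = [encrypt(m, k) for m, k in zip(ai_step_outputs, keys_a)]
ct_b = [encrypt(m, k) for m, k in zip(random_noise, keys_b)]

# Both ciphertext lists are uniformly distributed over [0, N); without the keys,
# an observer cannot tell which list hides "AI" values and which hides noise.
print(ct_a)
print(ct_b)

# The key holder can still recover the plaintexts.
assert [decrypt(c, k) for c, k in zip(ct_a, keys_a)] == ai_step_outputs
```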
No, “inside the box” as in inside the encrypted computation: the AI’s architecture, the training setup, whatever regularizers/interpreters we include as part of that, but nothing beyond this.
Well, the idea is that we basically give the code of the AI architecture and training loop, in encrypted form, as input to e.g. a Lisp interpreter. The interpreter would then execute that program without understanding it, i.e., directly derive the encrypted versions of the program’s next steps. Since the computations are encrypted, we can’t distinguish them from any other kind of encrypted computation, so whatever clever strategy the AI tries in order to break out, it won’t look any different to us from an AI that isn’t trying to break out at all. As such, hacking out is impossible without breaking the encryption.
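A minimal sketch of the “evaluate without understanding” idea, using the same kind of toy additive scheme rather than real FHE (real schemes like BFV/CKKS/TFHE also support multiplication, hence arbitrary circuits, and hiding the program itself additionally needs something like a universal circuit or an encrypted interpreter; none of that is shown here). The split into “key holder” and “evaluator” roles is my own framing for illustration.

```python
import secrets

N = 2**32  # toy modulus

def encrypt(m, key):
    return (m + key) % N

def decrypt(c, key):
    return (c - key) % N

# --- Key holder's side --------------------------------------------------
inputs = [3, 5, 11]                               # plaintext inputs to the "program"
keys   = [secrets.randbelow(N) for _ in inputs]   # one fresh key per input
ciphertexts = [encrypt(m, k) for m, k in zip(inputs, keys)]

# --- Evaluator's side (sees only ciphertexts, never the keys) -----------
def blind_sum(cts):
    """Run the computation on ciphertexts directly; the evaluator never
    learns the inputs, the intermediate values, or the result."""
    acc = 0
    for c in cts:
        acc = (acc + c) % N
    return acc

encrypted_result = blind_sum(ciphertexts)

# --- Back on the key holder's side ---------------------------------------
# Additive homomorphism: the result decrypts under the sum of the keys.
result = decrypt(encrypted_result, sum(keys) % N)
assert result == sum(inputs) % N
print(result)  # 19
```

The evaluator here plays the role of the interpreter: it performs each step mechanically on ciphertexts and ends up holding an answer it cannot read, which is exactly why nothing it runs can leak out or act on the outside without the secret key.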
Another angle: if you think about it from the point of view inside the homomorphically encrypted computation, it’s as if the rest of the world were encrypted. So to somehow cleverly manipulate parts of the outside (e.g., hack the system running it), the AI would need to decrypt them without the secret key, i.e., break the encryption. It can still take over everything inside the encrypted computation, but everything beyond it (starting at the Lisp interpreter) is inaccessible.
Cool! I’ve had a similar idea. You should consider submitting it to superlinear.