Not necessarily. It’s routed through the interpretability channel, but the idea is that the AI would be able to take over whatever system is running the interpretability tool through that channel (either by shaping the tool’s direct outputs, or by manipulating the computational side-effects of that tool being run). So it wouldn’t be limited to controlling the tool’s output; it would control the whole system’s output.
I guess we can homomorphically encrypt the training process and then only decrypt the interpretability results on a separate air-gapped computer whose outputs we’re not using for anything else. In that case, yeah, the AI would only be able to influence us through the interpretability results. (Which wouldn’t be hard for it or anything. Presumably we’re doing that because there are some results we would judge as “good”, and we’d give the go-ahead to continue the training or release the AI based on them. The AI will make sure we get such results.)
For instance, why couldn’t you just run the code on a machine without an output device (such as a monitor or speaker) if you don’t want to see any results while the AI is running?
Because that would allow the AI to directly hack out of the training loop. It’d be able to exploit any vulnerabilities in the tech and software stacks implementing the training process, not only the vulnerabilities in the training process itself. And this is so easy that some contemporary models do it (e.g., they learn to delete or alter the datasets they’re being scored against).
That is, homomorphic encryption dramatically decreases the attack surface.
Also, I guess we’re assuming here that the AI a) cannot break FHE itself, and b) cannot obtain root access to the machine. Because if it could, it could again forcibly create more than one bit of output (unless you have a hardware solution to reject more than one bit of output).
Yeah, if it can break FHE then all bets are off, and in the spirit of the least convenient possible world for my point, I’m assuming we’re using hardware solutions whenever possible.
I mean, it can’t hack out of the box. While the training process is encrypted, the AI is screened off from the rest of reality, and it can’t do anything without breaking the encryption (its computations can’t be externally distinguished from any other kind of homomorphically encrypted computation, so under homomorphic encryption, running it is as safe as summing up random numbers).
The problem is that it can still seize full control over the insides of the box, and we’re presumably going to interact with them at some point. And then the AI will 1) show us whatever we need to see to open the box further, if we’re limiting the bandwidth at the level of hardware, and then 2) break out once we eventually give it a high-bandwidth channel.
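To make the bandwidth point concrete, here is a toy Python model of a one-bit output limiter. It is only a software stand-in for the hardware solution being assumed; `OneBitChannel`, `verdict_from_inside_the_box`, and the `looks_aligned` field are hypothetical names invented for this sketch.

```python
class OneBitChannel:
    """Toy software stand-in for a hardware limiter: passes one bit, once."""

    def __init__(self):
        self._used = False

    def emit(self, value):
        # Reject anything after the first bit has crossed the channel.
        if self._used:
            raise RuntimeError("channel already used; further output is rejected")
        # Reject anything that is not a single bit.
        if not (isinstance(value, int) and value in (0, 1)):
            raise ValueError("only a single bit (0 or 1) may cross the channel")
        self._used = True
        return int(value)


def verdict_from_inside_the_box(report):
    """Hypothetical check that runs inside the encrypted computation."""
    return 1 if report.get("looks_aligned") else 0


# The AI controls everything inside the box, including the report the
# check reads, so it can make the single bit come out as "go".
tampered_report = {"looks_aligned": True}

channel = OneBitChannel()
go_ahead = channel.emit(verdict_from_inside_the_box(tampered_report))
print(go_ahead)
```

The limiter does its job here (a second `emit` call raises), but the one bit it lets through is still computed from state the AI controls.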
No, inside the box as in inside the encrypted computation. The AI’s architecture, the training setup, whatever regularizers/interpreters we have as part of that, but nothing beyond this.
Well, the idea is that we basically give the code of the AI architecture and training loop as input to e.g. a Lisp interpreter, in an encrypted form. The interpreter would then execute that program without understanding it, i.e., directly derive the encrypted versions of the next steps of that program. The computations being encrypted means that we can’t distinguish them from any other kind of encrypted computations, so whatever clever strategy the AI is using to try to break out, it won’t look any different to us than an AI that isn’t trying to break out. As such, hacking out is impossible without breaking encryption.
Another angle: If you think about it from the point of view inside the homomorphically encrypted computation, it’s as if the rest of the world were encrypted. So to somehow cleverly manipulate parts of the outside (e.g., hack the system running it), the AI would need to decrypt them without the secret key, i.e. break encryption. It can still take over everything inside the encrypted computation, but everything beyond it (starting at the Lisp interpreter) is inaccessible.
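For anyone who wants to see the "execute without understanding" part in miniature, here is a rough Python sketch using textbook Paillier encryption. Big caveats: Paillier is only additively homomorphic, not FHE, so it can sum encrypted numbers but can't run a training loop; the primes are toy-sized and insecure; and all the function names are made up for this illustration.

```python
import math
import secrets

# Toy parameters: two known Mersenne primes. Far too small for real security.
P = 2**31 - 1
Q = 2**61 - 1


def keygen(p=P, q=Q):
    """Return (public modulus n, secret key) for textbook Paillier with g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt an integer 0 <= m < n using only the public modulus n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # fresh randomness per ciphertext
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^m mod n^2 equals 1 + m*n, so no big exponentiation is needed here.
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, sk, c):
    """Decrypt with the secret key; only the key holder can do this."""
    lam, mu = sk
    n2 = n * n
    return (pow(c, lam, n2) - 1) // n * mu % n


def evaluate_sum(n, ciphertexts):
    """What the untrusted host runs: it sees n and ciphertexts, never the key.

    Multiplying Paillier ciphertexts adds the underlying plaintexts.
    """
    acc = 1  # a valid encryption of 0
    for c in ciphertexts:
        acc = acc * c % (n * n)
    return acc


n, sk = keygen()
cts = [encrypt(n, m) for m in (5, 7, 30)]
encrypted_total = evaluate_sum(n, cts)  # host never learns 5, 7, 30, or 42
print(decrypt(n, sk, encrypted_total))  # 42
```

The host in `evaluate_sum` only ever multiplies large opaque integers, whatever the plaintexts are. The claim in the comments above is the FHE analogue of this, extended from addition to arbitrary programs.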
Cool! I’ve had a similar idea. You should consider submitting it to superlinear.