The point is that the arbiter is simple, and that the AI’s entire interaction with it can be restricted precisely to sending it a bit string of the agreed-upon length.
We actually have in our possession proof techniques for predicting the behavior of programs under these conditions and proving invariants about them. We should only expect these proof techniques to improve rapidly with time. However strong your proof techniques are at the time, don’t make an arbiter more powerful than you can verify. I propose that even arbiters simple enough to be verified today already allow you to exploit an AI in a powerful way.
OK, that at least I can understand well enough to ask a clarifying question about.
So, I’m the AI, and I’m stuck in a cryptographic box with an arbiter that can send a single bit to the outside world. For whatever reason, I want the arbiter to send TRUE under situations where what it actually sends is FALSE. The arbiter is too simple for me to mess with. Right?
So presumably what I want to do is either insinuate myself between the arbiter and the outside world, or delete the arbiter and replace it with an arbiter-simulator with a more complicated (and corrupted) internal architecture.
I don’t understand what prevents me from doing that.
In short, the security guarantee of the scheme prevents precisely this.
In long, the way it works is that I, as the human experimenter, run the AI and extract a message from it. The AI has no control over the message other than what it will decrypt to. Since I never decrypt anything other than the message, the entire influence of the AI on the rest of the world comes from that one little message.
This message then gets passed to an arbiter (the AI now destroyed, having no possibility of influencing the world), whose environment has also been carefully cryptographically protected so that the message passed by the AI can’t corrupt the hardware it is running on. The arbiter does its thing, then destroys the original message of the AI (since we proved it would) and publishes its transformed version, which gets decrypted.
I guess the metaphor of an AI with an arbiter in a cage is a little misleading. It is more like, the AI gets to send a slip of paper out of the cage to the arbiter, who then carefully censors it before passing it on to the real world. The point of using an arbiter is that he is very simple, unlike a human, so we can actually prove that he won’t fail at his task.
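To make the slip-of-paper picture concrete, here is a minimal sketch of what such an arbiter might look like. Everything in it is an assumption for illustration: the message length `MESSAGE_BITS`, the public number `PUBLIC_N`, and the particular check (does the message encode a nontrivial divisor of `PUBLIC_N`?) are hypothetical, and the encryption layer and the destruction of the original message are not modeled at all. The only thing it is meant to show is the shape of the interface: a fixed-length bit string goes in, one bit comes out.

```python
# Hypothetical parameters, chosen only for illustration.
MESSAGE_BITS = 64      # the agreed-upon message length
PUBLIC_N = 2021027     # a public number (here 1009 * 2003)


def arbiter(message: str) -> bool:
    """Reduce the AI's message to a single bit.

    The message is treated as untrusted input. Anything that is not
    exactly MESSAGE_BITS characters of '0'/'1' is answered with False.
    """
    if len(message) != MESSAGE_BITS:
        return False
    if any(ch not in "01" for ch in message):
        return False

    candidate = int(message, 2)

    # The fixed, simple check: is the candidate a nontrivial divisor
    # of PUBLIC_N? Only this one bit ever leaves the function.
    return 1 < candidate < PUBLIC_N and PUBLIC_N % candidate == 0
```

For example, `arbiter(format(1009, "064b"))` is `True`, while a message of the wrong length or one containing other characters is simply answered with `False`. The function is short enough that one could plausibly prove invariants about it, which is the property the scheme relies on.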
Ah! Yes, that clarifies the situation. Thank you.
And the paper is too short for you to be able to argue persuasively or make art or whatever.