whereas the AI is only allowed to send 1 bit total before dying.
Not quite right. The AI, in theory, can only send humans 1 bit total before dying, but it can send millions of bits to the arbiter, which might be corruptible. I think the piece of this scheme that is doing the magic is the uncorruptible arbiter that perfectly distinguishes friendly from unfriendly code. We don’t know anything about either of those two impossible problems yet.
I agree that coming up with a complex uncorruptible arbiter is difficult.
However, I can imagine simple arbiters which might be helpful but which I can implement. For example, suppose I just want to find a simple explanation for some physical data. I could imagine an arbiter which simply elicits an explanation (in the form of a short program which correctly predicts the data) from the AI and then tells us either “yes, the AI found a very simple explanation” or “no, the AI didn’t give any very simple explanation.” I am working on a post describing one approach to getting useful data out of an unfriendly AI without letting him do anything bad.
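To make that concrete, here is a rough sketch of what such a "simple explanation" arbiter could look like. The `predict` interface, the character-length threshold, and the use of bare `exec` as the execution environment are all illustrative assumptions on my part, not part of the scheme as stated; a real arbiter would run the candidate program inside whatever protected environment the scheme provides.

```python
def run_candidate(program_source: str, n_points: int) -> list:
    # Hypothetical stand-in for the protected execution environment:
    # here it is just exec() in a bare namespace, which is NOT secure.
    namespace: dict = {}
    exec(program_source, {"__builtins__": {}}, namespace)
    predict = namespace["predict"]            # candidate must define predict(i)
    return [predict(i) for i in range(n_points)]

def arbiter(program_source: str, observed_data: list, max_chars: int = 200) -> bool:
    # Emit exactly one bit: True iff the explanation is short and reproduces the data.
    if len(program_source) > max_chars:       # "very simple" length threshold
        return False
    try:
        predicted = run_candidate(program_source, len(observed_data))
    except Exception:
        return False                          # any failure counts as "no"
    return predicted == observed_data

# Example: data generated by a simple law, and a short program that explains it.
data = [0, 1, 4, 9, 16]
print(arbiter("def predict(i):\n    return i * i\n", data))   # True
```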
OK, but why on earth would you think that the arbiter will be immune to the superintelligent suasion of the AI?
Let’s say you program it yourself on hardware you design yourself and you understand very well what the arbiter does under normal circumstances. The AI’s new job is to figure out how to create abnormal circumstances under which it can break your arbiter and steal its resources. Suppose it succeeds—how will you stop the arbiter from passing on a message from the AI that will corrupt your mind, which will then turn you into an unwitting agent of the ufAI, etc.? Think corny zombie movies, only in real life.
I know the arbiter is only supposed to be able to send you one bit of information. But how do you propose to stop it from sending you more when it’s no longer responding to your programming? Will you equip it with diodes that burn out immediately after being switched on?
The point is that the arbiter is simple, and that the AI’s entire interaction with it can be restricted precisely to sending it a bit string of the agreed upon length.
We actually have in our possession proof techniques for predicting the behavior of programs under these conditions and proving invariants about them. We should only expect these proof techniques to improve rapidly with time. However strong your proof techniques are at the time, don’t make an arbiter more powerful than you can verify. I propose that even arbiters simple enough to be verified today already allow you to exploit an AI in a powerful way.
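As a rough illustration of how small the verification target can be, here is a toy arbiter whose entire interface is a fixed-length bit string in and one bit out. The checksum body and the particular message length are placeholder assumptions; the point is only that a bounded loop with no I/O and no dynamic behavior is the kind of code whose invariants existing verification tools can already prove.

```python
MESSAGE_BITS = 4096        # length agreed upon before the AI is ever run

def arbiter(message: bytes) -> bool:
    # Entire interface: a fixed-length bit string in, exactly one bit out.
    # The checksum test is a deliberately trivial placeholder for whatever
    # property you actually care about checking.
    assert len(message) * 8 == MESSAGE_BITS   # reject anything off-protocol
    checksum = 0
    for byte in message:
        checksum = (checksum + byte) % 251
    return checksum == 0

# The AI's only channel of influence is the 4096-bit message itself:
print(arbiter(bytes(512)))                    # True (all-zero message sums to 0)
```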
OK, that at least I can understand well enough to ask a clarifying question about.
So, I’m the AI, and I’m stuck in a cryptographic box with an arbiter that can send a single bit to the outside world. For whatever reason, I want the arbiter to send TRUE in situations where it would actually send FALSE. The arbiter is too simple for me to mess with. Right?
So presumably what I want to do is either insinuate myself between the arbiter and the outside world, or delete the arbiter and replace it with an arbiter-simulator with a more complicated (and corrupted) internal architecture.
I don’t understand what prevents me from doing that.
In short, the security guarantee of the scheme prevents precisely this.
In long, the way it works is that I, as the human experimenter, run the AI and extract a message from it. The AI has no control over the message other than what it will decrypt to. Since I never decrypt anything other than the message, the entire influence of the AI on the rest of the world comes from that one little message.
This message then gets passed to an arbiter (the AI now destroyed, having no possibility of influencing the world), whose environment has also been carefully cryptographically protected so that the message passed by the AI can’t corrupt the hardware it is running on. The arbiter does its thing, then destroys the original message of the AI (since we proved it would) and publishes its transformed version, which gets decrypted.
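Abstracting away the cryptography, the control flow I have in mind looks roughly like this. Both `run_ai` and `arbiter` are hypothetical callables standing in for the encrypted computations; in the real scheme the message never exists in the clear, so this is only a sketch of the ordering of steps.

```python
def run_experiment(run_ai, arbiter, message_bytes: int = 512) -> bool:
    # Control flow only; in the real scheme both calls below happen under
    # encryption, so nothing but the arbiter's final bit ever leaves the box.
    message = run_ai()                        # 1. AI's sole output: one message
    message = message[:message_bytes]         #    enforce the agreed-upon length
    run_ai = None                             # 2. AI destroyed before anything is acted on
    verdict = arbiter(message)                # 3. verified arbiter inspects the message
    message = None                            # 4. original message destroyed;
    return bool(verdict)                      #    only this one bit gets decrypted
```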
I guess the metaphor of an AI with an arbiter in a cage is a little misleading. It is more like, the AI gets to send a slip of paper out of the cage to the arbiter, who then carefully censors it before passing it on to the real world. The point of using an arbiter is that he is very simple, unlike a human, so we can actually prove that he won’t fail at his task.
Ah! Yes, that clarifies the situation. Thank you.
And the paper is too short for you to be able to argue persuasively or make art or whatever.
Well, that last part seems relatively easy to ensure: set up a two-layer arbiter, the first layer of which is the arbiter we’ve been talking about, and the second layer of which is outside the box and is just a stupid hardware switch that strips out all but the first bit of the first arbiter’s message.
If something inside the box can alter the second layer, then it can probably also directly alter the contents of my mind. You have to assume some limits or the whole exercise is pointless.
None of which changes the fact that a message from the arbiter that says “yes, the AI found a very simple explanation” doesn’t provide much evidence that the AI found a very simple explanation.
You can have more than one arbiter, and the arbiters can be specialized. The AI produces its message, and then you run a series of arbiters on it. One is a proof checker, and the rest are for checking whether the proof is tricky or obfuscated, in various different ways. That way, the AI would have to corrupt every arbiter (or at least, every arbiter capable of detecting the techniques it’s using to hack the other arbiters), and it is much less likely that it’s possible to do this than to fool just one.
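A minimal sketch of that combination rule, with the individual arbiter functions left hypothetical, is that acceptance requires unanimity:

```python
def combined_verdict(message: bytes, arbiters) -> bool:
    # Run the AI's single message past every specialized arbiter
    # (a proof checker, several obfuscation detectors, ...).
    # Each check() is a hypothetical one-bit arbiter.
    for check in arbiters:
        if not check(message):
            return False                      # any single "no" vetoes the message
    return True
```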
It seems clear to me that if an entity is able to intentionally hack an arbiter in the first place, having multiple equally insecure variant arbiters only postpones the inevitable.
If that is somehow not true of the arbiter you are visualizing, then I haven’t properly visualized this arbiter. I can’t even figure out what questions to ask to clarify it; the whole idea just seems like a non-starter.
I should therefore drop out of this conversation and go back to listening.
The point is that you’re uncertain about whether arbiters are secure or not, and this lets you combine multiple possibly-insecure arbiters into a more-secure combination. If you have two arbiters which each have a discoverable and exploitable security hole with probability 0.5, and they don’t share any code and use sufficiently different techniques, then the AI can only fool both of them with probability 0.25.
In fact, it may be better than that—the two arbiters may both have security holes, but exploiting them may be mutually exclusive. For example, if the AI can force either arbiter to run a payload, but has to give the same payload to both while they expect different programming languages with disjoint syntax, then any exploit will crash at least one arbiter, and returning the “arbiter crashed” flag means you know the AI is untrustworthy.
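A minimal sketch of that cross-checking arrangement, with both arbiters hypothetical, might look like this:

```python
def cross_check(payload: bytes, arbiter_a, arbiter_b):
    # Feed the same payload to two arbiters built on deliberately different
    # foundations (e.g. interpreters with disjoint syntax). A crash in either
    # is reported separately from an honest "no", since it is itself evidence
    # that the AI tried to smuggle in an exploit.
    results = []
    for check in (arbiter_a, arbiter_b):
        try:
            results.append(check(payload))
        except Exception:
            return "ARBITER_CRASHED"          # treat the AI as untrustworthy
    return all(results)                       # accept only on unanimous "yes"
```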
I don’t see how I could ever know—or even be significantly confident—that a given pair of arbiters X and Y were such that the probability that an AI can exploit arbiter X doesn’t correlate well with the probability that the AI can exploit arbiter Y.
That is, I don’t see how I could ever, even in principle, be significantly confident that their algorithms were “sufficiently different” for the probabilities to multiply.
The same thing goes for knowing that whatever system is ensuring that X and Y receive the same payload hasn’t been corrupted. (ETA: That part has been clarified elsethread; thanks!)
Now, maybe what I know doesn’t matter: maybe we’re not talking about provable security. If you’re saying that we could build networks of arbiters and linkages between those arbiters in a secure box such that there was an unknowable but nonzero chance that a hostile superintelligence in that box had failed to corrupt those arbiters, then I agree with you, at least in principle.
Actually, that part’s easy. The message can only corrupt things which both receive it as input and perform nontrivial computation on it. You don’t need an “arbiter” to copy a message, nor do you need to decrypt it.