Let’s consider a somewhat similar case.
You are an inventor. An evil dictator captures you, and takes you off to a faraway dungeon, where he tells you that he wants you to build him a superweapon. If you refuse to build the weapon, well, he has means of persuading you. If you still refuse, he will kill you.
Of course, when you finish building the doomsday device, your usefulness will be over, and he will probably kill you anyway.
Being a genius, you soon realise that you could build the dictator a doomsday device in about a week, with the materials lying around in your well-equipped dungeon workshop. However, this doesn’t seem terribly optimal.
What do you do? You agree to build the doomsday device. But you claim it is very difficult. You spend a long time producing impressive prototypes that seem to suggest you’re making progress. You claim that you need more resources. You carry on, making just enough apparent progress to stop him from killing you on the spot, but not enough to actually succeed. You build ever more expensive and complex non-working prototypes, hoping that in the course of this, you can either build yourself something suitable for breaking out of the dungeon or killing the evil dictator. Or hoping that perhaps someone will rescue you. At the very least you will have wasted the evil dictator’s resources on doing something pointless—you will at least not die for nothing.
I suspect your imprisoned AI may choose to follow a similar course.
Or you build yourself a superweapon that you use to escape, and then go on to shut down your company’s weapons division and spend your spare time being a superhero and romancing your assistant and fighting a pitched battle with a disloyal employee.
This reply and its parent comment constitute the “Iron Man Argument” against any kind of “put the AI in a box” approach to AGI and friendliness concerns. I predict it will be extremely effective.
That doesn’t really work in this situation, since the AI building another AI won’t get it out of the box.
It’s more like you can only design things, and the evil dictator can tell if it’s not really a doomsday device and won’t build it.
Also, he’s not going to be checking on your progress, so faking it is useless, but it’s also unnecessary.
Note that none of these options are available to Paul’s uFAI. It can’t communicate with its jailers at all. The jailers won’t know anything about what the uFAI is doing (beyond what their knowledge of its source code tells them) until the uFAI outputs a verifiably friendly source code. For all they know, the uFAI is just sitting in the corner and sulking.
This is quite possibly the optimal response. If we don’t get FAI out of the first attempt, we might try again with another similar design, possibly giving the uFAI more resources. This twin uFAI is much more likely to implement the uFAI’s goals than any FAI it might create.
Even a single bit is too much control if we need the bit to be set to 1.
The box automatically kills the AI after a fixed length of time. Furthermore, most of the suggestions in your post require significant interaction with the outside world, whereas the AI is only allowed to send 1 bit total before dying.
Not quite right. The AI, in theory, can only send humans 1 bit total before dying, but it can send millions of bits to the arbiter, which might be corruptible. I think the piece of this scheme that is doing the magic is the uncorruptible arbiter that perfectly distinguishes friendly from unfriendly code. We don’t know anything about either of those two impossible problems yet.
I agree that coming up with a complex uncorruptible arbiter is difficult.
However, I can imagine simple arbiters which might be helpful but which I can implement. For example, suppose I just want to find a simple explanation for some physical data. I could imagine an arbiter which simply elicits an explanation (in the form of a short program which correctly predicts the data) from the AI and then tells us either “yes, the AI found a very simple explanation” or “no, the AI didn’t give any very simple explanation.” I am working on a post describing one approach to getting useful data out of an unfriendly AI without letting him do anything bad.
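To make the idea of a "simple arbiter" concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the explanation is taken to be a byte string of polynomial coefficients rather than a general program, `MAX_EXPLANATION_BYTES` is a made-up threshold for "very simple", and `predict` and `arbiter` are hypothetical names. It is a toy, not the arbiter proposed above, but it shows the shape: a fixed-format input goes in, and exactly one bit comes out.

```python
from typing import Sequence

# Hypothetical threshold for what counts as a "very simple" explanation.
MAX_EXPLANATION_BYTES = 8


def predict(explanation: bytes, n: int) -> int:
    """Toy interpreter: treat the bytes as polynomial coefficients in n, mod 256."""
    total = 0
    for power, coeff in enumerate(explanation):
        total += coeff * n ** power
    return total % 256


def arbiter(explanation: bytes, data: Sequence[int]) -> bool:
    """Return one bit: True only if the explanation is short and predicts all the data."""
    if len(explanation) > MAX_EXPLANATION_BYTES:
        return False
    return all(predict(explanation, n) == value for n, value in enumerate(data))


# Usage: data generated by 3 + 2n is explained by the two coefficients (3, 2).
observations = [(3 + 2 * n) % 256 for n in range(10)]
verdict = arbiter(bytes([3, 2]), observations)
```

The interpreter here is deliberately not a general-purpose language, since the less the arbiter can be made to compute, the easier it should be to verify.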
OK, but why on earth would you think that the arbiter will be immune to the superintelligent suasion of the AI?
Let’s say you program it yourself on hardware you design yourself and you understand very well what the arbiter does under normal circumstances. The AI’s new job is to figure out how to create abnormal circumstances under which it can break your arbiter and steal its resources. Suppose it succeeds: how will you stop the arbiter from passing on a message from the AI that will corrupt your mind, which will then turn you into an unwitting agent of the uFAI, etc.? Think corny zombie movies, only in real life.
I know the arbiter is only supposed to be able to send you one bit of information. But how do you propose to stop it from sending you more when it’s no longer responding to your programming? Will you equip it with diodes that burn out immediately after being switched on?
The point is that the arbiter is simple, and that the AI’s entire interaction with it can be restricted precisely to sending it a bit string of the agreed upon length.
We actually have in our possession proof techniques for predicting the behavior of programs under these conditions and proving invariants about them. We should only expect these proof techniques to improve rapidly with time. However strong your proof techniques are at the time, don’t make an arbiter more powerful than you can verify. I propose that even arbiters simple enough to be verified today already allow you to exploit an AI in a powerful way.
OK, that at least I can understand well enough to ask a clarifying question about.
So, I’m the AI, and I’m stuck in a cryptographic box with an arbiter that can send a single bit to the outside world. For whatever reason, I want the arbiter to send TRUE under situations where what it actually sends is FALSE. The arbiter is too simple for me to mess with. Right?
So presumably what I want to do is either insinuate myself between the arbiter and the outside world, or delete the arbiter and replace it with an arbiter-simulator with a more complicated (and corrupted) internal architecture.
I don’t understand what prevents me from doing that.
In short, the security guarantee of the scheme prevents precisely this.
In long, the way it works is that I, as the human experimenter, run the AI and extract a message from it. The AI has no control over the message other than what it will decrypt to. Since I never decrypt anything other than the message, the entire influence of the AI on the rest of the world comes from that one little message.
This message then gets passed to an arbiter (the AI now destroyed, having no possibility of influencing the world), whose environment has also been carefully cryptographically protected so that the message passed by the AI can’t corrupt the hardware it is running on. The arbiter does its thing, then destroys the original message of the AI (since we proved it would) and publishes its transformed version, which gets decrypted.
I guess the metaphor of an AI with an arbiter in a cage is a little misleading. It is more like, the AI gets to send a slip of paper out of the cage to the arbiter, who then carefully censors it before passing it on to the real world. The point of using an arbiter is that he is very simple, unlike a human, so we can actually prove that he won’t fail at his task.
Ah! Yes, that clarifies the situation. Thank you.
And the paper is too short for you to be able to argue persuasively or make art or whatever.
Well, that last part seems relatively easy to ensure: set up a two-layer arbiter, the first layer of which is the arbiter we’ve been talking about, and the second layer of which is outside the box and is just a stupid hardware switch that strips out all but the first bit of the first arbiter’s message.
If something inside the box can alter the second layer, then it can probably also directly alter the contents of my mind. You have to assume some limits or the whole exercise is pointless.
None of which changes the fact that a message from the arbiter that says “yes, the AI found a very simple explanation” doesn’t provide much evidence that the AI found a very simple explanation.
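The second-layer switch described above is simple enough to write down in full. A software stand-in might look like the sketch below. The name `first_bit_only` and the choice to read the most significant bit of the first byte are assumptions for illustration, and a real version would be hardware rather than Python.

```python
def first_bit_only(message: bytes) -> int:
    """Discard everything except the first bit of the message.

    Assumption: "first bit" means the most significant bit of the first byte.
    An empty message is treated as 0.
    """
    if not message:
        return 0
    return message[0] >> 7
```

Whatever the first arbiter emits, the outside world sees at most one bit. This does not address whether that bit is trustworthy.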
You can have more than one arbiter, and the arbiters can be specialized. The AI produces its message, and then you run a series of arbiters on it. One is a proof checker, and the rest are for checking whether the proof is tricky or obfuscated, in various different ways. That way, the AI would have to corrupt every arbiter (or at least, every arbiter capable of detecting the techniques it’s using to hack the other arbiters), and it is much less likely that it’s possible to do this than to fool just one.
It seems clear to me that if an entity is able to intentionally hack an arbiter in the first place, having multiple equally insecure variant arbiters only postpones the inevitable.
If that is somehow not true of the arbiter you are visualizing, then I haven’t properly visualized this arbiter. I can’t even figure out what questions to ask to clarify it; the whole idea just seems like a non-starter.
I should therefore drop out of this conversation and go back to listening.
The point is that you’re uncertain about whether arbiters are secure or not, and this lets you combine multiple possibly-insecure arbiters into a more-secure combination. If you have two arbiters which each have a discoverable and exploitable security hole with probability 0.5, and they don’t share any code and use sufficiently different techniques, then the AI can only fool both of them with probability 0.25.
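The 0.25 figure is just the product rule for independent events. A minimal sketch, assuming each arbiter's security hole really is independent of the others (the function name `p_all_fooled` is made up for illustration):

```python
from math import prod
from typing import Iterable


def p_all_fooled(hole_probs: Iterable[float]) -> float:
    """Probability that every arbiter is exploitable, assuming independence."""
    return prod(hole_probs)


two_arbiters = p_all_fooled([0.5, 0.5])         # 0.25
three_arbiters = p_all_fooled([0.5, 0.5, 0.5])  # 0.125
```

The independence assumption is doing all the work here. If the holes are correlated, the product understates the risk.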
In fact, it may be better than that—the two arbiters may both have security holes, but exploiting them may be mutually exclusive. For example, if the AI can force either arbiter to run a payload, but has to give the same payload to both while they expect different programming languages with disjoint syntax, then any exploit will crash at least one arbiter, and returning the “arbiter crashed” flag means you know the AI is untrustworthy.
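Here is a toy model of the disjoint-syntax case. It assumes, purely for illustration, that arbiter X's exploited interpreter only accepts digits and arbiter Y's only accepts lowercase letters. The alphabets, `ArbiterCrashed`, `run_payload`, and `payload_survives_both` are all hypothetical. The point is only that no single payload can be valid for both.

```python
# Hypothetical disjoint "languages" for the two arbiters.
ALPHABET_X = set("0123456789")
ALPHABET_Y = set("abcdefghijklmnopqrstuvwxyz")


class ArbiterCrashed(Exception):
    """Raised when an arbiter is handed a payload it cannot parse."""


def run_payload(payload: str, alphabet: set) -> None:
    """Toy interpreter: crash on an empty payload or on any symbol outside the alphabet."""
    if not payload or any(ch not in alphabet for ch in payload):
        raise ArbiterCrashed(payload)


def payload_survives_both(payload: str) -> bool:
    """True only if the same payload runs on both arbiters without a crash."""
    for alphabet in (ALPHABET_X, ALPHABET_Y):
        try:
            run_payload(payload, alphabet)
        except ArbiterCrashed:
            return False  # the "arbiter crashed" flag: treat the AI as untrustworthy
    return True
```

Because the two alphabets share no symbols, `payload_survives_both` returns `False` for every input in this toy model. Whether real arbiters can be made this cleanly disjoint is a separate question.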
I don’t see how I could ever know—or even be significantly confident—that a given pair of arbiters X and Y were such that the probability that an AI can exploit arbiter X doesn’t correlate well with the probability that the AI can exploit arbiter Y.
That is, I don’t see how I could ever, even in principle, be significantly confident that their algorithms were “sufficiently different” for the probabilities to multiply.
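To put a number on this worry: for two yes/no events with correlation coefficient rho, the joint probability is the product plus a covariance term. The sketch below uses that standard identity. The function name and the example values of rho are assumptions for illustration, not estimates of anything real.

```python
from math import sqrt


def p_both_exploitable(p_x: float, p_y: float, rho: float) -> float:
    """P(X and Y both exploitable) for two binary events with correlation rho.

    Uses Cov(X, Y) = rho * sqrt(p_x(1 - p_x) * p_y(1 - p_y)) and
    P(X and Y) = p_x * p_y + Cov(X, Y).
    """
    covariance = rho * sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return p_x * p_y + covariance


independent = p_both_exploitable(0.5, 0.5, 0.0)  # 0.25, the product rule
partly_tied = p_both_exploitable(0.5, 0.5, 0.5)  # 0.375
fully_tied = p_both_exploitable(0.5, 0.5, 1.0)   # 0.5, no gain from the second arbiter
```

So the benefit of a second arbiter shrinks smoothly as the correlation grows, and the calculation gives no guidance on how to bound rho in practice.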
The same thing goes for knowing that whatever system is ensuring that X and Y receive the same payload hasn’t been corrupted. (ETA: That part has been clarified elsethread; thanks!)
Now, maybe what I know doesn’t matter: maybe we’re not talking about provable security. If you’re saying that we could build networks of arbiters and linkages between those arbiters in a secure box such that there was an unknowable but nonzero chance that a hostile superintelligence in that box had failed to corrupt those arbiters, then I agree with you, at least in principle.
Actually, that part’s easy. The message can only corrupt things which both receive it as input and perform nontrivial computation on it. You don’t need an “arbiter” to copy a message, nor do you need to decrypt it.
… I don’t get it.