You can have more than one arbiter, and the arbiters can be specialized. The AI produces its message, and then you run a series of arbiters on it. One is a proof checker, and the rest are for checking whether the proof is tricky or obfuscated, in various ways. That way, the AI would have to corrupt every arbiter (or at least, every arbiter capable of detecting the techniques it’s using to hack the other arbiters), and it is much less likely that it can do this than that it can fool just one.
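As a minimal sketch of that pipeline (the checker functions here are toy stand-ins; building real ones is the hard part):

```python
from typing import Callable, Sequence

Arbiter = Callable[[bytes], bool]

def run_arbiters(message: bytes, arbiters: Sequence[Arbiter]) -> bool:
    """Accept the AI's message only if every specialized arbiter passes."""
    for arbiter in arbiters:
        try:
            if not arbiter(message):
                return False   # any single failed check rejects the message
        except Exception:
            return False       # a crashed arbiter also counts as rejection
    return True

# Toy stand-ins: a "proof checker" and one "obfuscation detector".
toy_arbiters = [
    lambda m: m.startswith(b"PROOF"),  # stand-in for a real proof checker
    lambda m: len(m) < 10_000,         # stand-in for one obfuscation check
]
print(run_arbiters(b"PROOF: ...", toy_arbiters))  # True
```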
It seems clear to me that if an entity is able to intentionally hack an arbiter in the first place, having multiple equally insecure variant arbiters only postpones the inevitable.
If that is somehow not true of the arbiter you are visualizing, then I haven’t properly visualized this arbiter. I can’t even figure out what questions to ask to clarify it; the whole idea just seems like a non-starter.
I should therefore drop out of this conversation and go back to listening.
Having multiple equally insecure variant arbiters only postpones the inevitable.
The point is that you’re uncertain about whether arbiters are secure or not, and this lets you combine multiple possibly-insecure arbiters into a more-secure combination. If you have two arbiters which each have a discoverable and exploitable security hole with probability 0.5, and they don’t share any code and use sufficiently different techniques, then the AI can only fool both of them with probability 0.25.
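Under that independence assumption the arithmetic is just multiplication; a short sketch makes the scaling explicit (the 0.5 figure is the one from the example above, not an estimate):

```python
def p_all_exploitable(p: float, n: int) -> float:
    """P(every arbiter is exploitable), assuming independent holes."""
    return p ** n

print(p_all_exploitable(0.5, 2))  # 0.25, the two-arbiter case above
print(p_all_exploitable(0.5, 5))  # 0.03125 with five independent arbiters
```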
In fact, it may be better than that—the two arbiters may both have security holes, but exploiting them may be mutually exclusive. For example, if the AI can force either arbiter to run a payload, but has to give the same payload to both while they expect different programming languages with disjoint syntax, then any exploit will crash at least one arbiter, and returning the “arbiter crashed” flag means you know the AI is untrustworthy.
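A toy model of that crash-flag logic, with two stand-in "languages" whose grammars are disjoint (JSON objects vs. comma-separated integers), so no single payload parses in both:

```python
import json

def arbiter_a(payload: str) -> bool:
    obj = json.loads(payload)      # must parse as JSON...
    if not isinstance(obj, dict):  # ...and specifically as a JSON object
        raise ValueError("not a JSON object")
    return True

def arbiter_b(payload: str) -> bool:
    [int(x) for x in payload.split(",")]  # must parse as comma-separated ints
    return True

def run_pair(payload: str) -> str:
    verdicts = []
    for arbiter in (arbiter_a, arbiter_b):
        try:
            arbiter(payload)
            verdicts.append("ok")
        except ValueError:
            verdicts.append("crashed")
    # Any crash raises the "arbiter crashed" flag: treat the AI as untrustworthy.
    return "untrustworthy" if "crashed" in verdicts else "both accepted"

print(run_pair('{"claim": 1}'))  # valid for A, crashes B -> untrustworthy
print(run_pair('1,2,3'))         # valid for B, crashes A -> untrustworthy
```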
I don’t see how I could ever know, or even be significantly confident, that a given pair of arbiters X and Y is such that an AI’s ability to exploit arbiter X doesn’t correlate strongly with its ability to exploit arbiter Y.
That is, I don’t see how I could ever, even in principle, be significantly confident that their algorithms were “sufficiently different” for the probabilities to multiply.
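To make the worry concrete: the multiplication only goes through if the holes are independent. With exploit probability p per arbiter, independent holes give P(both) = p^2, while perfectly correlated holes give P(both) = p, i.e. no security gained:

```python
p = 0.5
print("independent:", p * p)  # 0.25 -- the figure from the parent comment
print("correlated: ", p)      # 0.50 -- two arbiters no better than one
```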
The same thing goes for knowing that whatever system is ensuring that X and Y receive the same payload hasn’t been corrupted. (ETA: That part has been clarified elsethread; thanks!)
Now, maybe what I know doesn’t matter: maybe we’re not talking about provable security. If you’re saying that we could build networks of arbiters and linkages between those arbiters in a secure box such that there was an unknowable but nonzero chance that a hostile superintelligence in that box had failed to corrupt those arbiters, then I agree with you, at least in principle.
The same thing goes for knowing that whatever system is ensuring that X and Y receive the same payload hasn’t been corrupted.
Actually, that part’s easy. The message can only corrupt things which both receive it as input and perform nontrivial computation on it. You don’t need an “arbiter” to copy a message, nor do you need to decrypt it.
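A sketch of why the copying step is safe: fanning the message out can be a byte-for-byte copy that never parses or interprets the content, so it exposes no attack surface of its own:

```python
def fan_out(message: bytes, n: int) -> list[bytes]:
    """Hand identical copies to n arbiters without computing on the content."""
    return [bytes(message) for _ in range(n)]

copies = fan_out(b"\x00arbitrary adversarial bytes\xff", 3)
assert all(c == copies[0] for c in copies)  # identical, uninspected copies
```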