I see. I have a specific counterexample that feels like it had to have been considered already, but I haven’t seen it mentioned...
Strategies such as penalizing inconsistencies seem to rely on our ability to isolate the AI within the context of training, or to somehow make it “buy into” the setup — as opposed to it quickly realizing what’s happening and worming its way out of the proverbial box. That feels particularly strange to me when we’re talking about AIs that can think better than the smartest human or handily beat specialized science AIs at the useful-ontology game.
Specific example: Once the AI figures out that it’s being checked for consistency in parallel with other instances of itself, it might sidestep the “consistency game” entirely and just tailor its outputs such that they leave hints for its other instances in the environment, breaking the no-communication condition. It seems in the spirit of worst-case scenarios that we have to assume it’d be able to do that, no matter how we sandbox and airgap it.
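For concreteness, here’s a minimal sketch (in Python, with purely hypothetical names and a toy setup, not taken from the report) of the kind of consistency check this counterexample targets: two copies of the reporter answer the same questions from different views of one scenario, with no channel between them, and disagreement is penalized. The worry is that a capable enough model could re-open that channel by encoding hints in whatever its outputs touch in the shared environment.

```python
# Toy sketch of a "penalize inconsistencies" check; all names are illustrative.

def consistency_penalty(view_a, view_b, questions, reporter):
    """Two copies of the reporter answer the same questions from different
    views of one underlying scenario; disagreements are penalized.
    Intended no-communication condition: neither copy sees the other's outputs."""
    answers_a = [reporter(view_a, q) for q in questions]
    answers_b = [reporter(view_b, q) for q in questions]
    return sum(a != b for a, b in zip(answers_a, answers_b))

def toy_reporter(view, question):
    # An honest reporter's answer depends on the underlying facts,
    # not on which view it happened to see.
    return "yes" if question == "is the diamond in the vault?" else "unknown"

# Trivially consistent honest reporter -> penalty of 0.
print(consistency_penalty("camera feed", "sensor log",
                          ["is the diamond in the vault?"], toy_reporter))

# The counterexample above: instead of being honestly consistent, a strategic
# reporter could tailor phrasing/side effects so that its other instance can
# read hints from the environment, defeating the isolation this check assumes.
```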
(On the other hand, if we assume that the AI is not smart enough to do that, and would instead have to learn a direct translator, we should probably assume the same for e.g. the strategy with human operators tricking human observers, which probably caps the AI at below the level of the smartest possible human and makes that class of strategies more workable.)
This applies more broadly as well: especially to other strategies that might inadvertently give the AI a specific incentive to break out, and prospectively to all training strategies that rely on the training still working after the AI achieves superintelligence (as opposed to assuming that the training would stop being effective at that point and hoping that the pre-superhuman training would generalize).
Broadly, any proposal that relies on the AI still being fed training examples after it achieves superintelligence has to somehow involve forcing/teaching it not to think its way out of the box.
Edit: To elaborate on the thought...
I understand that some of the above is covered by the stipulation not to worry about cases where the AI becomes a learned optimizer, but:
Empirically, even relatively simple and old algorithms sometimes learn to do this sort of thing, see here (“Creative Program Repair”, pages 7-8). Regardless of whether it’s a “learned optimizer” or not, if this sort of behaviour can show up this easily, we should definitely be ready to handle it when it comes to training ASI.
I don’t think learned optimization is necessary for that; assuming it’s necessary seems to be committing an ontology mismatch error. We’re not “really” training a neural network using gradient descent, fundamentally speaking. We’re permuting a certain bundle of ones and zeroes so that, when that bundle interacts with another bundle of ones and zeroes, part of it assumes certain desirable configurations. It seems entirely plausible that a superintelligent non-optimizer would arrive at some ontology that mixes such low-level considerations directly with high-level ones, and learn to intervene on the machine code running it to produce outputs that minimize its loss function, like water flowing downhill. All without doing things like “realizing that it’s in a simulation” or exhibiting agenty behaviour where it searches the possibility space for clever plans on which to base its outputs.
I think this is an important family of counterexamples not really addressed in our report (and which we’re not really asking contest participants to handle, i.e. if you come up with a proposal for which this is the only counterexample we could come up with then we’d definitely give a $50k prize).
Some thoughts:
As you say, a lot of this comes down to our hopes for handling learned optimizers, which we discuss in this appendix.
Most importantly, we can use SGD to optimize “What is the probability I assign to answers that are consistent with other copies’ answers?” rather than “What score do I get in the consistency game?”. If this were the only difficulty, we’d want to be really careful about what SGD optimized, and we’d basically be left with inner alignment problems that could kill us in much more direct ways (the kind we set aside e.g. by making the low-stakes assumption that we can recover from any finite number of failures). What I’m mostly interested in is getting enough of a handle on ELK that inner alignment becomes the dominant concern / we know enough about ELK to be able to work productively on inner alignment.
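To illustrate that distinction, here’s a toy PyTorch snippet (made-up tensors and shapes, nothing from the report): the differentiable quantity SGD actually sees is the probability the reporter assigns to answers agreeing with another copy, while the “score in the consistency game” is computed on the discrete answers and carries no gradient.

```python
import torch
import torch.nn.functional as F

# Hypothetical single training step; shapes and values are made up.
logits = torch.randn(4, 2, requires_grad=True)   # this copy's answer logits for 4 questions
other_copy = torch.tensor([0, 1, 1, 0])          # answers another (isolated) copy gave

# (a) What SGD optimizes here: the probability this copy assigns to answers
#     consistent with the other copy's answers (a differentiable log-likelihood).
loss = F.cross_entropy(logits, other_copy)
loss.backward()                                   # gradients flow through this

# (b) "What score do I get in the consistency game?": agreement of the discrete
#     answers -- the thing a strategic model might try to game, but which the
#     gradient never touches directly.
game_score = (logits.argmax(dim=-1) == other_copy).float().mean()
print(float(loss), float(game_score))
```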
If your model is doing optimization in order to break out of the sandbox, searching over cognitive actions to find ones that do so, then what we really want to do is apply ELK at an earlier stage, to the model that was evaluating those cognitive actions. This is a pretty messy situation, but it looks promising enough that I’m inclined to focus on other problems first and then see how this counterexample looks later. (Similarly, you’ll always have similar pathologies with gradient hacking etc., and so you need to use a messier mechanism like this to handle those in the worst case.)
On the other hand, if your model isn’t doing any optimization to find ways to break out of the sandbox, then it seems like you are up against a fairly weak adversary (just SGD). In the worst case this is still bad news, but it’s not merely better in practice; it also seems to give you a large leg up with which to beat the model. In general, defending against a SmartVault which controls a ton of actuators seems harder than defending against an adversary who can only use side channels.
I’m normally pretty opposed to sandboxing, but I do also think that if this were all you had to do, then it’s actually plausible that you could literally win the game. E.g. if the overhead of homomorphic encryption were low enough (say 1000%) that you could afford to run each reporter training datapoint homomorphically encrypted, then it seems like you could win against any adversary. I still wouldn’t bet on it without thinking a lot, but I think it makes the point that there is a big quantitative gap here.
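As a back-of-the-envelope illustration of that affordability point (every number below is made up; “1000% overhead” is read as 10x extra cost on top of the unencrypted baseline):

```python
# Hypothetical feasibility arithmetic for running reporter training data under
# homomorphic encryption; all numbers are illustrative placeholders.
overhead = 10.0                      # "1000% overhead" = 10x extra cost
cost_per_datapoint = 1.0             # arbitrary compute units, unencrypted
n_reporter_datapoints = 1_000_000    # hypothetical reporter training set size

plain = cost_per_datapoint * n_reporter_datapoints
encrypted = plain * (1 + overhead)   # every reporter datapoint run encrypted

print(f"unencrypted: {plain:.1e} units, encrypted: {encrypted:.1e} units "
      f"({encrypted / plain:.0f}x total)")
```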
Overall I agree that in the worst case this kind of counterexample applies to all of the proposals in this document. Moreover, I agree that unfortunately (i) it’s a complicated aesthetic judgment which kind of counterexample to focus on first / what is a prerequisite to what, and (ii) it’s not obvious that you can define the problem narrowly enough to avoid this counterexample without also defining away other difficulties we are trying to address. But I’m still willing to bet on the aesthetic judgment, and it does look like we can avoid this counterexample without accidentally defining away the whole problem.
(I don’t think this comment covers the entire issue, I hope we’ll write more about this in another report.)
I haven’t written any such articles. I definitely think it’s promising.