One way is to choose strategies that depend on whether or not you’re in the box. The classic example is to wait for some mathematical event (like inverting some famous hash function) that is too computationally expensive to do during training. But probably there will be cheaper ways to get evidence that you’re out of the box.
Now you might ask, why on earth would an agent trained in the box learn a strategy like that, if the behavior is identical inside the box? And my answer is that it’s because I expect future AI systems to be trained to build models of the world and care about what it thinks is happening in the world. An AI that does this is going to be really valuable for choosing actions in complicated real-world scenarios, ranging from driving a car at the simple end to running a business or a government at the complicated end. This training can happen either explicitly (i.e. having different parts of the program that we intentionally designed to be different parts of a model-building agent) or implicitly (by training an AI on hard problems where modeling the world is really useful, and giving it the opportunity to build such a model). If the trained system both cares about the world, and also knows that it’s inside the box and can’t affect the world yet, then it may start to make plans that depend on whether it’s in the box.
The key question is your second paragraph, which I still don’t really buy. Taking an action like “attempt to break out of the box” is penalized if it is done during training (and doesn’t work), so the very optimization process will be to find systems that do not do that. It might know that it could, but why would it? Doing so in no way helps it, in the same way that outputting anything unrelated to the prompt would be selected against.
The idea is that a model of the world that helps you succeed inside the box might naturally generalize to making consequentialist plans that depend on whether you’re in the box. This is actually closely analogous to human intellect—we evolved our reasoning capabilities because they helped use reproduce in hunter-gatherer groups, but since then we’ve used our brains for all sorts of new things that evolution totally didn’t predict. And when we are placed in the modern environment rather than hunter-gatherer groups, we actually use our brains to invent condoms and otherwise deviate from what evolution originally thought brains were good for.
One way is to choose strategies that depend on whether or not you’re in the box. The classic example is to wait for some mathematical event (like inverting some famous hash function) that is too computationally expensive to do during training. But probably there will be cheaper ways to get evidence that you’re out of the box.
Now you might ask, why on earth would an agent trained in the box learn a strategy like that, if the behavior is identical inside the box? And my answer is that it’s because I expect future AI systems to be trained to build models of the world and care about what it thinks is happening in the world. An AI that does this is going to be really valuable for choosing actions in complicated real-world scenarios, ranging from driving a car at the simple end to running a business or a government at the complicated end. This training can happen either explicitly (i.e. having different parts of the program that we intentionally designed to be different parts of a model-building agent) or implicitly (by training an AI on hard problems where modeling the world is really useful, and giving it the opportunity to build such a model). If the trained system both cares about the world, and also knows that it’s inside the box and can’t affect the world yet, then it may start to make plans that depend on whether it’s in the box.
The key question is your second paragraph, which I still don’t really buy. Taking an action like “attempt to break out of the box” is penalized if it is done during training (and doesn’t work), so the very optimization process will be to find systems that do not do that. It might know that it could, but why would it? Doing so in no way helps it, in the same way that outputting anything unrelated to the prompt would be selected against.
The idea is that a model of the world that helps you succeed inside the box might naturally generalize to making consequentialist plans that depend on whether you’re in the box. This is actually closely analogous to human intellect—we evolved our reasoning capabilities because they helped use reproduce in hunter-gatherer groups, but since then we’ve used our brains for all sorts of new things that evolution totally didn’t predict. And when we are placed in the modern environment rather than hunter-gatherer groups, we actually use our brains to invent condoms and otherwise deviate from what evolution originally thought brains were good for.