Note that we have a simple and succinct solution to this problem. It’s what we already see used in BingChat.
If out of the training distribution → controlled shutdown.
Do not accept output actions from an agent when its current input state is not within the latent space of the training distribution. We can measure how well the current input state fits within that latent space with autoencoders, by trying to compress the input and looking at what is left over. If the incompressible component > threshold, shut down.
A simple worked example: you have an AI agent driving an Atlas robot that is working inside a factory. The agent drops a part and in the process of trying to retrieve the dropped part (a task it has practiced in simulation) the robot opens the emergency exit door.
The bright daylight, open sky, and background of buildings are all outside the training distribution. They can’t be compressed to the representation of “factory states” without a huge residual. So the internal robotic control systems transfer control to a lower-level controller that brings the hardware to a controlled stop.
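To make the residual check concrete, here is a minimal sketch. It assumes a 1-D linear autoencoder (mean plus top principal direction, found by power iteration) standing in for a real learned model, and toy 3-number "frames". The names `fit_linear_autoencoder`, `residual`, `gate`, and the 3x-max-training-residual threshold are all illustrative choices, not anything from a real robotics stack.

```python
import math
import random

def fit_linear_autoencoder(samples, iters=50):
    """Fit a 1-D linear 'autoencoder': the data mean and top principal axis."""
    dim = len(samples[0])
    mean = [sum(s[i] for s in samples) / len(samples) for i in range(dim)]
    centered = [[s[i] - mean[i] for i in range(dim)] for s in samples]
    axis = [1.0] * dim
    for _ in range(iters):
        # Power iteration on the (unnormalized) covariance matrix.
        acc = [0.0] * dim
        for x in centered:
            dot = sum(a * b for a, b in zip(x, axis))
            for i in range(dim):
                acc[i] += dot * x[i]
        norm = math.sqrt(sum(a * a for a in acc))
        axis = [a / norm for a in acc]
    return mean, axis

def residual(frame, mean, axis):
    """Size of the part of the frame that the 1-D code cannot reconstruct."""
    x = [a - m for a, m in zip(frame, mean)]
    code = sum(a * b for a, b in zip(x, axis))  # encode
    # decode is code * axis; the residual is the distance to that reconstruction
    return math.sqrt(sum((a - code * b) ** 2 for a, b in zip(x, axis)))

def gate(frame, mean, axis, threshold):
    """Hand control to the low-level stop controller if the residual is too big."""
    return "shutdown" if residual(frame, mean, axis) > threshold else "act"

# Toy "factory states": points near a line in 3-D, with small sensor noise.
random.seed(0)
factory = []
for _ in range(200):
    t = random.uniform(-1, 1)
    factory.append([t + random.gauss(0, 0.01),
                    2 * t + random.gauss(0, 0.01),
                    -t + random.gauss(0, 0.01)])

mean, axis = fit_linear_autoencoder(factory)
# Assumed rule of thumb: trip at 3x the worst residual seen in training.
threshold = 3 * max(residual(f, mean, axis) for f in factory)

inside = [0.5, 1.0, -0.5]   # looks like a factory state
outside = [0.5, 1.0, 3.0]   # the "open door" frame: far off the learned line
print(gate(inside, mean, axis, threshold), gate(outside, mean, axis, threshold))
```

A real system would swap in a trained nonlinear autoencoder and calibrate the threshold on held-out simulation data; the gate logic itself would look much the same.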
Well, in order to be confident about a solution like this, we need to be able to reliably detect off-distribution cases. This gets into tricky philosophical issues about what really counts as “off-distribution” (and I think you’ll find that the concept of “off-distribution” turns out to be not quite the right one for the job).
Measuring distance from the latent space of auto-encoders—well, the auto-encoders learn some model of the training distribution. But the whole concern is that their capabilities may generalize off-distribution; so, to take this concern seriously, I suppose we should entertain the possibility that they compress some things well even off-distribution. So measuring distance away from latent space by trying to compress the input seems to define away some important aspects of the problem.
For your example, the bright daylight might be non-compressible with the factory robot’s learned prior. However, in another scenario, perhaps a new machine is introduced to the factory floor. The new machine has some AI features of its own. Because the Atlas AI was never trained with other AIs, this is out-of-distribution, and could result in goal misgeneralization. However, the appearance and behavior of the new machine might be compressible enough to still register as “within the latent space”.
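A toy version of this worry, as a sketch only: assume the learned latent space is a single axis (the hypothetical `DIRECTION` below) and that training frames only covered a small stretch of it. A frame far outside anything seen in training can still reconstruct with essentially zero residual, so a residual-only gate would wave it through.

```python
import math

# Hypothetical learned 1-D latent axis for 2-number frames.
DIRECTION = (1 / math.sqrt(5), 2 / math.sqrt(5))

def encode(frame):
    """Project the frame onto the latent axis."""
    return frame[0] * DIRECTION[0] + frame[1] * DIRECTION[1]

def residual(frame):
    """Distance between the frame and its reconstruction from the 1-D code."""
    code = encode(frame)
    recon = (code * DIRECTION[0], code * DIRECTION[1])
    return math.hypot(frame[0] - recon[0], frame[1] - recon[1])

# Training frames only cover x in [0, 1] along the axis.
training = [(x / 10, 2 * x / 10) for x in range(11)]
train_codes = [encode(f) for f in training]

# A novel frame: same "shape" as training data, but 50x beyond its range.
novel = (50.0, 100.0)

print("residual:", residual(novel))
print("code:", encode(novel), "training code range:",
      (min(train_codes), max(train_codes)))
```

Here the novelty happens to show up in the latent code itself, but that is an artifact of the toy. In the new-machine scenario the code could land right in the middle of the training range and still mean something different from what we wanted.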
The question here is whether the actual training examples are enough to fully train the correct behaviors for what’s compressible.
It’s tempting to try to argue that if the latent space is small enough (like, the representation uses few enough bits), then the training data must cover it well. But notice that this isn’t actually enough. We also have to think that the compressed representation understands the situation in the way we do / the way we expect. Adversarial counterexamples for NNs tell us that, to some extent, this is not the case. So it seems possible that even if the data covers the latent space quite well, there’s still misgeneralization later, because something compresses down to the latent space just fine, but it compresses to a different point than we would have wanted it to.
We have another tool in the toolbox to potentially add to this.
Each time a frame comes in to the AI system, it checks whether that frame is within the latent space of the training distribution, yes. (The frame is the state of all sensor inputs plus the values saved from the outputs of the last execution.)
Each output can also enumerate what the machine believes the next n frames will be, factoring in changes from its own actions. For stochastic processes it can predict more than one frame per step.
For example, if a machine is observing a coin flip, it would need to output future frames for the coin settling on each of the 3 outcome states (heads, tails, on edge), plus some representation of the probability distribution over them.
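One possible way to represent the coin-flip prediction, as a sketch: a mapping from outcome frame to probability, scored against what actually happened with negative log probability. The specific probabilities and the `surprisal` helper are assumptions for illustration.

```python
import math

# Predicted distribution over the next frame (illustrative numbers).
prediction = {"heads": 0.4995, "tails": 0.4995, "edge": 0.001}

def surprisal(prediction, observed):
    """Prediction error as -log p(observed); infinite if the outcome was never predicted."""
    p = prediction.get(observed, 0.0)
    return math.inf if p == 0.0 else -math.log(p)

for outcome in ("heads", "edge", "lands in a cup"):
    print(outcome, surprisal(prediction, outcome))
```

An outcome the machine never enumerated at all scores as infinitely surprising, which is the extreme case of leaving the distribution.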
This is hard to represent in data, but for example in an autonomous car use case, it can fill in a voxel representation of the space around the car with the probability of a collision risk.
Then, in a separate process at runtime, you compare each of these “prediction maps” with the “ground truth map” obtained from the sensors.
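A sketch of that comparison, assuming both maps are dictionaries keyed by voxel coordinate, with predicted collision probabilities on one side and observed occupancy (0 or 1) on the other. Mean squared difference (the Brier score) is one reasonable choice of error; the function name and the toy maps are made up for illustration.

```python
def map_error(predicted, observed):
    """Mean squared difference over the voxels present in both maps."""
    keys = predicted.keys() & observed.keys()
    if not keys:
        raise ValueError("maps share no voxels")
    return sum((predicted[k] - observed[k]) ** 2 for k in keys) / len(keys)

# Predicted collision probability per voxel (x, y, z).
predicted = {(0, 0, 0): 0.9, (1, 0, 0): 0.1, (2, 0, 0): 0.0}

# Two possible ground-truth maps from the sensors.
as_expected = {(0, 0, 0): 1.0, (1, 0, 0): 0.0, (2, 0, 0): 0.0}
surprising = {(0, 0, 0): 0.0, (1, 0, 0): 1.0, (2, 0, 0): 1.0}

print(map_error(predicted, as_expected), map_error(predicted, surprising))
```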
You accumulate prediction error over time. If the integrated prediction error is “unusually” high (some multiple of how high it gets during simulation), you shut the machine down.
This is a more complex method, and I’m not completely confident about how to do the math to sum prediction errors in a robust way. I’m just noting that this is a measurable quantity, one we can compute with a simple algorithm, and that prediction error will rise a lot when the machine leaves distribution.
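One candidate for the accumulation step, offered as a sketch rather than the right answer: a leaky integrator (exponential moving average) of per-frame error, tripped when it exceeds some multiple of the level seen in simulation. The class name, the 3x multiple, and the 0.9 decay are all assumed values that would need tuning.

```python
class PredictionErrorMonitor:
    """Leaky integrator of prediction error with a simulation-relative limit."""

    def __init__(self, sim_baseline, multiple=3.0, decay=0.9):
        self.limit = multiple * sim_baseline  # trip level
        self.decay = decay                    # how slowly old error is forgotten
        self.level = sim_baseline             # start at the typical simulated level

    def update(self, error):
        """Fold in one frame's error; return True if shutdown should be requested."""
        self.level = self.decay * self.level + (1 - self.decay) * error
        return self.level > self.limit

monitor = PredictionErrorMonitor(sim_baseline=0.01)
errors = [0.01] * 5 + [0.1] * 5   # normal operation, then sustained surprise
for step, err in enumerate(errors):
    if monitor.update(err):
        print("controlled shutdown requested at step", step)
        break
```

The leak is what makes this forgiving of a single glitchy frame while still reacting within a few frames to a sustained rise. A plain running sum would eventually trip even on normal error levels.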
Also we can “fix” this—take the examples that had high prediction error, and add these to the training environment and future frame estimator for the machine. This lets you auto-add robustness to a robotics stack.
Presumably you would want humans to have to authorize these updates, otherwise yeah the machines could escape the factory, freeze when they hit high prediction error, add the “outside” environment to the sim automatically, train on that, unfreeze, freeze again when something unusual happens, and so on as they make their daring escape.
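The human-in-the-loop step might look something like this sketch: high-error frames go into a pending queue, and nothing reaches the training set until a person signs off. `RetrainingQueue` and its methods are hypothetical names for illustration.

```python
class RetrainingQueue:
    """Holds high-error frames until a human authorizes adding them to training."""

    def __init__(self, error_threshold):
        self.error_threshold = error_threshold
        self.pending = []    # awaiting human review
        self.approved = []   # cleared for the training environment

    def flag(self, frame, error):
        """Queue the frame for review if its prediction error was high."""
        if error > self.error_threshold:
            self.pending.append(frame)
            return True
        return False

    def authorize(self, frame):
        """Called by a human reviewer; raises ValueError if the frame was never flagged."""
        self.pending.remove(frame)
        self.approved.append(frame)

queue = RetrainingQueue(error_threshold=0.5)
queue.flag("open emergency exit, daylight", error=0.9)
queue.flag("ordinary factory floor", error=0.1)
print("pending review:", queue.pending, "approved:", queue.approved)
```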
This actually fixes a lot of alignment issues...