Yes, I do. I’ve spoken privately with an engineer on AI safety at OpenAI and he agrees this can work.
It’s an extension of existing industrial safety mechanisms, including the ones that are used on autonomous cars today.
The current method is:
There are 2 systems controlling a car operating on autopilot. One has a sophisticated policy to control the car, using a pipeline of neural networks and software modules. The other is a microcontroller with a dead simple policy: command an increasing braking force over time.
Each timestep, the microcontroller software checks the telemetry stream emitted by the higher level system for anything that may indicate a failure: timeouts, corrupt packets, or packets containing specific values that indicate an error or low confidence.
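A minimal sketch of the microcontroller-side check, written in Python for readability; the packet fields, thresholds, and the receive_packet/apply_brake_force hooks are assumptions for illustration, not any vendor’s actual interface:

```python
import time

# Illustrative telemetry packet format (an assumption for this sketch):
# {"seq": int, "crc_ok": bool, "status": "OK" | "ERROR", "confidence": float}

TIMEOUT_S = 0.1          # declare failure if no packet arrives within 100 ms
MIN_CONFIDENCE = 0.5     # declare failure if the high-level stack reports low confidence

def packet_indicates_failure(packet) -> bool:
    """Return True if a single telemetry packet signals trouble upstream."""
    if packet is None:                    # timeout: nothing arrived this timestep
        return True
    if not packet.get("crc_ok", False):   # corrupt packet
        return True
    if packet.get("status") == "ERROR":   # explicit error flag from the high-level system
        return True
    if packet.get("confidence", 1.0) < MIN_CONFIDENCE:  # low-confidence value
        return True
    return False

def watchdog_loop(receive_packet, apply_brake_force):
    """Each timestep: read telemetry; on any sign of failure, ramp braking force over time."""
    brake_force = 0.0
    while True:
        packet = receive_packet(timeout=TIMEOUT_S)
        if packet_indicates_failure(packet):
            brake_force = min(1.0, brake_force + 0.05)  # increase braking each timestep
            apply_brake_force(brake_force)
        else:
            brake_force = 0.0  # high-level system healthy; stay passive
        time.sleep(0.01)
```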
Based on its publicly released information, Waymo extends this approach to 3 systems: a high level system, a mid level system that is approximately a lane control system on a separate power source, and the microcontroller. The mid level system is a neural network attached to a single camera, and it emits steering commands to bring the vehicle to a safe stop.
This obviously extends to n levels of system. There can be multiple AI systems, each able to take over control, where the lower level systems use simpler, more robust policies that are more likely to work in a larger variety of situations. Whenever a higher level system becomes empirically uncertain (a human-authored module measures this; one way is to check whether the inputs to the high level system fall within the latent space of its training set), control shifts to the first lower level system that has sufficient confidence, devolving ultimately to the microcontroller.
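A sketch of what that human-authored arbitration loop might look like; the controller names and the confidence interface are illustrative assumptions, not a description of any deployed stack:

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Controller:
    name: str
    confident: Callable[[Any], bool]   # human-authored check, e.g. an in-distribution test on the inputs
    act: Callable[[Any], Any]          # this controller's policy

def arbitrate(controllers: List[Controller], fallback_act: Callable[[Any], Any], obs: Any) -> Any:
    """Hand control to the first (most capable) controller that reports confidence on this
    observation; if none does, devolve all the way down to the microcontroller policy."""
    for ctrl in controllers:           # ordered from most capable to simplest
        if ctrl.confident(obs):
            return ctrl.act(obs)
    return fallback_act(obs)           # e.g. the braking microcontroller

# Illustrative stack for the driving case:
#   full driving policy  ->  lane-keeping safe-stop policy  ->  brake-ramp microcontroller
# The framework itself (this loop and the confidence checks) is fixed, human-authored code.
```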
The ‘aha’ I want you to have is that we know the AI policy is safe within the space of the training simulator; we measured it. The training simulator is supposed to be a reasonably accurate facsimile of the real world, and since we now know how to auto-generate realistic situations from real world data, we can keep improving the simulation’s breadth. Whenever the system finds itself in an out of distribution situation, its inputs describe a situation in which it was never shown to behave safely. So we shut that system down; but because the AI system may be performing a task in the real world and can’t simply be switched off, we have to transfer control to a system that can shut it down safely.
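One plausible way to implement the ‘is this input inside the training distribution’ check is a Mahalanobis distance over the high level system’s encoder features; the encoder hook and the threshold below are assumptions for illustration, not a claim about how any particular system does it:

```python
import numpy as np

class LatentOODDetector:
    """Flags inputs whose encoder features lie far from the training-set feature
    distribution, using squared Mahalanobis distance as the score."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def fit(self, train_features: np.ndarray):
        # train_features: (num_examples, feature_dim) array of encoder outputs
        self.mean = train_features.mean(axis=0)
        cov = np.cov(train_features, rowvar=False)
        self.cov_inv = np.linalg.pinv(cov)   # pseudo-inverse for numerical safety
        return self

    def is_out_of_distribution(self, features: np.ndarray) -> bool:
        diff = features - self.mean
        mahalanobis_sq = float(diff @ self.cov_inv @ diff)
        return mahalanobis_sq > self.threshold

# Usage sketch (encode() stands in for the high-level system's feature extractor):
#   detector = LatentOODDetector(threshold=50.0).fit(encode(training_inputs))
#   if detector.is_out_of_distribution(encode(current_input)):
#       transfer_control_to_lower_level()
```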
Example: a robot is operating in a paperclip factory and escapes outside. On the first frame in which it perceives ‘outside’, it recognizes that the current perception falls outside the latent space of the training examples, which were all ‘inside’. Control transfers to a network that manages ‘robot momentum’, one trained to stop the robot hardware in as little space as possible without it falling over, and the robot halts a few feet from the exit.
It’s a pretty clear and actionable approach. Human organizations already do this, human brains seem to have a structure like this (for example, our high level system can order us not to breathe, but if we pass out a low level system takes over respiration), and so on.
Note that this means every AI system exists inside an outer framework, authored by humans, that is permanently immutable: no self modification. And you would need to formally verify the software modules that form the structure of the framework.
There are very likely a number of ‘rules’ like this that have to be obeyed or AI safety is impossible. It’s similar to how we could have built nuclear reactors to work more like the demon core experiment, where a weapons-grade core is pushed toward criticality by a neutron reflector. That design would leave every reactor one mechanical failure away from a prompt criticality incident and an explosion.
You also need interpretability, and you need to restrict deception with the speed prior.
For the purpose of the present discussion, I note that if your plan needs interpretability, then that would be a cause for concern, and a reason for slowing down AGI. The state of interpretability is currently very bad, and there seem to be lots of concrete ways to make progress right now.
Separately, I don’t think your plan (as I understand it) has any hope of addressing the hardest and most important AGI safety problems. But I don’t want to spend the (considerable) time to get into a discussion about that, so I’ll duck out of that conversation, sorry. (At least for now.)
That is unfortunately not a helpful response. If this simple plan, which is already in use in actual AI systems in the real world today, won’t work, that is critical information!
What is the main flaw? It costs you little to mention the biggest problem.
I agree with this, and therefore I’m both more optimistic and think that we should not be alarmed at the pace of progress. In other words, I disagree with the idea of slowing down AGI progress.