Hmm, Fermi invented the idea of control rods before building the first-ever nuclear reactor, and it worked as designed to control the nuclear reaction. So that’s at least one good example that we can hope to follow. I’m not sure what your last paragraph is referring to. For that first nuclear reactor, the exponential growth happened pretty much exactly as Fermi had calculated in advance, IIRC.
OK anyway, there’s a category of AGI safety work that we might call “Endgame Safety”, where we’re trying to do all the AGI safety work that we couldn’t (or didn’t) do ahead of time, in the very last moments before (or even after) people are actually playing around with the kind of powerful AGI algorithms that could get irreversibly out of control. I think we both agree that lots of the essential AGI safety work is in the category of “Endgame Safety”. I don’t know what the fraction is, but it seems that you and I are both agreeing that the fraction is not literally 100%.
(For my part, I wouldn’t be too surprised if Endgame Safety were 90% of the total useful person-hours of AGI safety, but I hope that lots of important conceptual / deconfusion work can be done further ahead, since those things sometimes take lots of wall-clock time.)
And as long as the fraction (AGI endgame safety work) / (all AGI safety work) is not literally 100%—i.e., as long as there is any AGI safety research whatsoever that we can do ahead of time—then we now have the core of an argument that slowing down AGI would be helpful.
For example, if AGI happens in 5 years, we can be frantically doing Endgame Safety starting in 5 years. And if AGI happens in 50 years, we can be frantically doing Endgame Safety starting in 50 years. What does it matter? Endgame Safety is going to be a frantic rush either way. But in the latter case, we’ll have more time to nail down everything that’s not Endgame Safety. And we’ll also have more time to do other useful things like outreach / field-building, to get from the current world where only a small fraction of people in AI / ML understand even really basic things like instrumental convergence and s-risk-mitigation 101, to a future world where the fraction is higher.
(You can make an argument that doing Endgame Safety in 50 years would be harder than doing Endgame Safety in 5 years because of other separate ways that the world would be different, e.g. bigger hardware overhang or whatever, but that’s a different argument that you didn’t seem to be making.)
Steven, how many months before construction of the Chicago pile started did Fermi’s design team do the work on the control rods? There’s also a large difference between the idea of control rods (we have lots of ideas how to do AGI control mechanisms and no doubt some of them do work) and an actual machined control rod with enough cadmium/boron/etc. to actually work.
In terms of labor hours, going from idea to working rod was probably >99% of the effort, even after discovering empirically which materials act as neutron absorbers.
“we have lots of ideas how to do AGI control mechanisms and no doubt some of them do work”
I think AGI safety is in a worse place than you think it is.
It seems that you think we already have at least one plan for Safe & Beneficial AGI with no problems that are foreseeable at this point; it has been red-teamed to death and emerged unscathed given the information available, and we’re not going to get any further until we’re deeper into the implementation.
Whereas I think that we have zero plans for which we can say “given what we know now, we have strong reason to believe that successfully implementing / following this plan would give us Safe & Beneficial AGI”.
I also think that just because you have code that reliably trains a deceptive power-seeking AGI, sitting right in front of you and available to test, that doesn’t mean you know how to write code that reliably trains a non-deceptive, corrigible AGI. Especially since one of the problems we’re trying to solve right now is that it seems very hard to tell whether an AGI is deceptive / corrigible / etc.
Maybe the analogy for me would be that Fermi has a vague idea: “What if we use a rod made of neutron-absorbing material?” But there are no known neutron-absorbing materials. So Fermi sets about testing materials to see if he can find any that absorb neutrons. Meanwhile, DeepPile develops technology that makes it easier and easier to purify uranium, publishes all the details, and starts building its own piles, on the theory that once the piles are hitting criticality, it will be easier to test possible control mechanisms. This is not helping Fermi! He already has a concrete problem that he’s stuck on, and he already has what he needs to make progress; he just needs time.
Yes, I do. I’ve spoken privately with an engineer working on AI safety at OpenAI, and he agrees this can work.
It’s an extension of existing industrial safety mechanisms, including the ones that are used on autonomous cars today.
The current method is:
There are 2 systems controlling a car operating on autopilot. One has a sophisticated policy to control the car, using a pipeline of neural networks and software modules. The other is a microcontroller with a dead-simple policy: command an increasing braking force over time.
Each timestep, the microcontroller software looks for anything in the telemetry stream emitted by the higher-level system that may indicate a failure: timeouts, corrupt packets, or packets containing specific values that indicate an error or low confidence.
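To make that pattern concrete, here is a minimal sketch of such a watchdog loop in Python. It is not any vendor’s actual implementation; `read_packet`, `apply_brake`, and the packet fields are hypothetical names, and a real version would run on the microcontroller itself.

```python
import time

TIMEOUT_S = 0.1        # maximum silence tolerated from the high-level system
CONF_THRESHOLD = 0.9   # telemetry confidence below this counts as a failure

def watchdog_loop(read_packet, apply_brake):
    """Dead-simple fallback policy: if the high-level system looks broken,
    ramp up braking force over time until the vehicle stops."""
    brake_force = 0.0
    while True:
        packet = read_packet(timeout=TIMEOUT_S)     # assumed to return None on timeout
        failure = (
            packet is None                          # timeout
            or not packet.checksum_ok               # corrupt packet
            or packet.error_flag                    # explicit error value
            or packet.confidence < CONF_THRESHOLD   # low confidence
        )
        if failure:
            brake_force = min(1.0, brake_force + 0.05)  # increase braking each step
        else:
            brake_force = 0.0                           # high-level system is healthy
        apply_brake(brake_force)
        time.sleep(0.01)                                # fixed timestep
```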
Waymo extends this approach to 3 systems, based on their publicly released information: one high-level system; one mid-level system, roughly a lane-control system on a separate power source; and the microcontroller. The mid-level system is a neural network attached to one camera, and it emits steering commands to bring the vehicle to a safe stop.
This can obviously be extended to n levels of systems. There can be multiple AI systems, each able to take over control, where the lower-level systems use simpler, more robust policies that are more likely to work in a larger variety of situations. Every time the higher-level system becomes empirically uncertain (a human-authored module is what measures this; one way is to check whether the inputs to the high-level system fall within the latent space of the training set), control shifts to the first system that has sufficient confidence, devolving ultimately to the microcontroller.
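As a toy illustration of the n-level handoff (my own sketch, not anyone’s real stack), the arbitration logic can be as simple as walking down an ordered list of controllers and giving control to the first one whose human-authored confidence check accepts the current inputs:

```python
def select_action(controllers, observation):
    """controllers: list of (is_confident, act) pairs, ordered from the most
    capable policy down to the simplest fallback (e.g. the microcontroller),
    which is assumed to always accept control."""
    for is_confident, act in controllers[:-1]:
        if is_confident(observation):   # human-authored confidence / in-distribution check
            return act(observation)
    _, fallback_act = controllers[-1]
    return fallback_act(observation)    # devolve all the way to the simplest policy
```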
The ‘aha’ I want you to have is that we know the AI policy is safe within the space of the training simulator. We measured it. The training simulator is supposed to be a reasonably accurate facsimile of the real world, and we now know how to auto-generate realistic situations from real-world data, so we can keep improving the simulation’s breadth. Whenever the system finds itself in an out-of-distribution situation, that means it now has inputs describing a situation in which it has not been shown to do the safe thing. So we shut that system down; but since the AI system may be doing a task in the real world and can’t simply be switched off, we have to transfer control to a system that can safely shut it down.
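One way (of many) to implement the “is this input inside the latent space of the training set?” check is to embed the input with the training-time encoder and compare its distance to the nearest training embedding against a threshold calibrated on in-distribution data. The sketch below is illustrative only; the encoder, the threshold rule, and the class name are my own assumptions, not the method any particular lab uses.

```python
import numpy as np

class LatentSpaceOODCheck:
    """Flags inputs whose embeddings are far from everything seen in training."""

    def __init__(self, encoder, train_inputs, quantile=0.999):
        self.encoder = encoder
        self.train_z = np.stack([encoder(x) for x in train_inputs])
        # Calibrate the threshold from nearest-neighbor distances within the training set.
        calib = [self._nearest_dist(z, exclude_self=True) for z in self.train_z]
        self.threshold = np.quantile(calib, quantile)

    def _nearest_dist(self, z, exclude_self=False):
        dists = np.sort(np.linalg.norm(self.train_z - z, axis=1))
        return dists[1] if exclude_self and len(dists) > 1 else dists[0]

    def in_distribution(self, x):
        # False means "out of distribution": hand control to a simpler system.
        return self._nearest_dist(self.encoder(x)) <= self.threshold
```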
Example: a robot is operating in a paperclip factory. It escapes outside. On the frame where it perceives “outside”, it recognizes that the current perception is outside the latent space of the training examples, which were all ‘inside’. It transfers control to a network that manages ‘robot momentum’, one trained to stop the robot hardware in as little space as possible without it falling over, and it halts a few feet from the exit.
It’s a pretty clear and actionable approach. Human organizations already do this, human brains seem to have a structure like this (example: our high-level system can order us not to breathe, but if we pass out, a low-level system engages respiration), and so on.
Note that this means every AI system exists inside an outer framework, authored by humans, that is permanently immutable. No self-modification. And you would need to formally verify the software modules that form the structure of the framework.
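To illustrate that constraint (again just a sketch under my own assumptions, with hypothetical names), the learned policy is treated as opaque data exposed through a narrow observation-to-action interface, while the human-authored checks and fallbacks live in framework code the policy never gets a handle to:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)   # the framework's wiring cannot be reassigned at runtime
class OuterFramework:
    policy_act: Callable[[Any], Any]        # learned component, treated as opaque data
    in_distribution: Callable[[Any], bool]  # human-authored check (e.g. the one sketched above)
    fallback_act: Callable[[Any], Any]      # simple, robust stop/shutdown policy

    def step(self, observation):
        if self.in_distribution(observation):
            return self.policy_act(observation)
        return self.fallback_act(observation)   # devolve control; no path for the policy to override
```

The framework modules here are the human-authored pieces one would try to formally verify.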
There are very likely a number of ‘rules’ like this that have to be obeyed or AI safety is impossible. It’s similar to how we could have made nuclear reactors work more like the demon core experiment, where a critical mass is created via a neutron reflector and the core is weapons-grade. That would put every reactor one mechanical failure away from a prompt-critical incident that would cause an explosion.
You also need interpretability, and you need to restrict deception using the speed prior.
For the purpose of the present discussion, I note that if your plan needs interpretability, then that is itself a cause for concern and a reason for slowing down AGI: the state of interpretability is currently very bad, and there seem to be lots of concrete ways to make progress on it right now.
Separately, I don’t think your plan (as I understand it) has any hope of addressing the hardest and most important AGI safety problems. But I don’t want to spend the (considerable) time to get into a discussion about that, so I’ll duck out of that conversation, sorry. (At least for now.)
That is unfortunately not a helpful response. If this simple plan, which is already in use in real-world AI systems today, won’t work, that is critical information!
What is the main flaw? It costs you little to mention the biggest problem.
I agree with this, and therefore I’m both more optimistic and of the view that we should not be alarmed at the pace of progress. Or in other words, I disagree with the idea of slowing down AGI progress.