I see one critical flaw here.

Why does anyone assume ANY progress will be made on alignment if we don’t have potentially dangerous AGIs in existence to experiment with?
A second issue is that at least the current model for ChatGPT REQUIRES human feedback to get smarter, and the greater the scale of its user base, the smarter it can potentially become.
Other systems designed to scale to AGI may have to be trained this way: initial training from test environments and static human text, then refinement from interaction with live humans. In that world the company with the most users has runaway success, because more users means the system learns faster, and a faster-improving system attracts still more users.
This dependency graph means we may have no choice but to proceed with AGI on the most expedient timescale. Once early systems attempt bad behavior and fail, we can then study them in large, well funded, isolated labs to discover ways to mitigate the issues.
How much progress did humans make on heavier-than-air flight before they had actual airplanes?
Nobody invented jet engines or computational fluid dynamics before there were many generations of aircraft and many air battles to justify the expense.
“Why does anyone assume ANY progress will be made on alignment if we don’t have potentially dangerous AGIs in existence to experiment with?”
I’m obviously biased, but I think we should assume this based on what we see with our eyes—we can look around and note that more than zero progress on alignment is being made right now.
If you think that “What Paul Christiano is doing right now is just totally useless, he might as well switch fields, do some cool math or whatever, and have a more relaxing time until real-deal AGIs show up, it would make no difference whatsoever”, and you also think that same thing about Scott Garrabrant, Vanessa Kosoy, John Wentworth, Anthropic, Redwood Research, Conjecture, me (cf. here & here), etc. etc.—well, you’re obviously entitled to believe that, but I would be interested to hear a more detailed argument if you have time, not just analogizing to other fields. (Although, I do think that if the task was “make ANY progress on heavier-than-air flight before we have any actual airplanes”, this task would be easily achievable, because “any” is a very low bar! You could do general research towards stiff and light structures, towards higher-power-to-weight-ratio engines, etc.) For example, Eliezer Yudkowsky is on the very skeptical end of opinions about ongoing AGI safety research, but he seems to strongly believe that doing interpretability research right now is marginally helpful, not completely useless.
Fair. “Any” is an unfair comparison.

On the other hand, for every past technology humans made, whether they researched it for decades first or rushed it out with young engineers, I am not actually sure the up-front research made any difference. There is no way to falsify this, but pretty much every technology we built had crippling, often lethal-to-humans flaws in its first versions.
My point is there is immense information gain from actually fully constructing and testing a technology, and further large gains from deployment to scale.
Whereas if you don’t have any of that, the possibility space is much larger.
For example, some propose that LLMs as they currently exist could exhibit rampant behavior. This may be true, or it may be completely false, whether because the RLHF step discouraged models that can exhibit such traits or for some other reason.
Prior to fission reactors existing, nuclear scientists may have been concerned about prompt criticality detonating power reactors. This has happened only once, possibly twice.
Hmm, Fermi invented the idea of control rods before building the first-ever nuclear reactor, and it worked as designed to control the nuclear reaction. So that’s at least one good example that we can hope to follow. I’m not sure what your last paragraph is referring to. For that first nuclear reactor, the exponential growth happened pretty much exactly as Fermi had calculated in advance, IIRC.
OK anyway, there’s a category of AGI safety work that we might call “Endgame Safety”, where we’re trying to do all the AGI safety work that we couldn’t (or didn’t) do ahead of time, in the very last moments before (or even after) people are actually playing around with the kind of powerful AGI algorithms that could get irreversibly out of control. I think we both agree that lots of the essential AGI safety work is in the category of “Endgame Safety”. I don’t know what the fraction is, but it seems that you and I are both agreeing that the fraction is not literally 100%.
(For my part, I wouldn’t be too surprised if Endgame Safety were 90% of the total useful person-hours of AGI safety, but I hope that lots of important conceptual / deconfusion work can be done further ahead, since those things sometimes take lots of wall-clock time.)
And as long as the fraction (AGI endgame safety work) / (all AGI safety work) is not literally 100%—i.e., as long as there is any AGI safety research whatsoever that we can do ahead of time—then we now have the core of an argument that slowing down AGI would be helpful.
For example, if AGI happens in 5 years, we can be frantically doing Endgame Safety starting in 5 years. And if AGI happens in 50 years, we can be frantically doing Endgame Safety starting in 50 years. What does it matter? Endgame Safety is going to be a frantic rush either way. But in the latter case, we can have more time to nail down everything that’s not Endgame Safety. And we can also have more time to do other useful things like outreach / field-building—to get from the current world where only a small fraction of people in AI / ML understand even really basic things like instrumental convergence and s-risk-mitigation 101, to a future world where the fraction is higher.
(You can make an argument that doing Endgame Safety in 50 years would be harder than doing Endgame Safety in 5 years because of other separate ways that the world would be different, e.g. bigger hardware overhang or whatever, but that’s a different argument that you didn’t seem to be making.)
Steven, how many months before construction of the Chicago Pile started did Fermi’s design team do the work on the control rods? There’s also a large difference between the idea of control rods—we have lots of ideas how to do AGI control mechanisms and no doubt some of them do work—and an actual machined control rod with enough cadmium/boron/etc. to work.
In terms of labor hours, going from idea to working rod was probably >99% of the effort. Even after discovering empirically which materials act as neutron absorbers.
“we have lots of ideas how to do AGI control mechanisms and no doubt some of them do work”
I think AGI safety is in a worse place than you do.
It seems that you think that we already have at least one plan for Safe & Beneficial AGI that has no problems foreseeable at this point, that it has been red-teamed to death and emerged unscathed given the information available, and that we’re not going to get any further until we’re deeper into the implementation.
Whereas I think that we have zero plans for which we can say “given what we know now, we have strong reason to believe that successfully implementing / following this plan would give us Safe & Beneficial AGI”.
I also think that, just because you have code that reliably trains a deceptive power-seeking AGI, sitting right in front of you and available to test, doesn’t mean that you know how to write code that reliably trains a non-deceptive corrigible AGI. Especially when one of the problems we’re trying to solve right now is the issue that it seems very hard to know whether an AGI is deceptive / corrigible / etc.
Maybe the analogy for me would be that Fermi has a vague idea “What if we use a rod made of neutron-absorbing material?”. But there are no known neutron-absorbing materials. So Fermi starts going about testing materials to see if he can find any that absorb neutrons. Meanwhile, DeepPile develops technology that makes it easier and easier to purify uranium, and publishes all the details, and starts building its own piles, on the theory that when the piles are hitting criticality, it would make it easier to test possible control mechanisms. This is not helping Fermi! He already has a concrete problem that he’s stuck on, and he already has what he needs to make progress, he just needs time.
Yes, I do. I’ve spoken privately with an engineer on AI safety at OpenAI and he agrees this can work.
It’s an extension of existing industrial safety mechanisms, including the ones that are used on autonomous cars today.
The current method is:
There are 2 systems controlling a car operating on autopilot. One has a sophisticated policy to control the car, using a pipeline consisting of neural networks and software modules. The other is a microcontroller which has a dead simple policy to order a braking force increase over time.
Each timestep, the microcontroller software looks for information in the telemetry stream the higher level system emits that may indicate a failure. This means timeouts, corrupt packets, or packets containing specific values that indicate an error or low confidence.
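To make the shape of this concrete, here is a minimal sketch of that watchdog loop in Python. This is not anyone’s actual production code; the packet fields, thresholds, and the read_packet / set_brake_force hooks are all hypothetical. It just shows the “dead simple policy” doing nothing more than checking telemetry health and ramping the brakes when the check fails.

```python
import time

TIMEOUT_S = 0.1        # max allowed gap between telemetry packets
MIN_CONFIDENCE = 0.9   # below this, treat the high level system as failed
BRAKE_RAMP = 0.05      # braking force added per timestep once the fallback engages


def telemetry_is_healthy(packet, now):
    """Return False on timeout, corruption, an explicit error code, or low confidence."""
    if packet is None or now - packet["timestamp"] > TIMEOUT_S:
        return False  # timeout / missing packet
    if not packet.get("checksum_ok", False):
        return False  # corrupt packet
    if packet.get("error_code", 0) != 0:
        return False  # high level system reported an error
    if packet.get("confidence", 0.0) < MIN_CONFIDENCE:
        return False  # high level system reported low confidence
    return True


def watchdog_loop(read_packet, set_brake_force):
    """Dead simple fallback policy: ramp braking force while telemetry looks unhealthy.

    read_packet and set_brake_force are hypothetical hooks supplied by the vehicle platform.
    """
    brake = 0.0
    while True:
        packet = read_packet()
        if telemetry_is_healthy(packet, time.time()):
            brake = 0.0  # high level system stays in control
        else:
            brake = min(1.0, brake + BRAKE_RAMP)  # order increasing braking force
        set_brake_force(brake)
        time.sleep(0.01)  # one timestep
```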
Waymo extends this approach to 3 systems, based on their publicly released information. One high level system, one mid level system that is approximately a lane control system on a separate power source, and the microcontroller. The mid level system is a neural network attached to one camera, and it emits steering commands to bring the vehicle to a safe stop.
This obviously can be extended to n levels of system. There can be multiple AI systems, each able to take over control, where the lower level systems use simpler, more robust policies that are more likely to work in a larger variety of situations. Every time the higher level system becomes empirically uncertain—a module authored by humans measures this; one way is to check whether the inputs to the high level system fall within the latent space of the training set—control shifts to the first lower system that has sufficient confidence, ultimately devolving to the microcontroller.
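Here is a minimal sketch, with hypothetical names, of what that n-level arbitration could look like: walk the chain from the most capable controller to the simplest, and hand control to the first one whose human-authored confidence check passes. The microcontroller-style policy at the end of the chain always accepts.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Controller:
    name: str
    is_confident: Callable[[Any], bool]  # human-authored check, e.g. an in-distribution test
    act: Callable[[Any], Any]            # the controller's policy


def select_action(controllers: List[Controller], observation: Any) -> Any:
    """Controllers are ordered from most capable to most simple/robust."""
    for controller in controllers:
        if controller.is_confident(observation):
            return controller.act(observation)
    # Last resort: the microcontroller-style policy, whose check should always pass,
    # but fall through to it explicitly just in case.
    return controllers[-1].act(observation)


# Example wiring (all names hypothetical):
# chain = [
#     Controller("high_level_planner", planner_in_distribution, planner_policy),
#     Controller("lane_keeper",        lane_keeper_check,       lane_keeper_policy),
#     Controller("brake_ramp",         lambda obs: True,        brake_ramp_policy),
# ]
# action = select_action(chain, current_observation)
```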
The ‘ah ha’ I want you to have is that we know the AI policy is safe within the space of the training simulator. We measured it. The training simulator is supposed to be a reasonably accurate facsimile of the real world, and we now know how to auto-generate realistic situations from real world data, so we can keep improving the simulation’s breadth. Whenever the system finds itself in an out-of-distribution situation, it has inputs describing a situation in which it was not shown to do the safe thing. So we shut that system down; but since the AI system may be doing a task in the real world and can’t simply be turned off, we have to transfer control to a system that can safely shut it down.
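One possible way (of several) to implement the “do these inputs fall into the latent space of the training set?” check is to embed the input and compare its distance to the nearest stored training-set embeddings against a threshold calibrated on held-out in-distribution data. The sketch below is illustrative only; embed_fn and the brute-force nearest-neighbor search are stand-ins for whatever encoder and index a real system would use.

```python
import numpy as np


class LatentSpaceMonitor:
    """Flags inputs whose embeddings sit far from everything seen in training."""

    def __init__(self, embed_fn, train_embeddings: np.ndarray, quantile: float = 0.99):
        self.embed_fn = embed_fn       # e.g. an encoder network's forward pass
        self.train = train_embeddings  # embeddings of the training set, shape (N, D)
        # Calibrate: distance from each training point to its nearest *other* training point.
        d = self._nn_dist(self.train, exclude_self=True)
        self.threshold = float(np.quantile(d, quantile))

    def _nn_dist(self, queries: np.ndarray, exclude_self: bool = False) -> np.ndarray:
        # Brute-force Euclidean distances; fine for a sketch, use an ANN index at scale.
        dists = np.linalg.norm(queries[:, None, :] - self.train[None, :, :], axis=-1)
        if exclude_self:
            np.fill_diagonal(dists, np.inf)
        return dists.min(axis=1)

    def in_distribution(self, raw_input) -> bool:
        z = np.asarray(self.embed_fn(raw_input))[None, :]
        return bool(self._nn_dist(z)[0] <= self.threshold)
```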
Example—a robot is operating in a paperclip factory. It escapes outside. On the frame it perceives “outside”, it recognizes that the current perception is outside the latent space of the training examples, which were all ‘inside’. It transfers control to a network that manages ‘robot momentum’, something trained to stop the robot hardware in as little space as possible without it falling over—and halts a few feet from the exit.
It’s a pretty clear and actionable approach. Human organizations already do this, human brains seem to have a structure like this (example: our high level system can order us not to breathe but if we pass out a low level system engages respiration), and so on.
Note that this means every AI system exists in an outer framework authored by humans that is permanently immutable. No self-modification. And you would need to formally verify the software modules that form the structure of the framework.
There very likely are a number of ‘rules’ like this that have to be obeyed or AI safety is impossible. It’s similar to how we could have built nuclear reactors to work more like the demon core experiment, where a critical mass is approached via a neutron reflector and the core is weapons grade. That would put every reactor one mechanical failure away from a prompt-critical incident that would cause an explosion.
You also do need interpretability and to restrict deception with the speed prior.
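(For readers unfamiliar with the term: the speed prior is, roughly, a Schmidhuber-style prior that penalizes hypotheses for computation time as well as description length. One commonly quoted simplified form weights each program $p$ that outputs $x$ by both its length $\ell(p)$ and its running time $t(p)$:

$$S(x) \;\propto\; \sum_{p \,:\, p\ \text{outputs}\ x} \frac{2^{-\ell(p)}}{t(p)},$$

the hope in this context being that deceptive strategies need extra computation and so get down-weighted.)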
For the purpose of the present discussion, I note that if your plan needs interpretability, then that would be a cause for concern, and a reason for slowing down AGI. The state of interpretability is currently very bad, and there seem to be lots of concrete ways to make progress right now.
Separately, I don’t think your plan (as I understand it) has any hope of addressing the hardest and most important AGI safety problems. But I don’t want to spend the (considerable) time to get into a discussion about that, so I’ll duck out of that conversation, sorry. (At least for now.)
That is unfortunately not a helpful response. If this simple plan—which is already what is in use in the real world in actual AI systems today—won’t work, this is critical information!
What is the main flaw? It costs you little to mention the biggest problem.
I agree with this, and therefore I’m more optimistic and think that we should not be alarmed at the pace of progress. Or in other words, I disagree with the idea of slowing down AGI progress.
Yeah, this is a big problem I have with alignment people. They forget that if we don’t have iteration, we don’t solve the problem, so all efforts should focus on making things parallelizable. It’s a problem I had with MIRI’s early work, and today we need to set ourselves up for much more empirical evidence. This could be a reason to support capabilities advances.
They argue there is some unknown point of capabilities at which the system explodes and we all die.
If those are the rules of the universe we happen to find ourselves in, though, there probably is no winning anyway. It’s sort of like how, if the laws of physics were slightly different, the first nuclear test would have ignited the atmosphere.
Were atmospheric gases fissionable, things would be very different.
It’s a very similar criticality argument. Early AGIs that try bad stuff may “quench” because the world doesn’t yet contain enough easily, remotely hackable nanoforges or fleets of armed killer robots ready to deploy. So they instead steal a few bitcoins, kill a few people, then get caught and shut down.
If instead the AGI finds an exploit that reaches criticality, then we all die. I am concerned the AGI might create a cult of personality or a religion and get support from large numbers of gullible humans. These humans, despite the AGI openly killing people and acting completely selfishly, might give it the resources to develop a way to kill us all.