If risks without a pause are low, and they don’t significantly decrease with a pause, then a pause makes things worse. If risks without a pause are high, but risks after a 20-year pause are much lower, then a pause is an improvement even for personal risk, for sufficiently young people.
Yes. Although you have 2 problems:
Why do you think a 20-year pause, or any pause, will change anything?
For example, you may know that the security of game consoles and iPhones keeps getting cracked.
AI control is similar in many ways to cybersecurity, in that you are trying to limit the AI’s access to functions that let it do bad things, and prevent it from seeing information that would allow control to fail. (Betrayal is a control failure; the model cannot betray in a coordinated way if it doesn’t somehow receive a message from other models that now is the time.)
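As a minimal sketch of what that function-gating looks like in practice (Python, with hypothetical tool names, not any particular framework):

```python
# Minimal sketch of the "limit access to functions" half of AI control
# (hypothetical tool names, not any real framework): the model only reaches
# tools through a gate that enforces an allowlist and logs every call.
ALLOWED_TOOLS = {"search_docs", "run_sandboxed_code"}  # no network, no deployment

audit_log = []  # every attempted call is recorded for human review

def call_tool(tool_name: str, argument: str) -> str:
    """Dispatch a model-requested tool call, refusing anything off the allowlist."""
    audit_log.append((tool_name, argument))
    if tool_name not in ALLOWED_TOOLS:
        return "refused: tool not on allowlist"
    # ... a real system would invoke the tool here ...
    return f"ok: {tool_name}({argument!r})"

print(call_tool("search_docs", "treaty tonnage limits"))        # allowed
print(call_tool("send_email", "coordinate with other models"))  # refused
```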
Are you going to secure your iPhones and consoles by researching cybersecurity for 20 years and then deploying the next generation? Or do you do the best you can with information from the last failures and try again? Each time, you try to limit the damage; with game consoles, for example, increasingly sophisticated strategies have evolved to encourage users to purchase access to games rather than pirate them.
With AI, you probably won’t learn anything during a pause that will help you. We know this from experience, because on paper, securing the products I mentioned is trivially easy: sign everything with a key server, don’t run unsigned code, and check that the key is valid in thoroughly verified, privileged code.
Note that no one who breaks game consoles or iPhones does it by cracking the cryptography directly; public/private key crypto is still unbroken.
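For reference, the on-paper scheme really is only a few lines. A minimal sketch, assuming the third-party `cryptography` package and Ed25519 signatures (illustrative, not any console’s actual implementation); real breaks happen around code like this, not inside it:

```python
# The on-paper code-signing scheme in a few lines, using the third-party
# `cryptography` package (Ed25519 signatures). The math below is not what
# attackers break; exploits go after key handling, unverified boot stages,
# or the hardware itself.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

signing_key = Ed25519PrivateKey.generate()   # lives on the vendor's key server
verify_key = signing_key.public_key()        # baked into every device

firmware = b"firmware image v1.2"
signature = signing_key.sign(firmware)       # produced once, at release time

def boot(image: bytes, sig: bytes) -> bool:
    """Run an image only if its signature verifies against the vendor key."""
    try:
        verify_key.verify(sig, image)        # raises InvalidSignature if tampered
    except InvalidSignature:
        print("unsigned or modified code: refusing to boot")
        return False
    print("signature ok: booting")
    return True

boot(firmware, signature)                    # boots
boot(firmware + b" (patched)", signature)    # refused
```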
Similarly, I would expect that very early on, human engineers will develop an on-paper impervious method of AI control. You can write one yourself; it’s not difficult. But like everything else, it will fail in implementation...
You know the cost of a 20-year pause: just add up the body bags, i.e. 20 years of deaths worldwide from aging. More than a billion people.
You don’t necessarily have a good case that the benefit of the pause is saving the lives of all humans, because even the guess that the problem will benefit from a pause is speculation. It’s not speculation to say the cost is more than a billion lives: younger human bodies keep people alive with high probability until AI eventually discovers a fix for aging. Even if that takes 100 years, it will arrive at +100 years, not +120 years.
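A rough sanity check on that figure, using assumed round numbers rather than precise demographic data:

```python
# Back-of-the-envelope check on the "more than a billion" figure.
# Assumed round numbers, not a demographic model: worldwide deaths are
# currently on the order of 60 million per year, the large majority of
# them from age-related causes.
deaths_per_year = 60_000_000
years_of_pause = 20
print(f"{deaths_per_year * years_of_pause:,} deaths over the pause")  # 1,200,000,000
```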
Historically, the lesson is that all of the parties “agreeing” to the pause are going to betray it, and whoever betrays more gains an advantage. The US Navy lost an aircraft carrier in WW2 that a larger carrier would have survived, because they had honored the treaty tonnage limits.
Hypotheticals disentangle models from values. A pause is not a policy, not an attempt at a pause that might fail; it’s the actual pause, the hypothetical. We can then look at the various hypotheticals, ask what happens in each, and see which one is better. Hopefully our values can handle the strain of out-of-distribution evaluation and don’t collapse into the incoherence of goodharting, unable to say anything relevant about situations that our models consider impossible in actual reality.
In the hypothetical of a 100-year pause, the pause actually happens, even if this is impossible in actual reality. One of the things within that hypothetical is the death of 4 generations of humans. Another is the AGI that gets built at the end. In your model, that AGI is no safer than the one that we build without the magic hypothetical of the pause. In my model, that AGI is significantly safer. A safer AGI translates into more value for the whole future, which is much longer than the current age of the universe. And an unsafe AGI now is less than helpful to those 4 generations.
AI control is similar in many ways to cybersecurity, in that you are trying to limit the AI’s access to functions that let it do bad things, and prevent it from seeing information that would allow control to fail.
That’s the point of AI alignment as distinct from AI control. Your model says the distinction doesn’t work. My model says it does. Therefore my model endorses the hypothetical of a pause.
Having endorsed a hypothetical, I can start paying attention to ways of moving reality in its direction. But that is distinct from a judgement about what the hypothetical entails.
How would you know that any method of alignment works without building AGIs of increasing capability, or child AGIs that are supposed to inherit the aligned property, to test it on?
One of the reasons I gave current cybersecurity as an example is that public/private key signing is correct; nobody has broken the longer keys. Yet if you spent 20 or 100 years proving it correct and then deployed it to software using present techniques, you would get hacked immediately. Implementation is hard, and it is the majority of the difficulty.
Assuming AI alignment can be paper-solved in this way, I see it as the same situation: it will fail in ways you won’t know about until you try it for real.
Consider an indefinite moratorium on AGI that awaits better tools that make building it a good idea rather than a bad idea. If there were a magic button that rewrote the laws of nature to make this happen, would it be a good idea to press it? My point is that we both endorse pressing this button; the only difference is that your model says that building an AGI immediately is a good idea, and so the moratorium should end immediately. My model disagrees. This particular disagreement is not about the generations of people who forgo access to potential technology (where there is no disagreement), and it’s not about the feasibility of the magic button (which is a separate disagreement). It’s about how this technology works, what works in influencing its design and deployment, and the effect it has on the world once deployed.
The crux of that disagreement seems to be about the importance of preparation in advance of doing a thing, compared to the process of actually doing the thing in the real world. A pause enables extensive preparation for building an AGI, and the high serial speed of thought of AGIs enables AGIs to prepare extensively for acting on the world. If such preparation doesn’t give a decisive advantage, a pause doesn’t help, and AGIs don’t rewrite reality in a year once deployed. If it does give a decisive advantage, a pause helps significantly, and a fast-thinking AGI shortly gains the affordance of overwriting humanity with whatever it plans to enact.
I see preparation as raising generations of giants to stand on the shoulders of, which in time changes the character of the practical projects that would be attempted, and the details we pay attention to as we carry out such projects. Yes, cryptography isn’t sufficient to make systems secure, but the absence of cryptography certainly makes them less secure, as does attempting to design cryptographic algorithms without taking the time to get good at it. This is the kind of preparation that makes a difference. Noticing that superintelligence doesn’t imply supermorality, and that alignment is a concern at all, is an important development. Appreciating goodharting and corrigibility changes which safety properties of AIs appear important when looking into more practical designs that don’t necessarily originate from these considerations. Deceptive alignment is a useful concern to keep in mind, even if in the end it turns out that practical systems don’t have that problem. Experiments on GPT-2 sized systems still have a whole lot to teach us about interpretable and steerable architectures.
Without AGI interrupting this process, the kinds of things that people would attempt in order to build an AGI would be very different 20 years from now, and different yet again in 40, 60, 80, and 100 years. I expect some accumulated wisdom to steer such projects in better and better directions, even if the resulting implementation details remain sufficiently messy and make the resulting systems moderately unsafe, with some asymptote of safety where the aging generations make it worthwhile to forgo additional preparation.
Do any examples of preparation over an extended length of time exist in human history? I would suspect they do not, for the simple reason that preparation in advance of a need you don’t have has no ROI.

Basic science and pure mathematics enable their own subsequent iterations without having them as explicit targets, or even without being able to imagine those developments, while doing the work crucial to making them possible.
Extensive preparation has never happened for a thing that is ready to be attempted experimentally, because in those cases we just do the experiments; there is no reason not to. With AGI, the reason not to do this is the unbounded blast radius of a failure, an unprecedented problem. Unprecedented things are less plausible, but unfortunately this one can’t be expected to have happened before, because then you would no longer be here to update on the observation.
If the blast radius is not unbounded, if most failures can be contained, then it’s more reasonable to attempt to develop AGI in the usual way, without extensive preparation that doesn’t involve actually attempting to build it. If preparation in general doesn’t help, it doesn’t help AGIs either, making them less dangerous and reducing the scope of failure, and so preparation for building them is not as needed. If preparation does help, it also helps AGIs, and so preparation is needed.
If the blast radius is not unbounded, if most failures can be contained, then it’s more reasonable to attempt to develop AGI in the usual way, without extensive preparation that doesn’t involve actually attempting to build it
Is it true or not true that there is no evidence of an “unbounded” blast radius for any AI model anyone has trained? I am not aware of any such evidence.
What would constitute evidence that the situation was now in the “unbounded” failure case? How would you prove it?
So that we don’t end up in a loop, assume someone has demonstrated a major danger with current AI models. Assume there is a really obvious method of control that will contain the problem. Now what? It seems to me that the next step would be to restrict AI development similarly to how cobalt-60 sources are restricted, where only institutions with licenses, inspections, and methods of control can handle the stuff, but that’s still not a pause...
When could you ever reach a situation where a stronger control mechanism won’t work?
I try to imagine it, and I can imagine more and more layers of defense—“don’t read anything the model wrote”, “more firewalls, more isolation, servers in a salt mine”—but never a point where you couldn’t agree it was under control. It’s like a radioactive source: if you make it more radioactive, you just add more inches of shielding until the dose is acceptable.
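The shielding intuition can be made quantitative under an idealized exponential-attenuation model (a sketch, ignoring buildup factors):

$$D(x) = D_0 \, e^{-\mu x}, \qquad \Delta x = \frac{\ln k}{\mu}$$

so matching a source that is k times stronger takes only ln(k)/μ of extra shielding. With a tenth-value layer of roughly 4 cm of lead for cobalt-60 gammas, even a million-fold stronger source needs only about 6 × 4 = 24 extra centimeters of lead.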
The blast radius of AGIs is unbounded in the same way as that of humanity: there is potential for taking over all of the future. There are many ways of containing it, and alignment is a way of making the blast a good thing. The point is that a sufficiently catastrophic failure that doesn’t involve containing the blast is unusually impactful. Arguments about the ease of containing the blast are separate from this point as I intended it.
If you don’t expect AGIs to become overwhelmingly powerful faster than they are made robustly aligned, containing the blast takes care of itself right until it becomes unnecessary. But with the opposite expectation, containing becomes both necessary (since early AGIs are not yet robustly aligned) and infeasible (since early AGIs are very powerful). So there’s a question of which expectation is correct, but the consequences of either position seem to straightforwardly follow.