How would you know that any method of alignment works without AGIs of increasing capability, or child AGIs that are supposed to inherit the aligned property, to test it against?
One of the reasons I gave current cybersecurity as an example is that public/private key signing is mathematically correct: nobody has broken the longer keys. Yet if you spent 20 or 100 years proving it correct and then deployed it in software built with present techniques, you would get hacked immediately. Implementation is hard, and it is the majority of the difficulty.
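To make that concrete, here is a minimal sketch of what I mean (my own toy illustration, not anything from this discussion; the update functions are made up, though the Ed25519 calls are the Python cryptography package’s real API). The math holds up fine, and a careless verifier loses anyway:

```python
# Illustrative only: unbroken signatures, defeated by a sloppy implementation.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

def apply_update_carefully(blob: bytes, signature: bytes) -> bool:
    """Correct use: refuse the update unless the signature verifies."""
    try:
        public_key.verify(signature, blob)
        return True
    except InvalidSignature:
        return False

def apply_update_sloppily(blob: bytes, signature: bytes) -> bool:
    """Classic implementation bug: verification failure is swallowed along
    with every other error, so a forged update is accepted anyway."""
    try:
        public_key.verify(signature, blob)
    except Exception:
        pass  # "log and continue" -- the correctness of the math never mattered
    return True

update = b"legitimate firmware"
forged = b"attacker firmware"
good_sig = private_key.sign(update)

assert apply_update_carefully(update, good_sig)
assert not apply_update_carefully(forged, good_sig)   # the crypto does its job
assert apply_update_sloppily(forged, good_sig)        # the implementation gives it away
```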
Assuming AI alignment can be solved on paper in this way, I see it as the same situation: it will fail in ways you won’t know about until you try it for real.
Consider an indefinite moratorium on AGI that awaits better tools that make building it a good idea rather than a bad idea. If there were a magic button that rewrote the laws of nature to make this happen, would it be a good idea to press it? My point is that we both endorse pressing this button; the only difference is that your model says building an AGI immediately is a good idea, so the moratorium should end immediately, while my model disagrees. This particular disagreement is not about the generations of people who forgo access to a potential technology (where there is no disagreement), and it’s not about the feasibility of the magic button (which is a separate disagreement). It’s about how this technology works, what works in influencing its design and deployment, and the effect it has on the world once deployed.
The crux of that disagreement seems to be the importance of preparation in advance of doing a thing, compared to the process of actually doing the thing in the real world. A pause enables extensive preparation for building an AGI, and the high serial speed of thought of AGIs gives AGIs extensive preparation for acting on the world. If such preparation doesn’t give a decisive advantage, a pause doesn’t help, and AGIs don’t rewrite reality in a year once deployed. If it does give a decisive advantage, a pause helps significantly, and a fast-thinking AGI shortly gains the affordance of overwriting humanity with whatever it plans to enact.
I see preparation as raising generations of giants to stand on the shoulders of, which in time changes the character of the practical projects that would be attempted and the details we pay attention to as we carry them out. Yes, cryptography isn’t sufficient to make systems secure, but the absence of cryptography certainly makes them less secure, as does attempting to design cryptographic algorithms without taking the time to get good at it. This is the kind of preparation that makes a difference. Noticing that superintelligence doesn’t imply supermorality, and that alignment is a concern at all, is an important development. Appreciating Goodharting and corrigibility changes which safety properties of AIs appear important when looking into more practical designs, even ones that don’t originate from these considerations. Deceptive alignment is a useful concern to keep in mind, even if it turns out in the end that practical systems don’t have that problem. Experiments on GPT-2 sized systems still have a whole lot to teach us about interpretable and steerable architectures.
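As a concrete illustration of the kind of experiment I mean (a hypothetical sketch, not a result from this discussion; the layer index, prompts, and steering scale are arbitrary choices, and the Hugging Face/PyTorch calls are standard), one can pull a crude steering direction out of GPT-2 small and add it back in during generation:

```python
# Minimal activation-addition steering sketch on GPT-2 small.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6  # arbitrary middle block of the 12-block model

def residual_at(prompt: str) -> torch.Tensor:
    """Capture the residual-stream output of block LAYER for a prompt."""
    captured = {}
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["h"] = hidden.detach()
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return captured["h"]

# A crude "steering vector": difference of mean activations on contrasting prompts.
direction = residual_at("I love this").mean(dim=1) - residual_at("I hate this").mean(dim=1)

def generate_steered(prompt: str, scale: float) -> str:
    """Generate text while adding scale * direction to block LAYER's output."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + scale * direction,) + output[1:]
        return output + scale * direction
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=20, do_sample=False)
    handle.remove()
    return tokenizer.decode(out[0])

print(generate_steered("The movie was", scale=0.0))  # baseline completion
print(generate_steered("The movie was", scale=4.0))  # nudged along the "love" direction
```

Nothing in a toy like this settles any of the big questions, but it is the sort of cheap, contained experiment that builds up intuitions about how steerable these architectures actually are.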
Without AGI interrupting this process, the kinds of things people would attempt in order to build an AGI would be very different 20 years from now, and different again in 40, 60, 80, and 100 years. I expect some accumulated wisdom to steer such projects in better and better directions, even if the implementation details remain messy enough to leave the resulting systems moderately unsafe, with some asymptote of safety past which the cost to the aging generations makes it worthwhile to forgo additional preparation.
Do any examples of preparation over an extended length of time exist in human history?
I would suspect they do not, for the simple reason that preparation in advance of a need you don’t yet have has no ROI.
Basic science and pure mathematics enable their own subsequent iterations without having them as explicit targets, or even being able to imagine those developments, while still doing the work crucial to making them possible.
Extensive preparation has never happened for a thing that was ready to be attempted experimentally, because in those cases we just do the experiments; there is no reason not to. With AGI, the reason not to is the unbounded blast radius of a failure, an unprecedented problem. Unprecedented things are less plausible, but unfortunately this one can’t be expected to have happened before, because then you would no longer be here to update on the observation.
If the blast radius is not unbounded, if most failures can be contained, then it’s more reasonable to attempt to develop AGI in the usual way, without extensive preparation that doesn’t involve actually attempting to build it. If preparation in general doesn’t help, it doesn’t help AGIs either, making them less dangerous and reducing the scope of failure, and so preparation for building them is not as needed. If preparation does help, it also helps AGIs, and so preparation is needed.
Is it true or not true that there is no evidence of an “unbounded” blast radius for any AI model anyone has trained? I am not aware of any such evidence.
What would constitute evidence that the situation was now in the “unbounded” failure case? How would you prove it?
So that we don’t end up in a loop, assume someone has demonstrated a major danger with current AI models, and assume there is a really obvious method of control that will contain the problem. Now what? It seems to me that the next step would be to restrict AI development in a similar way to how cobalt-60 sources are restricted, where only institutions with licenses, inspections, and methods of control can handle the stuff, but that’s still not a pause...
When could you ever reach a situation where a stronger control mechanism won’t work?
I try to imagine it, and I can imagine more and more layers of defense (“don’t read anything the model wrote”, “more firewalls, more isolation, servers in a salt mine”), but never a point where you couldn’t agree it was under control. It’s like a radioactive source: if you make it more radioactive, you just add more inches of shielding until the dose is acceptable.
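Rough arithmetic makes the point (illustrative numbers only: the half-value layer of roughly 1.2 cm of lead for Co-60 gammas is a standard textbook figure, and the dose ratios below are made up). Required shielding grows only logarithmically with source strength:

```python
import math

HVL_CM = 1.2  # approximate half-value layer of lead for Co-60 gammas

def lead_needed(unshielded_dose: float, dose_limit: float) -> float:
    """Centimetres of lead to bring an unshielded dose rate under a limit,
    assuming simple exponential attenuation (ignoring buildup factors)."""
    if unshielded_dose <= dose_limit:
        return 0.0
    return HVL_CM * math.log2(unshielded_dose / dose_limit)

print(lead_needed(1e3, 1.0))  # ~12 cm for a source 1,000x over the limit
print(lead_needed(1e6, 1.0))  # ~24 cm for a source 1,000,000x over the limit
```

A thousand-fold stronger source only costs you about ten more half-value layers, which is the intuition behind “just add more inches of shielding.”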
The blast radius of AGIs is unbounded in the same way that humanity’s is: there is potential for taking over all of the future. There are many ways of containing it, and alignment is a way of making the blast a good thing. The point is that a sufficiently catastrophic failure in which the blast is not contained is unusually impactful. Arguments about the ease of containing the blast are separate from this point as I intended it.
If you don’t expect AGIs to become overwhelmingly powerful faster than they are made robustly aligned, containing the blast takes care of itself right up until it becomes unnecessary. But with the opposite expectation, containment becomes both necessary (since early AGIs are not yet robustly aligned) and infeasible (since early AGIs are very powerful). So there’s a question of which expectation is correct, but the consequences of either position seem to follow straightforwardly.