I think the key story for wins from AI control specifically is a scenario where a lab has a human-level to slightly superhuman AI that isn’t aligned and wants to take over, but it turns out to be easier to control what affordances the AI is given than to align it, and in particular easier to catch the AI scheming than to make it aligned, and the lab wants to use AIs for alignment/control research.
I don’t see this as a probable scenario, but I do see it as a valuable one to work on, so it does have value in my eyes.
Imagine that there are just a few labs with powerful AI, all of which are responsible enough to use the AI control strategies that have been prepared for this situation, and none of which open-source their models. Now if they successfully use their AI for alignment, they will also be able to use it successfully for capabilities research. At some point control techniques will no longer be sufficient, and we have to hope that by then AI-aided alignment has succeeded enough to prevent bad outcomes. I don’t believe this is a serious possibility: the first AI capable of solving the alignment problem completely will also be able to deceive us about having solved it, and more easily; up to and including that point, AIs will produce partial, convincing solutions to the alignment problem which human engineers will go forward with. Control techniques simply set a floor on the capabilities of the first unaligned AI that escapes, which is plausibly a net negative, since it means we won’t get early high-impact warnings. If occasional AI escapes turn out to be non-lethal, economic incentives will favor better AI control, so working on this early won’t really matter. If occasional AI escapes turn out to be lethal, then we will die unless we solve the alignment problem ourselves.
One assumption that could defuse the danger is that we can apply the controlled AIs to massively improve computer security, such that defense wins the race against attacks on the computers.
This is at least semi-plausible for the first AIs, which will almost certainly be wildly superhuman at both coding and proving theorems in, for example, Lean: there is a very obvious path to getting the training data to bootstrap yourself, and, as with Go, it is relatively easy to verify that a solution actually works.
https://www.lesswrong.com/posts/2wxufQWK8rXcDGbyL/access-to-powerful-ai-might-make-computer-security-radically
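As a concrete illustration of the verification point, here is a minimal Lean sketch (a toy example, not tied to any particular lab’s setup): the theorem is trivial, but the relevant property is that the Lean kernel mechanically accepts or rejects the proof, so one can trust a machine-generated proof without trusting whatever produced it.

```lean
-- Toy example: however this proof was produced, the Lean kernel either
-- accepts it or rejects it, so verifying the result does not require
-- trusting the prover.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```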
Another assumption that could work for AI control is that once the AIs are controlled enough, labs start using the human-level AIs to further improve control/alignment strategies, and thus from that base case we can inductively argue that each next level of AI is even better controlled, until you reach a limit, and that the capabilities that are developed go toward making the AI safer, not more dangerous.
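To make the shape of that induction explicit, here is a toy sketch with made-up numbers (an assumption for illustration, not anyone’s actual model): each controlled generation strengthens the control applied to the next, and the argument only holds while capability growth stays within reach of that strengthened control.

```python
# Toy model of the inductive argument: control holds for generation n, and a
# controlled generation n is used to improve the control applied to n + 1.
# The numbers below are illustrative assumptions, not empirical claims.

def control_holds(capability: float, control_strength: float) -> bool:
    """Assume control works as long as capability does not exceed control strength."""
    return capability <= control_strength


def generations_controlled(capabilities: list[float], control_strength: float) -> int:
    """Count how many successive generations stay controlled under the inductive story."""
    controlled = 0
    for capability in capabilities:
        if not control_holds(capability, control_strength):
            break  # the induction fails at this generation
        control_strength += 0.5 * capability  # inductive step: controlled AI improves control
        controlled += 1
    return controlled


if __name__ == "__main__":
    gens = [1.0, 1.6, 2.5, 4.0, 6.5]  # hypothetical capability levels per generation
    print(generations_controlled(gens, control_strength=1.2))
```

Whether the induction goes through then depends entirely on whether the improvement per step keeps pace with the capability jump to the next generation.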
Improving computer security seems possible, but there are many other attack vectors. For instance, even if an AI can prove a system’s software is secure, it may choose to introduce social-engineering-style backdoors if it is not aligned. It’s true that controlled AIs can be used to harden society, but overall I don’t find that strategy comforting.
I’m not convinced that this induction argument goes through. I think it fails on the first generation that is smarter than humans, for basically Yudkowskian reasons.