I disagree with this take. A.I. control will only be important in a multipolar situation in which no single A.I. system can create a gray goo catastrophe or carry out some comparable pivotal act. But if such pivotal acts are impossible and no singular A.I. takes control, and instead many A.I.'s are competing, then some groups will develop better or worse control for economic reasons, and working on it now won't affect existential risk much. I have trouble seeing the situation where control matters: only a few players have A.G.I. for a very long time, none of their models escape or are open sourced, and yet none gain a decisive advantage?
I do see advantages to hardening important institutions against cyberattacks and increasing individual and group rationality so that humans remain agentic for as long as possible.
I think the key story for wins from AI control specifically is a scenario where a lab has human-level to slightly superhuman AI that isn't aligned and wants to take over, but it turns out that it's easier to control what affordances the AI is given than it is to align it, and in particular it's easier to catch an AI scheming than it is to make it aligned, and the lab wants to use its AIs for alignment/control research.
I don’t see this as a probable scenario, but I do see it as a valuable scenario to work on, so it does have value in my eyes.
Imagine that there are just a few labs with powerful A.I., all of which are responsible enough to use existing A.I. control strategies that have been prepared for this situation, and none of which open source their models. If they successfully use their A.I. for alignment, they will also be able to successfully use it for capabilities research. At some point, control techniques will no longer be sufficient, and we have to hope that by then A.I.-aided alignment has succeeded enough to prevent bad outcomes.

I don't believe this is a serious possibility. The first A.I. capable of completely solving the alignment problem will also be able to deceive us about having solved it, and it will find deception the easier of the two; up to and including that point, A.I. will produce partial but convincing solutions to the alignment problem, which human engineers will go forward with. Control techniques will simply set a threshold below the capabilities of the first unaligned A.I. that escapes, which is plausibly a net negative, since it means we won't get early, high-impact warnings. If occasional A.I. escapes turn out to be non-lethal, economic incentives will favor better A.I. control anyway, so working on this early won't really matter. If occasional A.I. escapes turn out to be lethal, then we will die unless we solve the alignment problem ourselves.
One assumption that could defuse the danger is that we can apply the controlled AIs to massively improve computer security, so that defending computers wins the race against attacking them.
This is at least semi-plausible for the first AIs, which will almost certainly be wildly superhuman at both coding and proving theorems in formal systems like Lean, because there's a very obvious path to generating the training data needed to bootstrap, and, as with Go, it is relatively easy to verify that a candidate solution actually works.
https://www.lesswrong.com/posts/2wxufQWK8rXcDGbyL/access-to-powerful-ai-might-make-computer-security-radically
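For readers who haven't used a proof assistant, here is a toy illustration of the verifiability point (my own example, not from the comment or the linked post): the Lean kernel either accepts a proof term or rejects it, so checking whether a candidate solution works is mechanical, unlike judging a free-form alignment proposal.

```lean
-- Toy Lean 4 theorem (hypothetical example): commutativity of natural-number addition.
-- The kernel mechanically checks the proof term; a persuasive-but-wrong argument
-- cannot slip through, which is the sense in which theorem-proving output is
-- "easy to verify" compared to open-ended research claims.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```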
Another assumption that could work for AI control is that once the AIs are controlled enough, labs start using the human-level AIs to further enhance control/alignment strategies. From that base case we can inductively argue that each next level of AI is even better controlled, until you reach a limit, and that the capability gains which do happen are directed at making the AI safer, not more dangerous.
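Spelled out, the induction being claimed looks roughly like this (my paraphrase of the comment above; the predicate is an informal assumption, not something anyone has defined precisely):

P(n): the generation-n AI can be kept adequately controlled.
Base case P(0): roughly human-level AI, kept controlled by human-designed measures.
Inductive step P(n) ⇒ P(n+1): a controlled generation-n AI is used to build control and alignment measures adequate for generation n+1.

The disagreement below is essentially about whether the inductive step survives the first generation that is meaningfully smarter than the humans doing the checking.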
Improving computer security seems possible, but there are many other attack vectors. For instance, even if an A.I. can prove a system's software is secure, it may choose to introduce social-engineering-style back doors if it is not aligned. It's true that controlled A.I.'s can be used to harden society, but overall I don't find that strategy comforting.
I’m not convinced that this induction argument goes through. I think it fails on the first generation that is smarter than humans, for basically Yudkowskian reasons.
Hard to be sure without more detail, but your comment gives me the impression that you haven't thought through the various branches of how AI and geopolitics might go in the next 10 years.
I, for one, am pretty sure AI control and powerful narrow AI tools will both be pretty key to humanity surviving the next 10 years. I don't expect us to have robustly solved ASI-alignment in that timeframe.
I also don’t expect us to have robustly solved ASI-alignment in that timeframe. I simply fail to see a history in which AI control work now is a decisive factor. If you insist on making a top level claim that I haven’t thought through the branches of how things go, I’d appreciate a more substantive description of the branch I am not considering.