My impression is that much more effort is being put into alignment than into containment, and that containment is treated as impossible while alignment is treated as merely very difficult.
Is this accurate? If so, why?
By containment I mean mostly hardware-coded strategies of limiting the compute and/or world-influence an AGI has access to.
It’s similar to alignment in that the most immediate obvious solutions (“box!”) won’t work, but more complex solutions may. A common objection is that an AI will learn the structure of the protection from the humans who built it and work around it, but it’s not inconceivable to design a protection whose structure can’t be extracted from a human.
Advantages I see to devoting effort/money to containment solutions over alignment:
Different solutions can be layered: the AI needs to break through all of the orthogonal layers, while we only need one layer to hold (see the toy calculation after this list).
Different fields of expertise can contribute solutions, which makes orthogonality easier to achieve.
It’s easier to convince AI developers (including foreign nations) to add specific safeguards to hardware than to convince them to “stop developing until we figure out alignment”.
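As a rough illustration of the layering point (my own toy model, not something from the original discussion): if layer i holds with probability p_i and the layers fail independently, containment as a whole fails only when every layer fails, so even individually weak layers compound. The independence assumption is exactly what “orthogonal” is doing in the first bullet, and correlated failure modes would weaken the argument considerably.

```python
# Toy sketch (my own illustration, not from the post): probability that
# layered containment holds, assuming the layers fail *independently*.
def p_containment_holds(layer_probs):
    """Probability that at least one of the independent layers holds."""
    p_all_fail = 1.0
    for p in layer_probs:
        p_all_fail *= 1.0 - p
    return 1.0 - p_all_fail

# Three individually unimpressive layers (60%, 50%, 40%) still give ~88%.
print(p_containment_holds([0.6, 0.5, 0.4]))  # 0.88
```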
Where does the community stand on containment strategies and why?
There’s also the problem that the more contained an AGI is, the less useful it is. The maximally safe AGI would be one which couldn’t communicate or interact with us in any way, but what would be the point of building it? If people have built an AGI, it’s because they want it to do something for them.

From Disjunctive Scenarios of Catastrophic AGI Risk:
AI confinement assumes that the people building it, and the people that they are responsible to, are all motivated to actually keep the AI confined. If a group of cautious researchers builds and successfully contains their AI, this may be of limited benefit if another group later builds an AI that is intentionally set free. Reasons for releasing an AI may include (i) economic benefit or competitive pressure, (ii) ethical or philosophical reasons, (iii) confidence in the AI’s safety, and (iv) desperate circumstances such as being otherwise close to death. We will discuss each in turn below.
Voluntarily Released for Economic Benefit or Competitive Pressure
As discussed above under “power gradually shifting to AIs,” there is an economic incentive to deploy AI systems in control of corporations. This can happen in two forms: either by expanding the amount of control that already-existing systems have, or by upgrading existing systems or adding new ones with previously-unseen capabilities. These two forms can blend into each other. If humans previously carried out some functions which are then given over to an upgraded AI that has recently become capable of doing them, this can increase the AI’s autonomy both by making it more powerful and by reducing the number of humans that were previously in the loop.
As a partial example, the U.S. military is seeking to eventually transition to a state where the human operators of robot weapons are “on the loop” rather than “in the loop” (Wallach & Allen 2013). In other words, whereas a human was previously required to explicitly give the order before a robot was allowed to initiate possibly lethal activity, in the future humans are meant to merely supervise the robot’s actions and intervene if something goes wrong. While this would allow the system to react faster, it would also limit the window that the human operators have for overriding any mistakes that the system makes. For a number of military systems, such as automatic weapons defense systems designed to shoot down incoming missiles and rockets, the extent of human oversight is already limited to accepting or overriding a computer’s plan of action in a matter of seconds, which may be too little time to make a meaningful decision in practice (Human Rights Watch 2012).
Sparrow (2016) reviews three major reasons which incentivize major governments to move toward autonomous weapon systems and reduce human control:
1. Currently existing remotely piloted military “drones,” such as the U.S. Predator and Reaper, require a large amount of communications bandwidth. This limits the number of drones that can be fielded at once, and makes them dependent on communications satellites which not every nation has, and which can be jammed or targeted by enemies. The need to be in constant communication with remote operators also makes it impossible to create drone submarines, which need to maintain a communications blackout before and during combat. Making the drones autonomous and capable of acting without human supervision would avoid all of these problems.
2. Particularly in air-to-air combat, victory may depend on making very quick decisions. Current air combat is already pushing against the limits of what the human nervous system can handle: further progress may be dependent on removing humans from the loop entirely.
3. Much of the routine operation of drones is very monotonous and boring, which is a major contributor to accidents. The training expenses, salaries, and benefits of the drone operators are also a major cost for the militaries employing them.
Sparrow’s arguments are specific to the military domain, but they demonstrate the argument that “any broad domain involving high stakes, adversarial decision making, and a need to act rapidly is likely to become increasingly dominated by autonomous systems” (Sotala & Yampolskiy 2015, p. 18). Similar arguments can be made in the business domain: eliminating human employees to reduce costs from mistakes and salaries is something that companies would also be incentivized to do, and making a profit in the field of high-frequency trading already depends on outperforming other traders by fractions of a second. While the currently existing AI systems are not powerful enough to cause global catastrophe, incentives such as these might drive an upgrading of their capabilities that eventually brought them to that point.
In the absence of sufficient regulation, there could be a “race to the bottom of human control” in which state or business actors compete to reduce human control and increase the autonomy of their AI systems in order to obtain an edge over their competitors (see also Armstrong et al. 2016 for a simplified “race to the precipice” scenario). This would be analogous to the “race to the bottom” in current politics, where government actors compete to deregulate or to lower taxes in order to retain or attract businesses.
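As a very loose sketch of the kind of dynamic meant here (my own toy construction, not the actual “race to the precipice” model of Armstrong et al. 2016): if dropping human oversight means faster reactions and a better chance of winning the competition, but also a higher chance of a costly malfunction, then each actor can be individually pushed toward full autonomy even though everyone would prefer the world where every actor keeps a human in the loop.

```python
# Toy illustration (my own, not the Armstrong et al. 2016 model): two
# competitors each choose whether to keep a human in the loop. Dropping
# the human reacts faster (wins the contest) but malfunctions more often.
CHOICES = ("human_in_loop", "full_autonomy")
P_MALFUNCTION = {"human_in_loop": 0.01, "full_autonomy": 0.20}
WIN_VALUE = 10.0          # value of out-reacting the competitor
MALFUNCTION_COST = 5.0    # expected cost of a serious malfunction

def payoff(mine: str, theirs: str) -> float:
    """Expected payoff of my choice against the competitor's choice."""
    if mine == theirs:
        p_win = 0.5                                   # evenly matched
    else:
        p_win = 1.0 if mine == "full_autonomy" else 0.0
    return p_win * WIN_VALUE - P_MALFUNCTION[mine] * MALFUNCTION_COST

for theirs in CHOICES:
    best = max(CHOICES, key=lambda mine: payoff(mine, theirs))
    print(f"best reply to {theirs}: {best}")
# full_autonomy is the best reply either way, even though mutual
# full_autonomy (4.0 each) pays worse than mutual human_in_loop (4.95 each).
```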
AI systems being given more power and autonomy might be limited by the fact that doing this poses large risks for the actor if the AI malfunctions. In business, this limits the extent to which major, established companies might adopt AI-based control, but incentivizes startups to try to invest in autonomous AI in order to outcompete the established players. In the field of algorithmic trading, AI systems are currently trusted with enormous sums of money despite the potential to make corresponding losses—in 2012, Knight Capital lost $440 million due to a glitch in their trading software (Popper 2012, Securities and Exchange Commission 2013). This suggests that even if a malfunctioning AI could potentially cause major risks, some companies will still be inclined to invest in placing their business under autonomous AI control if the potential profit is large enough.
U.S. law already allows for the possibility of AIs being conferred a legal personality, by putting them in charge of a limited liability company. A human may register a limited liability corporation (LLC), enter into an operating agreement specifying that the LLC will take actions as determined by the AI, and then withdraw from the LLC (Bayern 2015). The result is an autonomously acting legal personality with no human supervision or control. AI-controlled companies can also be created in various non-U.S. jurisdictions; restrictions such as ones forbidding corporations from having no owners can largely be circumvented by tricks such as having networks of corporations that own each other (LoPucki 2017). A possible start-up strategy would be for someone to develop a number of AI systems, give them some initial endowment of resources, and then set them off in control of their own corporations. This would risk only the initial resources, while promising whatever profits the corporation might earn if successful. To the extent that AI-controlled companies were successful in undermining more established companies, they would pressure those companies to transfer control to autonomous AI systems as well.
Voluntarily Released for Purposes of Criminal Profit or Terrorism
LoPucki (2017) argues that if a human creates an autonomous agent with a general goal such as “optimizing profit,” and that agent then independently decides to, for example, commit a crime for the sake of achieving the goal, prosecutors may be unable to convict the human for the crime and can at most prosecute them for the lesser charge of reckless initiation. LoPucki holds that this “accountability gap,” among other reasons, ensures that humans will create AI-run corporations.
Furthermore, LoPucki (2017, p. 16) holds that such “algorithmic entities” could be created anonymously, and that their having a legal personality would give them a number of legal rights, such as being able to “buy and lease real property, contract with legitimate businesses, open a bank account, sue to enforce its rights, or buy stuff on Amazon and have it shipped.” If an algorithmic entity were created for a purpose such as funding or carrying out acts of terrorism, it would be free from social pressure or threats to human controllers:
In deciding to attempt a coup, bomb a restaurant, or assemble an armed group to attack a shopping center, a human-controlled entity puts the lives of its human controllers at risk. The same decisions on behalf of an AE risk nothing but the resources the AE spends in planning and execution (LoPucki 2017, p. 18).
While most terrorist groups would stop short of intentionally destroying the world, thus posing at most a catastrophic risk, not all of them necessarily would. In particular, ecoterrorists who believe that humanity is a net harm to the planet, and religious terrorists who believe that the world needs to be destroyed in order to be saved, could have an interest in causing human extinction (Torres 2016, 2017, Chapter 4).
Voluntarily Released for Aesthetic, Ethical, or Philosophical Reasons
A few thinkers (such as Gunkel 2012) have raised the question of moral rights for machines, and not everyone necessarily agrees on AI confinement being ethically acceptable. The designer of a sophisticated AI might come to view it as something like their child, and feel that it deserved the right to act autonomously in society, free of any external constraints.
Voluntarily Released due to Confidence in the AI’s Safety
For a research team to keep an AI confined, they need to take seriously the possibility of it being dangerous. Current AI research doesn’t involve any confinement safeguards, as the researchers reasonably believe that their systems are nowhere near general intelligence yet. Many systems are also connected directly to the Internet. Hopefully, safeguards will begin to be implemented once the researchers feel that their system might start having more general capability, but this will depend on the safety culture of the AI research community in general (Baum 2016), and the specific research group in particular. If a research group mistakenly believed that their AI could not achieve dangerous levels of capability, they might not deploy sufficient safeguards for keeping it contained.
In addition to believing that the AI is insufficiently capable of being a threat, the researchers may also (correctly or incorrectly) believe that they have succeeded in making the AI aligned with human values, so that it will not have any motivation to harm humans.
Voluntarily Released due to Desperation
Miller (2012) points out that if a person were close to death, whether due to natural causes, being on the losing side of a war, or any other reason, they might set even a potentially dangerous AGI system free. This would be a rational course of action as long as they primarily valued their own survival and thought that even a small chance of the AGI saving their life was better than near-certain death.
The AI Remains Contained, But Ends Up Effectively in Control Anyway
Even if humans were technically kept in the loop, they might not have the time, opportunity, motivation, intelligence, or confidence to verify the advice given by an AI. This would particularly be the case after the AI had functioned for a while, and established a reputation as trustworthy. It may become common practice to act automatically on the AI’s recommendations, and it may become increasingly difficult to challenge the “authority” of the recommendations. Eventually, the AI may in effect begin to dictate decisions (Friedman & Kahn 1992).
Likewise, Bostrom and Yudkowsky (2014) point out that modern bureaucrats often follow established procedures to the letter, rather than exercising their own judgment and allowing themselves to be blamed for any mistakes that follow. Dutifully following all the recommendations of an AI system would be another way of avoiding blame.
O’Neil (2016) documents a number of situations in which modern-day machine learning is used to make substantive decisions, even though the exact models behind those decisions may be trade secrets or otherwise hidden from outside critique. Among other examples, such models have been used to fire school teachers whom the systems classified as underperforming, and to give harsher sentences to criminals whom a model predicted to have a high risk of reoffending. In some cases, people have been skeptical of the systems’ results, and have even identified plausible reasons why those results might be wrong, but have still gone along with the systems’ authority as long as it could not be definitively shown that the models were erroneous.
In the military domain, Wallach & Allen (2013) note the existence of robots which attempt to automatically detect the locations of hostile snipers and point them out to soldiers. To the extent that these soldiers have come to trust the robots, they could be seen as carrying out the robots’ orders. Eventually, equipping the robots with their own weapons would merely dispense with the formality of needing a human to pull the trigger.
I believe the general argument is this:

If an AGI is smarter than you, it will think of ways to escape containment that you can’t think of. Therefore, it’s unreasonable to expect us to be able to contain a sufficiently intelligent AI even if the containment seems foolproof to us. One solution to this would be to make the AI not want to escape containment, but if you’ve solved that, you’ve solved a massive part of the alignment problem already.
Doesn’t the exact same argument work for alignment though? “It’s so different, it may be misaligned in ways you can’t think of”. Why is it treated as a solvable challenge for alignment and an impossibility for containment? Is the guiding principle that people do expect a foolproof alignment solution to be within our reach?
One difference is that the AI wants to escape containment by default, almost by definition, but has no default preference for any particular goal function. Still, since the space of possible goals is huge (i.e. “human-compatible goals are measure 0” within it), I think the general approach is to assume the AI is ‘misaligned by default’ as well.
I guess the crux is that I find it hard to imagine an alignment solution being qualitatively foolproof in a way that containment solutions can’t be, and I feel like we’re better off layering our imperfect solutions to both in order to maximize our chances, rather than trying to “solve” AI risk once and for all. I’d love to say that a proof could convince me, but I can imagine myself being equally convinced by a supposedly foolproof alignment solution and a supposedly foolproof containment solution, while an AI infinitely smarter than me ignores both. So I don’t even know how to update here.
The main difference that I see is that containment supposes you’re actively opposed to the AGI in some fashion: the AGI wants to get out, and you don’t want to let it. Winning such a contest against a superintelligence is believed by many to be impossible. Thus, the idea is that if an AGI is unaligned, containment won’t work, and if an AGI is aligned, containment is unnecessary.
By contrast, alignment means you’re not opposed to the AGI—you want what the AGI wants. This is a very difficult problem to achieve, but doesn’t rely on actually outwitting a superintelligence.
I agree that it’s hard to imagine what a foolproof alignment solution would even look like—that’s one of the difficulties of the problem.