AGI safety from first principles: Control
It’s important to note that my previous arguments by themselves do not imply that AGIs will end up in control of the world instead of us. As an analogy, scientific knowledge allows us to be much more capable than stone-age humans. Yet if dropped back in that time with just our current knowledge, I very much doubt that one modern human could take over the stone-age world. Rather, this last step of the argument relies on additional predictions about the dynamics of the transition from humans being the smartest agents on Earth to AGIs taking over that role. These will depend on technological, economic and political factors, as I’ll discuss in this section. One recurring theme will be the importance of our expectation that AGIs will be deployed as software that can be run on many different computers, rather than being tied to a specific piece of hardware as humans are.[1]
I’ll start off by discussing two very high-level arguments. The first is that being more generally intelligent allows you to acquire more power, via large-scale coordination and development of novel technological capabilities. Both of these contributed to the human species taking control of the world; and they both contributed to other big shifts in the distribution of power (such as the industrial revolution). If the set of all humans and aligned AGIs is much less capable in these two ways than the set of all misaligned AGIs, then we should expect the latter to develop more novel technologies, and use them to amass more resources, unless strong constraints are placed on them, or they’re unable to coordinate well (I’ll discuss both possibilities shortly).
On the other hand, though, it’s also very hard to take over the world. In particular, if people in power see their positions being eroded, it’s generally a safe bet that they’ll take action to prevent that. Further, it’s always much easier to understand and reason about a problem when it’s more concrete and tangible; our track record at predicting large-scale future developments is pretty bad. So even if the high-level arguments laid out above seem difficult to rebut, there may well be solutions we’ve missed, which people will spot once their incentives to look for them, and the range of approaches available to them, become clearer.
How can we move beyond these high-level arguments? In the rest of this section I’ll lay out two types of disaster scenarios, and then four factors which will affect our ability to remain in control if we develop AGIs that are not fully aligned:
Speed of AI development
Transparency of AI systems
Constrained deployment strategies
Human political and economic coordination
Disaster scenarios
There have been a number of attempts to describe the catastrophic outcomes that might arise from misaligned superintelligences, although it has proven difficult to characterise them in detail. Broadly speaking, the most compelling scenarios fall into two categories. Christiano describes AGIs gaining influence within our current economic and political systems by taking or being given control of companies and institutions. Eventually “we reach the point where we could not recover from a correlated automation failure”—after which those AGIs are no longer incentivised to follow human laws. Hanson also lays out a scenario in which virtual minds come to dominate the economy (although he is less worried about misalignment, partly because he focuses on emulated human minds). In both scenarios, biological humans lose influence because they are less competitive at strategically important tasks, but no single AGI is able to seize control of the world. To some extent these scenarios are analogous to our current situation, in which large corporations and institutions are able to amass power even when most humans disapprove of their goals. However, since these organisations are staffed by humans, there are still pressures on them to be aligned with human values which won’t apply to groups of AGIs.
By contrast, Yudkowsky and Bostrom describe scenarios where a single AGI gains power primarily through technological breakthroughs, in a way that’s largely separate from the wider economy. The key assumption which distinguishes this category of scenarios from the previous category is that a single AGI will be able to gain enough power via such breakthroughs that it can seize control of the world. Previous descriptions of this type of scenario have featured superhuman nanotechnology, biotechnology, and hacking; however, detailed characterisations are difficult because the relevant technologies don’t yet exist. Yet it seems very likely that there exist some future technologies which would provide a decisive strategic advantage if possessed only by a single actor, and so the key factor influencing the plausibility of these scenarios is whether AI development will be rapid enough to allow such concentration of power, as I discuss below.
In either case, humans and aligned AIs end up with much less power than misaligned AIs, which could then appropriate our resources towards their own goals. An even worse scenario is if misaligned AGIs act in ways which are deliberately hostile to human values—for example, by making threats to force concessions from us. How can we avoid these scenarios? It’s tempting to aim directly towards the final goal of being able to align arbitrarily intelligent AIs, but I think that the most realistic time horizon to plan towards is the point when AIs are much better than humans at doing safety research. So our goal should be to ensure that those AIs are aligned, and that their safety research will be used to build their successors. Which category of disaster is most likely to prevent that depends not only on the intelligence, agency and goals of the AIs we end up developing, but also on the four factors listed above, which I’ll explore in more detail now.
Speed of AI development
If AI development proceeds very quickly, then our ability to react appropriately will be much lower. In particular, we should be interested in how long it will take for AGIs to proceed from human-level intelligence to superintelligence, which we’ll call the takeoff period. The history of systems like AlphaStar, AlphaGo and OpenAI Five provides some evidence that this takeoff period will be short: after a long development period, each of them was able to improve rapidly from top amateur level to superhuman performance. A similar phenomenon occurred during human evolution, where it only took us a few million years to become much more intelligent than chimpanzees. In our case one of the key factors was scaling up our brain hardware—which, as I have already discussed, will be much easier for AGIs than it was for humans.
While the question of what returns we will get from scaling up hardware and training time is an important one, in the long term the most important question is what returns we should expect from scaling up the intelligence of scientific researchers—because eventually AGIs themselves will be doing the vast majority of research in AI and related fields (in a process I’ve been calling recursive improvement). In particular, within the range of intelligence we’re interested in, will a given increase δ in the intelligence of an AGI increase the intelligence of the best successor that AGI can develop by more than or less than δ? If more, then recursive improvement will eventually speed up the rate of progress in AI research dramatically. In favour of this hypothesis, Yudkowsky argues:
The history of hominid evolution to date shows that it has not required exponentially greater amounts of evolutionary optimization to produce substantial real-world gains in cognitive performance—it did not require ten times the evolutionary interval to go from Homo erectus to Homo sapiens as from Australopithecus to Homo erectus. All compound interest returned on discoveries such as the invention of agriculture, or the invention of science, or the invention of computers, has occurred without any ability of humans to reinvest technological dividends to increase their brain sizes, speed up their neurons, or improve the low-level algorithms used by their neural circuitry. Since an AI can reinvest the fruits of its intelligence in larger brains, faster processing speeds, and improved low-level algorithms, we should expect an AI’s growth curves to be sharply above human growth curves.
I consider this a strong argument that the pace of progress will eventually become much faster than it currently is. I’m much less confident about when the speedup will occur—for example, the positive feedback loop outlined above might not make a big difference until AGIs are already superintelligent, so that the takeoff period (as defined above) is still quite slow. There has been particular pushback against the more extreme fast takeoff scenarios, which postulate a discontinuous jump in AI capabilities before AI has had transformative impacts. Some of the key arguments:
The development of AGI will be a competitive endeavour in which many researchers will aim to build general cognitive capabilities into their AIs, and will gradually improve at doing so. This makes it unlikely that there will be low-hanging fruit which, when picked, allow large jumps in capabilities. (Arguably, cultural evolution was this sort of low-hanging fruit during human evolution, which would explain why it facilitated such rapid progress.)
Compute availability, which on some views is the key driver of progress in AI, increases fairly continuously.
Historically, continuous technological progress has been much more common than discontinuous progress. For example, progress on chess-playing AIs was steady and predictable over many decades.
Note that these three arguments are all consistent with AI development progressing continuously but at an increasing pace, as AI systems contribute to it an increasing amount.
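To make the earlier question about returns on intelligence concrete, here is a toy formalisation (my own sketch, not part of the original argument). Suppose each generation of AI researchers produces a successor whose intelligence is a function of its own: $I_{n+1} = f(I_n)$. Asking whether an increase of $\delta$ in the developer’s intelligence yields more or less than $\delta$ in its successor is asking whether $f(I + \delta) - f(I) > \delta$ over the relevant range. If it does, then

$$I_{n+1} - I_n = f(I_n) - f(I_{n-1}) > I_n - I_{n-1},$$

so each generation improves on its predecessor by more than the previous generation did: progress can remain continuous while steadily accelerating, consistent with the note above.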
Transparency of AI systems
A transparent AI system is one whose thoughts and behaviour we can understand and predict; we could be more confident of maintaining control over an AGI if it were transparent. If we could tell when a system is planning treacherous behaviour, then we could shut it down before it gets the opportunity to carry out that plan. Note that such information would also be valuable for increasing human coordination on dealing with AGIs; and of course for training, as I discussed briefly in previous sections.
Hubinger lists three broad approaches to making AIs more transparent. One is by creating interpretability tools which allow us to analyse the internal functioning of an existing system. While our ability to interpret human and animal brains is not currently very robust, this is partly because research has been held back by the difficulty of making high-resolution measurements. By contrast, in neural networks we can read each weight and each activation directly, as well as individually changing them to see what happens. On the other hand, if our most advanced systems change rapidly, then previous transparency research may quickly become obsolete. In this respect, neuroscientists—who can study one brain architecture for decades—have it easier.
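As a concrete (if very simplified) illustration of the kind of access interpretability research relies on, here is a minimal PyTorch sketch, entirely my own rather than anything from the post or from Hubinger: it records a small network’s intermediate activations with a forward hook, then zeroes a single weight and re-runs the model to observe the effect of that intervention.

```python
import torch
import torch.nn as nn

# A tiny stand-in network; real systems differ in scale, not in the
# kind of access we have to their internals.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(1, 4)

# Read activations by attaching a forward hook to an intermediate layer.
activations = {}
def save_activation(module, inputs, output):
    activations["hidden"] = output.detach().clone()

hook = model[1].register_forward_hook(save_activation)
baseline = model(x)
hook.remove()
print("hidden activations:", activations["hidden"])

# Intervene on a single weight, then re-run the model. This is the kind of
# targeted "lesion" experiment that is much harder in a biological brain.
with torch.no_grad():
    model[0].weight[0, 0] = 0.0
ablated = model(x)
print("output change from ablating one weight:",
      (ablated - baseline).abs().max().item())
```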
A second approach is to create training incentives towards transparency. For example, we might reward an agent for explaining its thought processes, or for behaving in predictable ways. Interestingly, some hypotheses imply that this occurred during human evolution, which suggests that multi-agent interactions might be a useful way to create such incentives (if we can find a way to prevent incentives towards deception from also arising).
A third approach is to design algorithms and architectures that are inherently more interpretable. For example, a model-based planner like AlphaGo explores many possible branches of the game tree to decide which move to take. By examining which moves it explores, we can understand what it’s planning before it chooses a move. However, in doing so we rely on the fact that AlphaGo uses an exact model of Go. More general agents in larger environments will need to plan using compressed representations of those environments, which will by default be much less interpretable. It also remains to be seen whether transparency-friendly architectures and algorithms can be competitive with the performance of more opaque alternatives, but I strongly suspect not.
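To see why explicit search over an exact model is relatively easy to inspect, here is a toy sketch (my own, and far simpler than AlphaGo’s actual tree search): a tiny exhaustive planner for a trivial game which logs every branch it examines, so that its deliberation can be read off before it commits to a move.

```python
# Toy game: players alternately add 1 or 2 to a running total, and whoever
# reaches 10 or more on their move wins. The planner records every branch it
# examines, so we can inspect what it considered before it acts.
explored = []  # (total, move) pairs the planner looked at

def value_of(total):
    """Game value for the player to move, via exhaustive negamax search."""
    if total >= 10:
        return -1  # the previous move reached 10, so the player to move has lost
    best = -1
    for move in (1, 2):
        explored.append((total, move))
        best = max(best, -value_of(total + move))
    return best

def choose_move(total):
    """Pick the move whose resulting position is worst for the opponent."""
    best_move, best_value = None, -2
    for move in (1, 2):
        explored.append((total, move))
        value = -value_of(total + move)
        if value > best_value:
            best_move, best_value = move, value
    return best_move, best_value

move, value = choose_move(0)
print(f"chosen move: {move} (value {value})")
print(f"branches examined during planning: {len(explored)}")
```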
Despite the difficulties inherent in each of these approaches, one advantage we do have in transparency analysis is access to different versions of an AI over time. The cross-examination mechanism in Debate takes advantage of this. Or as a more pragmatic example, if AI systems which are slightly less intelligent than humans keep trying to deceive their supervisors, that’s pretty clear evidence that the more intelligent ones will do so as well. However, this approach is limited because it doesn’t allow us to identify unsafe plans until they affect behaviour. If the realisation that treachery is an option is always accompanied by the realisation that treachery won’t work yet, we might not observe behavioural warning signs until an AI arises which expects its treachery to succeed.
Constrained deployment strategies
If we consider my earlier analogy of a modern human dropped in the stone age, one key factor that would prevent them from taking over the world is that they would be “deployed” in a very constrained way. They could only be in one place at a time; they couldn’t travel or even send messages very rapidly; they would not be very robust to accidents; and there would be little existing infrastructure for them to leverage. By contrast, it takes much more compute to train deep learning systems than to run them—once an AGI has been trained, it will likely be relatively cheap to deploy many copies of it. A misaligned superintelligence with internet access will be able to create thousands of duplicates of itself, which we will have no control over, by buying (or hacking) the necessary hardware. At this point, our intuitions about the capabilities of a “single AGI” become outdated, and the “second species” terminology becomes more appropriate.
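A rough back-of-envelope calculation, using the common approximations that training a dense transformer with N parameters on D tokens costs about 6·N·D floating-point operations while generating one token costs about 2·N, illustrates the gap; the specific parameter and token counts below are hypothetical and not from the original post.

```python
# Hypothetical model size and dataset size, chosen only for illustration.
N = 1e11  # parameters
D = 1e12  # training tokens

train_flops = 6 * N * D             # rough total compute for one training run
inference_flops_per_token = 2 * N   # rough compute to generate a single token

tokens_per_training_run = train_flops / inference_flops_per_token
print(f"training compute: ~{train_flops:.1e} FLOPs")
print(f"equivalent inference: ~{tokens_per_training_run:.1e} generated tokens")
# The compute needed to train the system once could instead run a very large
# number of copies of the trained system, which is why cheap duplication after
# training is the default expectation.
```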
We can imagine trying to avoid this scenario by deploying AGIs in more constrained ways—for example by running them on secure hardware and only allowing them to take certain pre-approved actions (such as providing answers to questions). This seems significantly safer. However, it also seems less likely in a competitive marketplace—judging by today’s trends, a more plausible outcome is for almost everyone to have access to an AGI personal assistant via their phone. This brings us to the fourth factor:
Human political and economic coordination
By default, we shouldn’t rely on a high level of coordination to prevent AGI safety problems. We haven’t yet been able to coordinate adequately to prevent global warming, which is a well-documented, gradually-worsening problem. In the case of AGI deployment, the extrapolation from current behaviour to future danger is much harder to model clearly. Meanwhile, in the absence of technical solutions to safety problems, there will be strong short-term economic incentives to ignore the lack of safety guarantees about speculative future events.
However, this is very dependent on the three previous points. It will be much easier to build a consensus on how to deal with superintelligence if AI systems approach and then surpass human-level performance over a timeframe of decades, rather than weeks or months. This is particularly true if less-capable systems display misbehaviour which would clearly be catastrophic if performed by more capable agents. Meanwhile, different actors who might be at the forefront of AGI development—governments, companies, nonprofits—will vary in their responsiveness to safety concerns, cooperativeness, and ability to implement constrained deployment strategies. And the more of them are involved, the harder coordination between them will be.
[1] For an exploration of the possible consequences of software-based intelligence (as distinct from the consequences of increased intelligence), see Hanson’s Age of Em.

Comments
Why would you expect it to be “us” versus “the AI” (or “the AIs”)? Where’s this “us” coming from?
I would think it would be very, very likely for humans to try to enlist AGIs as allies in their conflicts with other humans, to rush the development and deployment of such AGIs, to accept otherwise unacceptable risks of mistakes, to intentionally remove constraints they’d otherwise put on the AGIs’ actions, and to give the AGIs more resources than they’d otherwise get. It’s not just that you can’t rely on a high level of coordination; it’s that you can rely on a high level of active conflict.
There’ll always be the assumption that if you don’t do it first, the other guy will do it to you. And you may rightly perceive the other guy as less aligned with you than the AGI is, even if the AGI is not perfectly aligned with you either.
Of course, you could be wrong about that, too, in which case the AGI can let the two of you fight, and then mop up the survivors. Probably using the strategy and tactics module you installed.
It seems like all of the very large advancements in AI have been in areas where either 1) we can accurately simulate an environment and final reward (like a chess or video game) in order to generate massive training data, or 2) we have massive data we can use for training (e.g. the internet for GPT).
For some things, like communicating and negotiating with other AIs, or advancing mathematics, or even (unfortunately) things like hacking, fast progress in the near-future seems very (scarily!) plausible. And humans will help make AIs much more capable by simply connecting together lots of different AI systems (e.g. image recognition, LLMs, internet access, calculators, other tools, etc) and allowing self-querying and query loops. Other things (like physical coordination IRL) seem harder to advance rapidly, because you have to rely on generalization and a relatively low amount of relevant data.
Contrary to the points against extremely rapid takeoff, here are some points for it:
1. AI systems now exist to help more rapidly develop and improve subsequent ones.
2. Which will help even more rapidly develop subsequent ones, and so on.
3. Until at least one of them is allowed to near-instantly develop an even better version of itself, which will happen somewhere, somehow.
4. And then, again and again.
5. User adoption is already faster than for any other single product ever released.
6. The pace of which, as we all know, keeps accelerating.
Why is this the right framing? Why not focus on the duration between 50% human-level and superintelligence? (Or p% human-level for general p.)
Which previous arguments are you referring to?
The rest of the AGI safety from first principles sequence. This is the penultimate section; sorry if that wasn’t apparent. For the rest of it, start here.