Avoiding perpetual risk from TAI
Stephen Casper, scasper@mit.edu. Thanks to Rose Hadshar and Daniel Dewey for some discussions and feedback.
The goal of this post is to sort through some questions involving the difficulty of avoiding perpetual risk from transformative AI. Feedback in the comments is welcome!
TL;DR
Getting AI to go well means that at some point in time, the acute period of risk posed by the onset of transformative AI must end. Ending that period will require establishing a regime for transformative AI that is exclusive, benevolent, stable, and successful at alignment. This post argues that this type of regime may be very difficult to establish and that this will largely be an AI governance problem. If so, this gives an argument for emphasizing work on AI safety challenges other than just alignment.
Intro
Often, at least in conversation, we talk about how we need “a solution” to AI safety. And maybe AI safety is a problem that has a once-and-for-all solution. If we can avoid doom, maybe we can use aligned AI to solve our key challenges, avoid all other X-risks, and set ourselves on a sustainable course for the future. This seems possible, and it would be great. Thinking in these terms dates back at least to Nick Bostrom’s Superintelligence. Bostrom discusses the possibility that, just as extinction is an attractor state, realizing our cosmic endowment may be one as well. If we can get AI right just once, maybe that’s the last key challenge we need to solve.
From Superintelligence:
However, achieving immortality would mean a lot of things would have to go right. Establishing a regime that avoids perpetual risk will be complex, and it will definitely involve more than just figuring out how to align AI systems with our goals.
Five requirements for avoiding perpetual risk
Suppose that highly transformative AI (TAI) technologies are someday developed and that they are powerful enough to do catastrophically dangerous things. I will define the TAI regime as whatever set of institutions controls the TAI and could cause catastrophic risks with it. The regime could consist of human institutions, AI institutions, or both:
TAI regime = whatever set of actors controls TAI
To avoid perpetual AI risk, five things need to be true about the TAI regime. And to the extent that any are false, the probability of perpetual risk will substantially increase. Note that this only applies within some sort of cosmic sphere of influence.
Exclusivity
There will almost certainly exist bad actors who would cause major risks if they were able to create and deploy TAI without being stopped at some point in the process. (The same is true of nuclear weapons.) To avoid this, the regime needs to be effectively closed to those who would cause havoc if they joined it.
Benevolence
The TAI regime needs to be one that will try to lock in a stable and good future instead of pursuing more evil or myopic goals.
Stability
The regime needs to be stable over time and not be overthrown or degenerate into an unaligned one.
Alignment
The TAI regime needs to solve outer alignment and inner alignment sufficiently well to ensure that their TAI does not itself cause major risks.
Resilience
The ability to respond to and survive disasters is also really nice to have if/when risks become reality.
Each of these five may be difficult
Exclusivity
Developing TAI may offer substantial first-mover advantages that give the first members of the regime a lot of influence. And there is some precedent for sectors, such as the search engine industry, that are really hard for new actors to break into and compete in. But exclusivity still seems hard.
The research and development ecosystems for AI are competitive. Due in part to arms races and copycatting, there is a lot of precedent for different groups making similar breakthroughs at the same time. For example, OpenAI’s DALL-E 2, Google’s Imagen, Midjourney’s Midjourney, Meta’s Make-A-Video, and NVIDIA’s eDiff-I were all released between April and November of this year. It may be difficult for the earliest developers of powerful TAI to outcompete and exclude others right on their tail.
The mechanisms by which the members of a nascent TAI regime could exclude competitors might be hard and/or illegal. Espionage is illegal. Antitrust legislation might make it hard to simply outcompete competitors. And if a potential new regime member is in a different part of the world than the existing members, it may be especially hard to exclude them.
If the regime is not unipolar, standard coordination problems and tragedies of the commons may arise involving the question of who should pay for exclusionary actions.
Benevolence
Extreme power might make one more inclined to be altruistic, but this seems to be a tenuous hope at best.
Being primarily altruistic and trying to make the world a good, non-risky place to live in doesn’t seem to be the norm among powerful companies or people. Precedent suggests that if it is possible to make money off of something, some company will want to do it even if it is risky for the planet. The fossil fuel industry is an example of this. And even having immense amounts of wealth and power may not drive entities to benevolence. For example, this isn’t the case most of the time among the billionaires of the world who already have more than enough money to buy everything a single person could need or want.
Members of the AI safety community are not a majority among AI researchers or developers.
The alignment tax might be expensive, so there may be a strong disincentive to pay it, and those who are willing to do so may be uncompetitive.
Stability
Structures such as constitutions that create checks and balances for regimes seem good and probably necessary for a tenable TAI regime. But things could go wrong even if the TAI regime is well-structured.
Corruption happens. Most governments and big companies develop it. There’s nothing like having a substantial amount of control over the world to incentivize internal power grabs.
Regimes can be overhauled or overthrown. Powerful competitors, uprisings, or armies might make this likely. The longest-lasting continuous government in world history was the Republic of Venice, which lasted 1,100 years and underwent substantial evolution over that period.
Alignment
These troubles are simple to state.
Outer alignment is hard.
Inner alignment is hard.
Hopefully though, containment, corrigibility, and off-switches can temper the risks of failures here.
Resilience
This can be considered hard because it is such a large and complex problem and involves problems with politics, governments, logistics, bio-risk, cybersecurity, etc.
What might this mean for AI safety work?
Consider a taxonomy of AI safety strategies that groups them into three types.
Strategy 1: Making it easier to make safer AI. The whole field of AI alignment is the key example of this. But this also includes governance strategies that promote safer work or establish healthy norms in the research and development ecosystems.
Strategy 2: Making it harder to make less safe AI. Examples include establishing regulatory agencies, auditing companies, auditing models, creating painful bureaucracy around building risky AI systems, influencing hardware supply chains to slow things down, and avoiding arms races.
Strategy 3: Responding to problems as they arise. There might be a decent amount of time to act between the creation of an X-risky AI and extinction from it. This is especially true if the extinction happens via a cascade of globally-destabilizing events. It would probably be hard for TAI systems to gain influence over the world for some of the same reasons it’s hard for people/companies/countries to do the same. The world seems awfully big and able to adapt/fight-back for the path to extinction to be super short. Given this, some strategies to make us more resilient might be very useful including giving governments powers to rapidly detect and respond to firms doing risky things with TAI, hitting killswitches involving global finance or the internet, cybersecurity, and generally being more resilient to catastrophes as a global community.
Note that we could also consider a fourth category: meta work that aims to build good paradigms and healthy institutions. But this isn’t direct work, so I’ll only mention it here on the side.
Non-alignment aspects of AI safety are key.
Strategy 1 only addresses ensuring that the TAI regime is successful at alignment. Strategy 2 is key for exclusivity and benevolence. And Strategy 3 is useful for stability and resilience against disasters if we end up in a regime of significant risk.
I think this is important to bear in mind because the most common and most interesting types of AI safety work, the ones that researchers are most easily nerd-sniped by, seem to fall into Strategy 1. But Strategies 2 and 3 may be at least as important to work on, if not more. As someone who works on problems in Strategy 1, I am currently thinking about whether I should work more toward 2 and 3. These seem to be mostly (but not entirely) governance-related problems that are relatively neglected. I’d appreciate feedback and discussion in the comments.
I’m a bit skeptical about calling this an “AI governance” problem. This sounds more like “governance” or maybe “existential risk governance”—if future technologies make irreversible destruction increasingly easy, how can we govern the world to avoid certain eventual doom?
Handling that involves political challenges, fundamental tradeoffs, institutional design problems, etc., but I don’t think it’s distinctive to risks posed by AI, don’t think that a solution necessarily involves AI, don’t think it’s right to view “access to TAI” as the only or primary lever of political power to prevent destructive acts, and I’m not convinced that this problem should be addressed by a community focused on AI in particular.
It seems good for people to think about the general long-term challenge as well as to think about the concrete possible destructive technologies on the horizon, in case there is narrower work that can help mitigate the risks they pose and thereby delay the need to implement a general solution. But in some sense this is just “delaying the inevitable.”
I wrote some of my thoughts on this relationship in Handling destructive technology.
One potential difference is that I don’t see TAI as automatically posing a catastrophic risk. Alignment itself could pose a catastrophic risk. But if we resolve that, then I think we get some (unknown) amount of subjective time until the next thing goes wrong, which might be AI enabling access to destructive physical technology or might be something more conceptually gnarly. The further off that next risk is, the more political change is likely to happen in the interim.
This is an interesting point. But I’m not convinced, at least immediately, that this isn’t likely to be largely a matter of AI governance.
There is a long list of governance strategies that aren’t specific to AI that can help us handle perpetual risk. But there is also a long list of strategies that are. I think that all of the things I mentioned under strategy 2 have AI specific examples:
And I think that some of the things I mentioned for strategy 3 do too:
So ultimately, I won’t make claims about whether avoiding perpetual risk is mostly an AI governance problem or mostly a more general governance problem, but certainly there are a bunch of AI specific things in this domain. I also think they might be a bit neglected relative to some of the strategy 1 stuff.
The opposites of those four requirements also sound pretty good.
Exclusivity vs. Corrigibility
Humans that are being harmed should be able to effectively steer the AI to cease hurting them.
Benevolence vs. Servitude
The AI should serve humans and not put its own goals ahead of others.
Stability vs. Responsivity
The AI should stay relevant and answer challenges to its existence. It should keep up with the world and not become out of distribution by turning into a relic.
Success at alignment vs. Fallibility
A minor mistake should not spell doom to the world. The setup should fail gracefully and accept fixes.
I’m not sure I understand what you mean. As I understand it, this comment seems like a bit of a non sequitur to the post. First, I don’t agree that any of the four pairs you mentioned are opposites. Second, it seems to me like you’re talking about a specific AI system, and not a TAI regime like I am.
I’m going to say that while strategy 1 isn’t going to solve all the problems, it might also solve benevolence, primarily because I think the AI alignment problem is far more general than, say, the problem of aligning states to their citizens. It’s much more like the problem of humans aligning to animals, and here the evidence is mostly depressing, with the exception of pets, more or less.
Compared to states being aligned to citizens, where we actually have mechanisms that work imperfectly, in human-to-animal alignment, there aren’t mechanisms that work at all, short of pets.
I think several factors contribute to the problem:
A much more capable party can, for the most part, ignore restraints like laws or contracts, so outcomes depend on that party’s own goals, which are usually misaligned.
We depend on the fact that there aren’t that many differences in behavior, intelligence, and so on, and thus if that assumption breaks, things get bad fast. This is also known as the IID distribution on capabilities assumption.
Thus, success on strategy 1, especially if it can be extended to arbitrarily large inequalities in capabilities like intelligence, can essentially solve many of the special cases of alignment problems like states aligned to citizens.
I see this point about how making it easier to build safer AI can help to solve the benevolence problem by making the benevolent agents more competitive and thus lowering the effective alignment tax. This is a good point.
But I would note that this only applies to the extent that one’s approach to strategy 1 means focusing on helping people working on safer AI do it more effectively. This does not include AI alignment goals. Ultimately, if a terrorist has a powerful AI system that is well-aligned with their goals, that’s very bad.