Avoiding perpetual risk from TAI

Stephen Casper, scasper@mit.edu. Thanks to Rose Hadshar and Daniel Dewey for discussions and feedback.
The goal of this post is to sort through some questions involving the difficulty of avoiding perpetual risk from transformative AI. Feedback in the comments is welcome!
TL;DR
Getting AI to go well means that at some point in time, the acute period of risk posed by the onset of transformative AI must end. Ending that period will require establishing a regime for transformative AI that is exclusive, benevolent, stable, and successful at alignment. This post argues that this type of regime may be very difficult to establish and that this will largely be an AI governance problem. If so, this gives an argument for emphasizing work on AI safety challenges other than just alignment.
Intro
Often, at least in conversation, we talk about how we need “a solution” to AI safety. And maybe AI safety is a problem that has a once-and-for-all solution. If we can avoid doom, maybe we can use aligned AI to solve our key challenges, avoid all other X-risks, and set ourselves on a sustainable course for the future. This seems possible, and it would be great. Thinking in these terms goes back at least to Nick Bostrom’s Superintelligence, where Bostrom discusses the possibility that, just as extinction is an attractor state, meeting our cosmic endowment may be one as well. If we can get AI right just once, maybe that’s the last key challenge we need to solve.
From Superintelligence:
However, achieving that kind of “immortality” would require a lot of things to go right. Establishing a regime that avoids perpetual risk will be complex, and it will involve much more than just figuring out how to align AI systems with our goals.
Five requirements for avoiding perpetual risk
Suppose that highly transformative AI (TAI) technologies are someday developed and that they are powerful enough to do catastrophically dangerous things. I will refer to whatever set of institutions controls the TAI, and could cause catastrophic risks with it, as the TAI regime. The regime could consist of human institutions, AI institutions, or both:
TAI regime = whatever set of actors controls TAI
To avoid perpetual AI risk, five things need to be true about the TAI regime, and to the extent that any of them are false, the probability of perpetual risk will substantially increase (a toy numerical sketch of this conjunctive point follows the five requirements below). Note that this only applies within our cosmic sphere of influence.
Exclusivity
There will almost certainly exist bad actors who would cause major risks if they could create and deploy TAI without being stopped at some point in the process. (The same is true of nuclear weapons, for example.) To avoid this, the regime needs to be effectively closed to those who would cause havoc if they joined it.
Benevolence
The TAI regime needs to be one that will try to lock in a stable and good future instead of pursuing evil or myopic goals.
Stability
The regime needs to be stable over time and not be overthrown or degenerate into an unaligned one.
Alignment
The TAI regime needs to solve outer alignment and inner alignment sufficiently well to ensure that their TAI does not itself cause major risks.
Resilience
The ability to respond to and survive disasters is also crucial if and when risks become reality.
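To make the conjunctive structure of this argument concrete, here is a minimal, purely illustrative sketch in Python. The numbers are made up and the independence assumption is a simplification, not a claim about real-world probabilities; the point is only that several individually plausible requirements can compound into a much less likely conjunction.

```python
# Toy illustration only: hypothetical, made-up probabilities and an
# (unrealistic) independence assumption, just to show how conjunctive
# requirements compound.
requirements = {
    "exclusivity": 0.8,
    "benevolence": 0.8,
    "stability": 0.8,
    "alignment": 0.8,
    "resilience": 0.8,
}

p_all_hold = 1.0
for p in requirements.values():
    p_all_hold *= p  # all five must hold, so multiply

print(f"P(all five requirements hold) ≈ {p_all_hold:.2f}")  # ≈ 0.33
```

Under these made-up numbers, five requirements that each hold 80% of the time jointly hold only about a third of the time, which is the intuition behind treating each of them as a serious problem in its own right.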
Each of these five may be difficult
Exclusivity
Developing TAI may offer substantial first-mover advantages that give the first members of the regime a lot of influence. And there is some precedent for sectors that are very hard for new actors to break into and compete in, such as the search engine industry. But exclusivity still seems hard.
The research and development ecosystems for AI are competitive. Due in part to arms races and copycatting, there is a lot of precedent for different groups making similar breakthroughs at around the same time. For example, OpenAI’s DALL-E 2, Google’s Imagen, Midjourney’s Midjourney, Meta’s Make-A-Video, and NVIDIA’s eDiff-I were all released between April and November of 2022. It may be difficult for the earliest developers of powerful TAI to outcompete and exclude others right on their tail.
The mechanisms by which the members of a nascent TAI regime could exclude competitors might be difficult, illegal, or both. Espionage is illegal. Antitrust legislation might make it hard to simply outcompete rivals. And if a potential new regime member is in a different part of the world than the existing members, it may be especially hard to exclude them.
If the regime is not unipolar, standard coordination problems and tragedies of the commons may arise over who should pay for exclusionary actions.
Benevolence
Extreme power might make one more inclined to be altruistic, but this seems to be a tenuous hope at best.
Being primarily altruistic and trying to make the world a good, non-risky place to live doesn’t seem to be the norm among powerful companies or people. Precedent suggests that if it is possible to make money off of something, some company will want to do it even if it is risky for the planet; the fossil fuel industry is one example. And even immense wealth and power may not drive entities to benevolence. For example, benevolence is not the norm among the world’s billionaires, who already have more than enough money to buy everything a single person could need or want.
Members of the AI safety community are not a majority among AI researchers or developers.
The alignment tax might be large, so there may be a strong disincentive to pay it, and those who are willing to pay it may be uncompetitive.
Stability
Structures such as constitutions that create checks and balances for regimes seem good and probably necessary for a tenable TAI regime. But things could go wrong even if the TAI regime is well-structured.
Corruption happens. Most governments and big companies develop it. There’s nothing like having a substantial amount of control over the world to incentivize internal power grabs.
Regimes can be overhauled or overthrown. Powerful competitors, uprisings, or armies might make this likely. One of the longest-lasting continuous governments in world history, the Republic of Venice, survived for roughly 1,100 years, underwent substantial evolution over that period, and still eventually fell.
Alignment
These challenges are simple to state.
Outer alignment is hard.
Inner alignment is hard.
Hopefully though, containment, corrigibility, and off-switches can temper the risks of failures here.
Resilience
Resilience is hard because it is such a large and complex challenge, involving politics, government, logistics, biorisk, cybersecurity, and more.
What might this mean for AI safety work?
Consider a taxonomy of AI safety strategies that groups them into three types.
Strategy 1: Making it easier to make safer AI. The whole field of AI alignment is the key example of this. But this also includes governance strategies that promote safer work or establish healthy norms in the research and development ecosystems.
Strategy 2: Making it harder to make less safe AI. Examples include establishing regulatory agencies, auditing companies, auditing models, creating painful bureaucracy around building risky AI systems, influencing hardware supply chains to slow things down, and avoiding arms races.
Strategy 3: Responding to problems as they arise. There might be a decent amount of time to act between the creation of an X-risky AI and extinction from it. This is especially true if extinction would happen via a cascade of globally destabilizing events. It would probably be hard for TAI systems to gain influence over the world for some of the same reasons it’s hard for people, companies, and countries to do the same. The world seems too big and too able to adapt and fight back for the path to extinction to be very short. Given this, some strategies to make us more resilient might be very useful, including giving governments the power to rapidly detect and respond to firms doing risky things with TAI, kill switches for global finance or the internet, cybersecurity, and generally being more resilient to catastrophes as a global community.
Note that we could also consider a fourth category: meta work that aims to build good paradigms and healthy institutions. But this isn’t direct work, so I’ll only mention it here on the side.
Non-alignment aspects of AI safety are key.
Strategy 1 only addresses ensuring that the TAI regime is successful at alignment. Strategy 2 is key for exclusivity and benevolence. And Strategy 3 is useful for stability and resilience against disasters if we end up in a regime of significant risk.
I think this is important to bear in mind because the most common and most interesting types of AI safety work, the ones that are easiest to nerd-snipe researchers with, seem to fall into Strategy 1. But Strategies 2 and 3 may be at least as important to work on, if not more so. As someone who works on problems in Strategy 1, I am currently thinking about whether I should shift more of my work toward 2 and 3. These seem to be mostly (but not entirely) governance-related problems that are relatively neglected. I’d appreciate feedback and discussion in the comments.