AI Safety Endgame Stories
Assume you are in the set of possible worlds where AI takeover happens by default. If you do nothing, then at some point in the 21st century the AI lab Magma develops a transformative AI system. Magma employees perform a number of safety checks, conclude the system is safe enough, and deploy it. They deploy it slowly and incrementally, with careful monitoring. But despite their efforts, the system turns out to be unsafe and the monitoring insufficient, triggering a cascade of events eventually leading to an existential catastrophe.[1]
I’ll refer to this sequence of events as the “baseline story” going forward.
Assume further that you’re in the narrower set of worlds where this AI catastrophe is contingent on your actions. In other words, there exists a sequence of actions you (or your organization) can take that averts catastrophe, a decisive intervention. Not necessarily a pivotal act, an intervention that averts all existential risk from AI. Just an intervention that prevents this specific Magma catastrophe, giving humanity some breathing room, perhaps only a few months or years.[2]
Let’s try to understand what this decisive sequence of actions could look like. It’s tempting to start at the beginning of the sequence and think about what the first few actions look like. Unfortunately, the most probable starting actions are “meta” actions like thinking really hard, talking to experts, or recruiting more people to work on the problem. These are the same kinds of actions that any successful project starts with! So it doesn’t help us constrain the space of decisive interventions.
Instead, it’s more helpful to start with the endgame: how, in the end, did your actions change the baseline story and avert catastrophe? And what were the last nodes in the causal chain leading up to the change?
At the most abstract level, the baseline story has the following structure. A social process (Magma) instantiates a technological artifact (unsafe AI) which destroys the world. There are three objects here, and any change to the story requires changing the properties of at least one of them.
This leads naturally to a 3-way categorization of endgame stories, though the real endgame story will probably involve some combination of the three:
Changing the technology. You changed something about the technology that Magma had access to, which delayed the deployment or made it safe.
Changing the decision maker. You changed something about Magma, or more broadly the social decision process that led to the unsafe deployment.
Changing the broader world. You changed something about the broader world that made it resilient to Magma’s deployment decision.
In the rest of this post I’ll use this categorization to walk through a diverse array of endgame stories.
Changing the Technology
Differential Development of Safety
Let’s start with the broad endgame story that most technical alignment research is aimed at, differential development of safety technology (a special case of differential technological development):
You develop technology that makes AI safer, with mild competitiveness penalties. Because of your direct work, the technology is developed N years faster than it would have been by default. Magma’s deployment falls within that N-year window, so they use your technology, so their deployment does not lead to catastrophe.
Can we make this story more concrete? How exactly does the new technology prevent catastrophe? The simplest path involves finding decisive safety augmentation, something like “add an adversarial robustness term to the training objective”—a conceptual breakthrough that Magma adopts immediately because it is relatively easy to implement and aligned with business objectives.
Another simple path involves decisive monitoring technology, like in Chris Olah’s “interpretability gives you a mulligan” story: differentially advanced interpretability tech allows Magma to detect that the AI is unsafe and abort the deployment.
Other stories involve deeper changes in the technological landscape. The endgame of proposals like Iterated Amplification is to differentially advance a safer way to build ML systems for the same tasks. Perhaps in the baseline story, Magma uses model-based RL to train its model; but a concerted research effort manages to advance amplification capability so much that Magma changes their technology stack and uses amplification instead. The Microscope AI endgame involves an even deeper change, where differential interpretability progress leads Magma to use humans with AI-enhanced understanding instead of deploying an agentic AI system at all. Perhaps the most extreme differential-development endgame is the old MIRI strategy of building a safe-by-construction AGI from first principles, routing around modern ML altogether.
There are also differential development endgame stories that don’t involve AI technology at all—you could instead advance a technology that is an economic substitute for AI such as brain emulation. More speculatively, you could slow down AI development by advancing technologies that profitably redirect key inputs to AI such as compute or software engineering talent—the 2021 crypto boom may have had this effect by accident.
Stretching the definition of technology to include conceptual understanding, another differential development endgame story involves improving our understanding of AI systems in some way, e.g. discovering that large generative models exhibit unpredictable capability jumps as they scale. Knowledge of such a phenomenon could help Magma take the decisive safety precaution that averts catastrophe.
Note that in most cases, just developing the technology is not enough; Magma also needs to know about the technology, and needs to have the ability and incentive to implement or integrate it. Any knowledge you share with Magma is likely to be dual-use; many safety improvements depend on insights that could be used by Magma to advance capabilities instead.
Differential development endgame stories can seem implausible, especially if you’re thinking about interventions on the scale of an individual or a small team. There are, or soon will be, millions of AI researchers and engineers worldwide. How can you possibly reshape the technological landscape enough to get Magma to deploy a substantially different system? One answer is to leverage technological attractor states.
Technological Attractor States
There are strong incentives for researchers and engineers to work on systems that are state-of-the-art. When a new technology becomes state-of-the-art, everyone starts using it and developing techniques to improve it, quickly amplifying what may have been a small performance difference into an insurmountable gulf. Because of this feedback loop, technological development can fall into one of several different self-reinforcing paths, or attractor states. And a very small push at the right time—perhaps just a single compelling prototype or research paper—could change the attractor the world falls into.
To illustrate the key dynamic with a stylized endgame story:
It turns out there are two different ways to build transformative AI, one of which is safe and the other isn’t. Safe AI requires 2x the compute for the same downstream task performance. There are 10 tricky algorithmic improvements like dropout to be discovered, each of which improves compute efficiency 2x. But, critically, there are totally distinct improvements for the two trees: insight doesn’t transfer between approaches, like knowing about dropout doesn’t help you train better SVMs. By default, unsafe AI will win out, because it’s more competitive. But a well-timed burst of research could discover 2 efficiency improvements for safe AI, making it state-of-the-art. Nearly all researchers and corporate labs switch to safe AI. Because the field’s attention is on the safe AI approach, more and more improvements get discovered, and the unsafe AI path falls further and further behind. Eventually, maybe many decades later, Magma trains a transformative AI system, but because of that well-timed burst of research it is safe.
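To make the feedback loop concrete, here is a toy Python sketch of the stylized story above. It is an illustration under strong assumptions, not a model of real research dynamics. The names `simulate`, `SAFE_PENALTY`, `GAIN` and `MAX_IMPROVEMENTS` are made up for this example, and so is the rule that the whole field works on whichever approach is strictly ahead, with ties going to the incumbent unsafe approach.

```python
SAFE_PENALTY = 0.5      # safe AI starts at half the compute efficiency (the 2x cost)
GAIN = 2.0              # each algorithmic improvement doubles compute efficiency
MAX_IMPROVEMENTS = 10   # improvements available to each approach


def efficiency(base, found):
    """Compute efficiency after `found` improvements, relative to unsafe AI today."""
    return base * GAIN ** found


def simulate(burst, rounds=8):
    """Return (safe_found, unsafe_found) after `rounds` rounds of discovery.

    `burst` is the number of safe-AI improvements found by the early,
    well-timed research push before the field chooses where to work.
    """
    safe_found, unsafe_found = burst, 0
    for _ in range(rounds):
        safe_eff = efficiency(SAFE_PENALTY, safe_found)
        unsafe_eff = efficiency(1.0, unsafe_found)
        # Assumption: everyone works on the approach that is strictly ahead,
        # and finds one improvement for it per round. Ties go to unsafe AI.
        if safe_eff > unsafe_eff:
            safe_found = min(safe_found + 1, MAX_IMPROVEMENTS)
        else:
            unsafe_found = min(unsafe_found + 1, MAX_IMPROVEMENTS)
    return safe_found, unsafe_found


for burst in (0, 1, 2):
    print(burst, simulate(burst))
```

Under these assumptions a burst of one improvement only ties the two approaches, so the field stays on the unsafe path, while a burst of two tips it onto the safe path and the gap then widens every round. That matches the “2 efficiency improvements” in the story.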
A promising concrete endgame story along these lines is Ought’s plan to avoid the dangerous attractor state of AI systems that are optimized end-to-end (“outcome-based systems”) by differentially advancing the capabilities of process-based systems. Process-based systems (i.e. systems that use human-understandable processes to make decisions) may be an attractor state because they are more composable: if most economically valuable tasks can be solved by composing together a few existing systems in a human-understandable way, the incentive for end-to-end optimization is much lower. Just as today you wouldn’t train a neural net if you could just write a few lines of Python code instead.
The related endgame story of Comprehensive AI Services is that we might be able to navigate to a benign attractor state where safe, narrow tool AIs can do everything a general agent-like AI might do. It’s less clear what a decisive intervention would look like, perhaps designing a broadly adopted protocol that interfaces between narrow AI systems.
Changing the Decision Maker
Let’s now assume the technological landscape is fixed, and investigate how we might change the social decision process that causes the catastrophe. We’ll use a broad definition of what counts as the decision process, one that includes not only Magma employees but also Magma’s investors, regulators, cultural influencers, and competitors.
Defusing Races
A key driver of AI risk is what Ajeya Cotra calls the “racing forward” assumption, that at least one powerful organization will be trying to train the most powerful models possible. Others have called this the “AGI race dynamic”. What does a story for defusing this race dynamic look like?
The global AI treaty story involves dramatically increasing global coordination on AI:
You engage in some political process and your intervention leads to a global AI control agreement, analogous to nuclear weapons non-proliferation treaties. The agreement specifies the kinds of AI systems that should not be built, or a set of safeguards that any deployer of advanced AI systems must implement. It has enough enforcement power to actually affect the behavior of the leading AI lab. Because of its compliance with the treaty, Magma doesn’t train or deploy the dangerous AI system.
For those cynical about global political processes, remember that the goal is not to write a treaty that permanently averts AI x-risk: delaying catastrophe by only a few months with some simple safety measures may give enough breathing room for one of the “changing technology” stories to bring us more durable security.
Plausible modifications to the story include aiming for regulations in specific countries (especially the US and China) instead of global coordination, or developing agreements between the leading AI labs that bypass the political process altogether. More locally, individual AI labs can make commitments like the Windfall clause and OpenAI’s “join the leader”[3] clause, which may help defuse race dynamics.
Because the impacts and risks from AI are so uncertain, it may be that a static treaty is insufficient. As an alternative path to defusing race dynamics, you could create an organization that helps dynamically coordinate safety efforts across the leading AI labs, such as Holden’s hypothetical “IAIA”. The work of the International Atomic Energy Agency is analogous here, since it also deals with a powerful dual-use technology and seeks to promote its positive uses while preventing negative effects from military use and civilian accidents.
Changing Magma’s Culture
If Magma looks anything like existing tech companies, its employees have a great deal of power. They are not simple interchangeable cogs in a profit-maximizing machine; their beliefs and habits strongly influence Magma’s behavior. Hence many plausible endgame stories go through influencing Magma’s employees, for instance:
Because of your work translating AI safety ideas for an ML research audience, ML researchers strongly prefer to work for companies that strongly commit to safe deployment practices. Because ML research talent is the scarcest resource in AI development, Magma is forced to make a strong enforceable commitment to safety, which averts catastrophe.
Alternatives to this story involve the creation of strong norms among ML professionals analogous to extant norms for geneticists, doctors, safety engineers, and cybersecurity experts. You could also influence Magma employees through the general public: popularizing AI risks widely makes unsafe AI companies unfashionable to work for, like cigarette or oil companies today.
The simplest endgame story that leverages employee power is almost trivial:
You become the key decision maker in Magma—perhaps the CEO, or the swing vote if it’s a committee decision. You decide not to train or not to deploy the AI system, averting the catastrophe.
This is an endgame only relevant for a very small set of actors, but a critically important one. There are also many promising meta strategies that indirectly lead to this endgame: you could help someone else to become this key decision maker, or influence the key decision maker by giving them relevant information.
Replacing Magma
A simple replacement story only changes the identity of the organization:
Because of your intervention—perhaps funding it, starting it, or joining it—a different organization leads the AI race, changing the decision maker in the story to be “SafeOrg” rather than Magma. SafeOrg is more risk-averse or better-informed than Magma, so it doesn’t deploy the dangerous AI system that Magma would have.
Notice that this is not a true success story; by default, Magma will still make its unsafe deployment, at roughly the same time as in the baseline story. Maybe even earlier, if there’s any knowledge transfer from SafeOrg to Magma. SafeOrg must use its capability lead to stop Magma. But how?
It could use the lead time to perform safety and monitoring work, essentially implementing one of the “changing technology” strategies above. It could use its capabilities lead as leverage somehow to influence Magma to delay deployment. It could use its capabilities lead to make the world generally safer when Magma does deploy its AI. At the most extreme end, it could use its capabilities lead to perform a pivotal act that leads to permanent existential security.
But wait—none of these stories necessarily require a capabilities lead! Capabilities here are used as just another form of power, mostly fungible with money, cultural or political influence. So the “replace Magma” story is not really an endgame, but rather a meta strategy to amplify philanthropic capital. You started out with $1B and turned it into $100B by investing it in an AI company; now you can use your $100B to prevent AI x-risk. It is not a replacement for a direct endgame strategy like differential development, but a meta strategy that can be compared to other amplifiers like community-building, investing in financial markets, and political lobbying.
The exception is stories where a capabilities lead is not fungible with other forms of power. For example, executing one of the “change the technology” strategies may require access to very high levels of capability: OpenAI and Anthropic’s alignment strategies are both predicated on this. In theory you could just pay or influence Magma to give you access to their technology, but transaction costs like lack of trust could make such an agreement unworkable in practice. Being the capabilities leader also gives you outsized influence on changing culture and setting norms, like when OpenAI’s decision to not open source GPT-2 helped set a norm of delaying the release of state-of-the-art language models.
Changing the Broader World
Let’s now assume you can’t affect Magma or its deployment process at all. The deployment will happen regardless; how could you change the broader world to be more resilient?
This is the hardest endgame to think about in the abstract, because the type of resilience needed depends on the details of the specific failure story. If the failure story involves hacking, you might patch critical information security vulnerabilities. If it involves superhuman persuasion or propaganda, you might harden social media, isolate key decision makers from the Internet, or develop ways to inoculate people against “memetic plagues”. If it involves biorisk, you might regulate on-demand DNA synthesis companies or increase pandemic preparedness.
You could improve social decision making by raising the sanity waterline, developing better research assistants, or improving institutions. Such broad interventions are not precisely targeted at mitigating x-risk, and could even increase it in the wrong hands, so they may be best deployed strategically.
A broadly applicable intervention that is targeted precisely at x-risk is building shelters and refuges, from nuclear submarines to space colonies. Shelters will not save us from the deadliest version of AI x-risk (a recursively self-improving superintelligence expanding at the speed of light), but could potentially avert other scenarios like AI-engineered pandemics or AI-triggered nuclear winter.
Counterfactual Impact and Power-Seeking
It worries me that many of the most promising theories of impact for alignment end up with the structure “acquire power, then use it for good”.
This seems to be a result of the counterfactual impact framing and a bias towards simple plans. You are a tiny agent in an unfathomably large world, trying to intervene on what may be the biggest event in human history. If you try to generate stories where you have a clear, simple counterfactual impact, most of them will involve power-seeking for the usual instrumental convergence reasons. Power-seeking might be necessary sometimes, but it seems extremely dangerous as a general attitude; ironically human power-seeking is one of the key drivers of AI x-risk to begin with. Benjamin Ross Hoffman writes beautifully about this problem in Against responsibility.
I don’t have any good solutions, other than a general bias away from power-seeking strategies and towards strategies involving cooperation, dealism, and reducing transaction costs. I think the pivotal act framing is particularly dangerous, and aiming to delay existential catastrophe rather than preventing it completely is a better policy for most actors.
Thinking about meta strategies is also a useful antidote. For any endgame story where you perform decisive intervention X, you can generate a modified story in which you “assist someone in performing X” or “research possible consequences of X” or “create a social context in which more people are trying to do X” or “build a for-profit company that is incentivized to do more X” or “use AI to do X better”. Or just give someone doing X a hug, a smile, a word of encouragement. Any specific story like this is unlikely to prove decisive; but summing up over all the possible stories, the majority of your expected impact will come from such indirect actions.
A final note of epistemic caution. This post illustrates the breadth of possible interventions that could avert AI x-risk, but it is very far from exhaustive. The world is much bigger and weirder than our minds can comprehend. There are decisive interventions lurking in all sorts of unexpected places. The real history of AI risk in the 21st century, if and when it is written, will be far stranger than any story.
Thanks to Jungwon Byun, Andreas Stuhlmuller, Todor Markov, and Anna Wang for feedback on a draft.
[1] The story is most directly inspired by Ajeya’s takeover post, but meant to cover most AI x-risk stories including What failure looks like, AGI Ruin: A List of Lethalities, and most multipolar failures. It’s also mostly agnostic to timelines and takeoff speeds.
[2] I revisit this assumption later in the essay, but I think it is analytically useful for two reasons. First, any plan that leads to true existential security will need to have an answer for how to avert this specific Magma catastrophe, so much of the analysis will transfer over. Second, achieving existential security or building friendly AGI may simply not be possible, and all we can do is tread water and delay catastrophe a few years at a time. Cryptography is like this: we haven’t found any perfect ways to do encryption and may never, but we can chain together enough kludges that extremely secure communication is possible most of the time.
[3] From the OpenAI charter: “if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be ‘a better-than-even chance of success in the next two years.’”
I’ve been thinking along similar lines recently. A possible path to AI safety that I’ve been thinking about extends upon this:
Technological Attractor: Off-the-shelf subsystems
One possible tech-tree path is that we start building custom silicon to implement certain subsystems in an AI agent. These components would be analogous to functional neural regions of the human brain such as the motor cortex, visual system, etc. -- the key hypothesis being that once we reach a certain level of model complexity, the benefits from training a model end-to-end are not worth the costs of re-learning all of these fundamental structures, and furthermore that you can get much better performance-per-cost by casting these modular, reusable components onto an ASIC. This could be a more feasible way of achieving something like Microscope AI.
Given a few such low-level components, we could enter a technological attractor where getting SOTA through another approach requires either i) throwing multiple OOM more compute than previous SOTA at re-training your own copy of these components in an end-to-end model, or ii) building your own microchip fab to implement your custom component design. Both of these could be high enough barriers that in practice the market participants simply use the off-the-shelf components. And in this attractor more R&D goes into building better-performing and higher-abstraction components that can be combined arbitrarily for different tasks.
Given such off-the-shelf components, say, a standard visual cortex for image recognition, or a 3D spatial reasoning system which can be connected atop a visual cortex, or a generic working memory, etc., it becomes much more tractable to engage in interpretability work that can convince us of the safety of these components. That then just leaves the problem of proving the highest-level network is safe (i.e. whatever is trained to use these components). This is by no means trivial, but for example the ELK paper spent a lot of time worrying about whether you can even trust the basic ontology of an agent, and I think this path would make Ontology Identification tractable. (Of course, this by construction excludes a number of possible agent-configurations).
Now, it’s possible one could plug in these components and still accidentally have the trained agent learn its own replacement visual cortex that is deceptive and not interpretable, but I think that we can get an intuition for when this might happen. If the best off-the-shelf visual cortex module has complexity on the order of N parameters, and our new model that stitches together these lower-level modules has N parameters or less, then it seems implausible that it has somehow learned a new visual cortex network that is much more compact/efficient than the previous state of the art, and increasingly implausible the more resources are devoted to optimizing these components. (It would be like someone coming along today and claiming they had achieved a 10x improvement in CPU clock speed; given the huge research budgets and the historically gradual capability increases, we don’t expect this to happen.) On the other hand, we should be suspicious of an N×10^2-parameter model where we are unable to interpret what 10% of the network is doing; that’s easily enough complexity to hide its own unsafe re-implementation of our components. (I’m aware that there’s a lot of ground in between these two points; I’m merely trying to illustrate that there is “likely safe” and “likely unsafe” ground, rather than claim exactly how big each is.)
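The arithmetic behind the two cases above can be sketched in a few lines of Python. This is only a back-of-the-envelope check. The function name `hidden_capacity_ratio`, the value chosen for N, and the 10% uninterpreted share applied to the thin model are assumptions made for illustration, and parameter count is a crude proxy for what a network could hide.

```python
def hidden_capacity_ratio(total_params, uninterpreted_percent, component_params):
    """Uninterpreted parameters, expressed as a multiple of one component's size."""
    uninterpreted = total_params * uninterpreted_percent / 100
    return uninterpreted / component_params


N = 1_000_000  # hypothetical size of the best off-the-shelf visual cortex

# Thin top layer: about N parameters in total, 10% of them not yet interpreted.
thin = hidden_capacity_ratio(N, 10, N)

# Thick model: about a hundred times N parameters, 10% of them not yet interpreted.
thick = hidden_capacity_ratio(N * 10**2, 10, N)

print(f"thin top layer: uninterpreted part is {thin:.1f}x a component")
print(f"thick model:    uninterpreted part is {thick:.1f}x a component")
```

In the thick case the uninterpreted region is ten times the size of a whole component, so it has room for a re-implementation. In the thin case it is a tenth of one, so a hidden replacement would have to be far more compact than the state of the art.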
The final step here is the shakiest. It’s not clear to me that we can keep the “top layer” (the actual network that is stitching together the low-level components; perhaps the Neocortex, by analogy to human neural architecture?) thin enough to be obviously not learning its own unsafe component-replacements. However, I think this framework at least paints a picture of a “known safe” or at least “likely safe” path to AGI; if we see that the practical engineering and economic decisions produce thin top-layer models using thick component layers, then we can devote energy to proving the components are safe/interpretable by construction, and exploring the interpretation of the top-level networks that consume the lower-level components. AGI “neurobiology” will be much more tractable if the “neural architecture” is relatively standardized. And so, this could be a good place to provide an early nudge to tip the system into this attractor; heavy investment into research on componentized NN architectures could be viewed as “gain of function” research, but it could also have a much safer end-point.
Another way of thinking about this is that by crystalizing at least some parts of the AGI’s network into slowly-changing structures, we allow time to thoroughly test those parts. It seems very hard to thoroughly test models for safety in a paradigm where the whole model is potentially retrained regularly.
Interesting, I haven’t seen anyone write about hardware-enabled attractor states but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive. An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.
One other thought after considering this a bit more—we could test this now using software submodules. It’s unlikely to perform better (since no hardware speedup) but it could shed light on the tradeoffs with the general approach. And as these submodules got more complex, it may eventually be beneficial to use this approach even in a pure-software (no hardware) paradigm, if it lets you skip retraining a bunch of common functionality.
I.e. if you train a sub-network for one task, then incorporate that in two distinct top-layer networks trained on different high-level goals, do you get savings by not having to train two “visual cortexes”?
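As a rough way to frame the question, the potential saving can be counted in trainable parameters. The sketch below is hypothetical: `trainable_params` and the sizes used are invented for illustration, and it assumes the shared sub-network is trained once and then frozen, with parameter count standing in for training cost.

```python
def trainable_params(shared_params, head_params, num_tasks, reuse_shared):
    """Total parameters that must be trained across `num_tasks` top-layer networks.

    If `reuse_shared` is True, the shared sub-network is trained once and frozen.
    Otherwise every task trains its own copy end-to-end.
    """
    if reuse_shared:
        return shared_params + num_tasks * head_params
    return num_tasks * (shared_params + head_params)


VISUAL_CORTEX = 1_000_000  # hypothetical size of the shared sub-network
TOP_LAYER = 50_000         # hypothetical size of each task-specific top layer

end_to_end = trainable_params(VISUAL_CORTEX, TOP_LAYER, 2, reuse_shared=False)
modular = trainable_params(VISUAL_CORTEX, TOP_LAYER, 2, reuse_shared=True)
print(end_to_end, modular, end_to_end - modular)
```

On this count the saving is one copy of the shared sub-network for each task beyond the first. Whether that survives the loss in task performance from freezing the shared weights is the empirical question.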
This is in a similar vein to Google’s foundation models, where they train one jumbo model that then gets specialized for each use case. Can that foundation model be modularized? (Maybe for relatively narrow use cases like “text comprehension” it’s actually reasonable to think of a foundation model as a single submodule, but I think they are quite broad right now.) The big difference, I think, is that all the weights are mutable in the “refine the foundation model” step?
Perhaps another concrete proposal for a technological attractor would be to build a SOTA foundation model and make that so good that the community uses it instead of training their own, and then that would also give a slower-moving architecture/target to interpret.
We need to test designs, and most specifically alignment designs, but giving up retraining (i.e. lifetime learning) and burning circuits into silicon is unlikely to be competitive; it would be throwing out the baby with the bathwater.
Also, whether a design is aligned or not isn’t the type of question one can answer by analyzing the agent’s visual cortex; it’s almost purely a function of what is steering the planning system.
Would you care to flesh this assertion out a bit more?
To be clear I’m not suggesting that this is optimal now. Merely speculating that there might be a point between now and AGI where the work to train these sub components becomes so substantial that it becomes economical to modularize.
As I mentioned earlier in my post, I was alluding to the ELK paper with that reference, specifically Ontology Identification. Obviously you’d need higher-order components too. Like I said, I am imagining here that the majority of the model is “off the shelf”, and just a thin layer is use-case-specific.
To make this more explicit, if you had not only off-the-shelf visual cortex, but also spatio-temporal reasoning modules built atop (as the human brain does), then you could point your debugger at the contents of that module and understand what entities in space were being perceived at what time. And the mapping of “high level strategies” to “low level entities” would be a per-model bit of interpretability work, but should become more tractable to the extent that those low level entities are already mapped and understood.
So for the explicit problem that the ELK paper was trying to solve, if you are confident you know what underlying representation SmartVault is using, it’s much easier to interpret its higher-level actions/strategies.
So:
1. DL based AGI is arriving soonish
2. DL based AGI raised in the right social environments will automatically learn efficient models of external agent values (and empowerment bounds thereof)
3. The main challenge then is locating the learned representation of external agent values and wiring/grounding it up to the agent’s core utility function (which is initially unsafe: self-motivated empowerment etc.), and timing that transition replacement carefully
4. Evolution also solved both alignment and the circuit grounding problem; we can improve on those solutions (proxy matching)
5. We can iterate safely on 3 in well constructed sandbox sims
Ideally as we approach AGI there would be cooperative standardization on alignment benchmarks and all the major players would subject their designs to extensive testing in sandbox sims. Hopefully 1-5 will become increasingly self-evident and influence ‘Magma’. If not, some other org (perhaps a decentralized system) could hopefully beat Magma to the finish line. Alignment need not have much additional cost: it doesn’t require additional runtime computations, it doesn’t require much additional training cost, and with the ideal test environments it hopefully doesn’t have much of a research iteration penalty (as the training environments can simultaneously test for intelligence and alignment).
This is why AI risk is so high, in a nutshell.
Yet unlike this post (or Benjamin Ross Hoffman’s post), I think this was a sad, but crucially necessary decision. I think the option you propose is at least partially a fabricated option. I think a lot of the reason is that people dearly want there to be a better option, even if it’s not there.
Link to fabricated options:
https://www.lesswrong.com/posts/gNodQGNoPDjztasbh/lies-damn-lies-and-fabricated-options
Fabricated options are products of incoherent thinking; what is the incoherence you’re pointing out with policies that aim to delay existential catastrophe or reduce transaction costs between existing power centers?
I think the fabricated option here is just supporting the companies making AI. My view is that, by default, capitalist incentives kill us all by boosting AI capabilities while doing approximately zero AI safety work. In particular, work on deceptive alignment would not get investment, despite this being the majority of the risk.
One of the most important points for AGI safety is that the leader in AGI needs a lot of breathing space and a large lead over its competitors, and I think this needs to be done semi-unilaterally by an organization that does not have capitalist incentives, because all the incentives point towards ever-faster AGI capabilities, not towards slowing down. That’s why I think your options are fabricated: they assume unrealistically good incentives to do what you want.
I don’t mean to suggest “just supporting the companies” is a good strategy, but there are promising non-power-seeking strategies like “improve collaboration between the leading AI labs” that I think are worth biasing towards.
Maybe the crux is how strongly capitalist incentives bind AI lab behavior. I think none of the currently leading AI labs (OpenAI, DeepMind, Google Brain) are actually so tightly bound by capitalist incentives that their leaders couldn’t delay AI system deployment by at least a few months, and probably more like several years, before capitalist incentives in the form of shareholder lawsuits or new entrants that poach their key technical staff have a chance to materialize.
This is the crux, thank you for identifying it.
Yeah, I’m fairly pessimistic about the several-year figure, since I don’t think these labs are that special in resisting capitalist nudges and incentives.
And yeah I’m laughing because unless the alignment/safety teams control what capabilities are added, then I do not expect the capabilities teams to stop, because they won’t get paid for that.