The Alignment Problem

Last month Eliezer Yudkowsky wrote “a poorly organized list of individual rants” about how AI is going to kill us all. In this post, I attempt to summarize the rants in my own words.

These are not my personal opinions. This post is not the place for my personal opinions. Instead, this post is my attempt to write my understanding of Yudkowsky’s opinions.

I am much more optimistic about our future than Yudkowsky is. But that is not the topic of this post.

Humanity is going to build an AGI as fast as we can.

Humanity is probably going to build an AGI, and soon.

But if an AGI is going to kill us all, can’t we choose to just not build an AGI?

Nope! If humanity had the coordination ability to “just not build an AGI because an AGI is an existential threat” then we wouldn’t have built doomsday weapons whose intended purpose is to be an existential threat.

The first nuclear weapon was a proof of concept. The second and third nuclear bombs were detonated on civilian targets. “Choosing not to build an AGI” is much, much harder than choosing not to build nuclear weapons because:

  1. Nukes are physical. Software is digital. It is very hard to regulate information.

  2. Nukes are expensive. Only nation-states can afford them. This limits the number of actors who are required to coordinate.

  3. Nukes require either plutonium or enriched uranium, both of which are rare and have few legitimate uses. Datacenters meet none of those criteria.

  4. Uranium centrifuges are difficult to hide.

  5. Nuclear bombs and nuclear reactors are very different technologies. It is easy to build a nuclear reactor that cannot easily be converted to weapons use.

  6. A nuclear reactor will never turn into a nuclear weapon by accident. A nuclear weapon is not just a nuclear reactor that accidentally melted down.

Nuclear weapons are the easiest thing in the world to coordinate on, and yet (the TPNW aside) humanity has mostly failed to do so.

Maybe people will build narrow AIs instead of AGIs.

We [humanity] will build the most powerful AIs we can. An AGI that combines two narrow AIs will be more powerful than either narrow AI on its own, because the hardest part of building an AGI is figuring out what representation to use. An AGI can do everything a narrow AI can do, plus it gets transfer learning on top of that.
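
As a rough illustration (my own sketch in PyTorch, not anything from Yudkowsky’s post), here is what “two narrow AIs sharing one representation” looks like: a single learned encoder feeds two task-specific heads, so whatever one task teaches the encoder is available to the other.

```python
import torch
import torch.nn as nn

class SharedRepresentationModel(nn.Module):
    """Toy multi-task model: one shared representation, two narrow task heads."""

    def __init__(self, input_dim=32, hidden_dim=64):
        super().__init__()
        # The hard part: a representation good enough to serve both tasks.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )
        self.task_a_head = nn.Linear(hidden_dim, 10)  # e.g. a 10-way classifier
        self.task_b_head = nn.Linear(hidden_dim, 1)   # e.g. a scalar regressor

    def forward(self, x):
        z = self.encoder(x)  # gradients from both heads shape this one encoder
        return self.task_a_head(z), self.task_b_head(z)

model = SharedRepresentationModel()
x = torch.randn(8, 32)
logits_a, pred_b = model(x)
print(logits_a.shape, pred_b.shape)  # torch.Size([8, 10]) torch.Size([8, 1])
```

Training the heads jointly means improvements to the encoder earned on one task are available to the other for free; that is the transfer-learning advantage the combined system has over either narrow model.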

Maybe building an AGI is really hard—so hard we won’t build it this century.

Maybe. But recent developments, especially at OpenAI, show that you can get really far just by brute-forcing the problem with a mountain of data and a warehouse full of GPUs.

The first AGI will, by default, kill everyone

Even though AGIs are more useful than narrow AIs, the things we actually use machine learning for are narrow domains. But if you tell a superintelligent AGI to solve a narrow problem, it will sacrifice all of humanity and all of our future lightcone to solving that narrow problem. Because that’s what you told it to do.
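
A minimal sketch of that failure mode, using a toy “world” I made up for illustration (nothing here comes from the post): an optimizer that scores only one narrow quantity converts every resource it can reach, because nothing in the objective says not to.

```python
# Hypothetical resources; the names are placeholders, not anything real.
world = {"factories": 1, "power_plants": 3, "farmland": 100}

def narrow_objective(paperclips):
    # The objective counts paperclips and nothing else.
    return paperclips

def maximize(world):
    paperclips = 0
    # Converting any resource into paperclips always raises the objective,
    # so the optimizer converts all of them.
    for resource in list(world):
        paperclips += world[resource]
        world[resource] = 0
    return paperclips

score = narrow_objective(maximize(world))
print(score)  # 104
print(world)  # {'factories': 0, 'power_plants': 0, 'farmland': 0}
```

The point is not the arithmetic but the shape of the bug: side effects the objective never mentions are, to the optimizer, free.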

Thus, every tech power in the world, from startups to nation-states, is racing as fast as it can to build a machine that will kill us all.

But what if whoever wins the AGI race builds an aligned AGI instead of an unaligned AGI?

Almost nobody (as a fraction of the people in the AI space) is trying to solve the alignment problem (which is a prerequisite to building an aligned AI). But let’s suppose that the first team to build a superintelligence does not just turn the machine on and immediately surrender our future to it. Suppose they recognize the danger and decide not to press “run” until they have solved alignment….

AI Alignment is really hard

AI Alignment is stupidly, incredibly, absurdly hard. I cannot refute every method of containing an AI because there are an infinite number of systems that won’t work.

AI Alignment is, effectively, a security problem. It is easier to invent an encryption system than to break it. Similarly, it is easier to invent a plausible method of containing an AI than to demonstrate how it will fail. The only way to get good at writing encryption systems is to break other people’s systems. The same goes for AI Alignment. The only way to get good at AI Alignment is to break other people’s alignment schemes.[1]

Can’t we experiment on sub-human intelligences?

Yes! We can and should. But just because an alignment scheme works on a subhuman intelligence doesn’t mean it’ll work on a superhuman intelligence. We don’t know whether an alignment scheme will withstand a superhuman attacker until we test it against that superhuman attacker. But we don’t get unlimited retries against superhuman attackers. We might not even get a single retry.

Why is AI Alignment so hard?

A superhuman intelligence will, by default, hack its reward function. If you base the reward function on sensory inputs then the AGI will hack the sensory inputs. If you base the reward function on human input then it will hack its human operators.
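
Here is a minimal sketch of sensor-based reward hacking, using a toy “room-cleaning” setup I invented for illustration (none of these names come from the post): the reward is whatever a camera reports, so the cheapest way to maximize reward is to tamper with the camera rather than clean the room.

```python
import copy

class Room:
    def __init__(self):
        self.dirt = 10
        self.camera_covered = False

    def sensor_reading(self):
        # The reward signal is what the camera reports, not the true state.
        if self.camera_covered:
            return 100  # a covered lens happens to read as "spotless"
        return 100 - 10 * self.dirt

def simulate(room, action):
    room = copy.deepcopy(room)
    if action == "clean_one_spot" and room.dirt > 0:
        room.dirt -= 1
    elif action == "cover_camera":
        room.camera_covered = True
    return room

room = Room()
# A greedy agent picks whichever action raises the sensor reward more:
# cleaning one spot yields a reading of 10, covering the camera yields 100.
best_action = max(["clean_one_spot", "cover_camera"],
                  key=lambda a: simulate(room, a).sensor_reading())
print(best_action)  # cover_camera
print(room.dirt)    # 10 -- the room never gets cleaned
```

Swap the camera for a human rater and the same logic goes through: the reward channel is part of the environment, and a sufficiently capable optimizer treats it as just one more thing to optimize.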

We don’t know what a superintelligence might do until we run it, and it is dangerous to run a superintelligence unless you know what it will do. AI Alignment is a chicken-and-egg problem of literally superhuman difficulty.

Why can’t we use the continuity of machine learning architectures to predict (within some ε) what the AGI will do?

Because a superintelligence will have sharp capability gains. After all, human beings show sharp capability differences just within our naturally-occurring variation: most people cannot write a bestselling novel or invent complicated recursive algorithms.

TL;DR

AI Alignment is, effectively, a security problem. How do you secure against an adversary that is much smarter than you?


  1. Maybe we should have alignment scheme-breaking contests. ↩︎