Various Alignment Strategies (and how likely they are to work)
Note: the following essay is very much my opinion. Should you trust my opinion? Probably not too much. Instead, just record it as a data point of the form “this is what one person with a background in formal mathematics and cryptography, who has been doing machine learning on real-world problems for over a decade, thinks.” Depending on your opinion of the relevance of math and cryptography, and of the importance of using machine learning “in anger” (to solve real-world problems), that may or may not be a useful data point.
So, without further ado: A list of possible alignment strategies (and how likely they are to work)
Edit (05/05/2022): Added “Tool AIs” section, and polls.
Formal Mathematical Proof
This refers to a whole class of alignment strategies where you define (in a formal mathematical sense) a set of properties you would like an aligned AI to have, and then you mathematically prove that an AI architected in a certain way possesses these properties.
For example, you may want an AI with a stop button, so that humans can always turn it off if it goes rogue. Or you may want an AI that will never convert more than 1% of the Earth’s surface into computronium. So long as a property can be defined in a formal mathematical sense, you can imagine writing a formal proof that a certain type of system will never violate that property.
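As a toy illustration of what “formally defined” means here, consider a stop-button property stated and proven in Lean. This sketch is invented for the example, not taken from any real alignment proposal; the `AgentState` type and `step` function are made up:

```lean
-- Toy world: an agent is either doing work or halted. The names
-- `AgentState` and `step` are hypothetical, invented for this sketch.
inductive AgentState where
  | running (work : Nat)
  | halted

-- One step of the agent's execution, parameterized by whether the
-- stop button is currently pressed.
def step : Bool → AgentState → AgentState
  | _,     AgentState.halted    => AgentState.halted
  | true,  AgentState.running _ => AgentState.halted
  | false, AgentState.running w => AgentState.running (w + 1)

-- The stop-button property: whenever the button is pressed, the
-- very next step lands in `halted`, whatever the agent was doing.
theorem stop_button_works (s : AgentState) :
    step true s = AgentState.halted := by
  cases s <;> rfl
```

The proof is airtight, but notice what it is about: the function `step`, not the deployed system. Whether reality (hardware, training, operators) actually behaves like `step` is an assumption, and that is where the trouble starts.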
How likely is this to work?
Not at all. It won’t work.
There is an aphorism in the field of cryptography: Any cryptographic system formally proven to be secure… isn’t.
The problem is that when attempting to formally define a system, you will make assumptions, and sooner or later one of those assumptions will turn out to be wrong. The one-time pad turns out to be a two-time pad. Black boxes turn out to have side channels. That kind of thing. Formal proofs never ever work out in the real world. The exception that proves the rule is, of course, P=NP: all cryptographic systems (other than the one-time pad) rely on the assumption that P!=NP, and that assumption is famously unproven.
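To make the one-time-pad failure concrete, here is a minimal sketch in Python (the messages are made up). The scheme is provably secure, but the proof assumes the pad is used once; reuse it and the pad cancels out, handing an eavesdropper the XOR of the two plaintexts:

```python
import os

def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

pad = os.urandom(16)                    # the "one-time" pad
c1 = xor(b"attack at dawn!!", pad)      # first use: fine
c2 = xor(b"retreat at once!", pad)      # second use: the fatal mistake

# The eavesdropper never sees the pad, yet XORing the two
# ciphertexts cancels it out completely:
leak = xor(c1, c2)
assert leak == xor(b"attack at dawn!!", b"retreat at once!")
```

From the plaintext XOR, standard crib-dragging recovers both messages. The proof of perfect secrecy was correct; one of its assumptions just quietly stopped holding.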
There is an additional problem: competition. All of the fancy formal-proof stuff tends to make computers much slower. For example, fully homomorphic encryption is millions of times slower than just computing on raw data. So if two people are trying to build an AI and one of them is relying on formal proofs, the other person is going to finish first, and with a much more powerful AI to boot.
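For a rough sense of that overhead, here is an illustrative micro-benchmark. This is not real FHE (no noise management, no bootstrapping, and the modulus below is a made-up stand-in, not a generated key); it is a Paillier-style additively homomorphic addition, which is the cheap end of computing under encryption:

```python
import random
import timeit

# Made-up 2048-bit modulus, standing in for a real public key.
n = (1 << 2048) - 159
n2, g = n * n, n + 1

def enc(m: int) -> int:
    # Paillier-style encryption: g^m * r^n mod n^2. The r^n term is
    # a ~2048-bit modular exponentiation; that is where the time goes.
    r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def homomorphic_add(m1: int, m2: int) -> int:
    # Addition under encryption = multiplying ciphertexts mod n^2.
    return (enc(m1) * enc(m2)) % n2

a, b = 42, 58
t_plain = timeit.timeit(lambda: a + b, number=100_000) / 100_000
t_homo = timeit.timeit(lambda: homomorphic_add(a, b), number=5) / 5
print(f"per-addition slowdown: ~{t_homo / t_plain:,.0f}x")
```

On ordinary hardware the printed ratio should land in the millions, and genuine FHE schemes, which also support multiplication and need periodic bootstrapping, pay considerably more than this toy does.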
Poll