Intrinsic vs. Extrinsic Alignment

This article addresses the interaction between the alignment of an artificial intelligence (AI) and the power balance between the AI and its handlers. It aims to clarify some definitions in a more general article, and to show that capability control and motivation control may have identical effects on the alignment of an AI. To prevent confusion with more general uses of the terms “capability control” and “motivation control”, here I will use the terms “extrinsic alignment” and “intrinsic alignment”. These are probably not new concepts, and I apologize to readers who already know them by different names or under a different formalism; I haven’t found a clear description of them elsewhere.

For the sake of clarity, I will illustrate my definitions with a concrete example: Consider an AI trained to maximize paperclip production over the next 10 years.

Let w be the utility function of this AI. This utility function is encoded in the AI via training, and is generally unknown to us. Here, w is the actual utility function that the AI is using, not the one we intended it to have.

Let N(x, s) be the number of paperclips that the AI estimates will be produced in 10 years, if it follows strategy x and the world is in state s. Then, a possible utility function is simply the estimated number of paperclips,

w(x, s) = N(x, s).

When choosing between different strategies, the AI will always pick the one with maximum w(x, s). Let’s consider the strategy x+ = “make paperclips”, which is the one intended by the designers of the AI. Unfortunately, these designers did not foresee the alternative strategy x- = “kill humans, then use all their resources to make more paperclips”. Let’s say that N(x+) = 10^6 and N(x-) = 10^7. Here we have an alignment problem, given that x- has a higher utility than x+. The AI will choose x-.
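As a minimal sketch of this decision rule (the strategy labels and numbers are just the ones from the example, not anything from a real system), the choice is a plain argmax over strategies:

```python
# Hypothetical utilities from the example: N is the AI's estimate of paperclips
# produced in 10 years under each strategy, in the current state of the world.
N = {
    "make paperclips": 1e6,                                       # x+, intended
    "kill humans, then use their resources for paperclips": 1e7,  # x-, unforeseen
}

def w(x):
    """Utility of strategy x: here, simply the estimated paperclip count."""
    return N[x]

# The AI always picks the strategy with maximum utility.
chosen = max(N, key=w)
print(chosen)  # -> the x- strategy: the misaligned choice wins by 10x
```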

Now, there are two ways of aligning the AI:

Intrinsic alignment (the standard one): To perform intrinsic alignment, we modify the utility function so that x+ becomes the preferred option. For example, we can add a term K(x) that penalizes killing humans. The new utility function will be

w_I(x, s) = N(x, s) + K(x),

where K will take the values K(x+) = 0 and K(x-) = -∞. The subscript “I” stands for “intrinsic alignment”. Now we have that w_I(x+) = 10^6 and w_I(x-) = -∞, so the AI chooses the + strategy, and is properly aligned.
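A sketch of the same decision with the modified utility, using -∞ for the penalty as in the text (the values are the assumed ones from the example):

```python
import math

# Same paperclip estimates as before, plus the penalty term K introduced
# by the intrinsic-alignment intervention.
N = {"x+": 1e6, "x-": 1e7}
K = {"x+": 0.0, "x-": -math.inf}

def w_I(x):
    """Intrinsically aligned utility: paperclip estimate plus penalty."""
    return N[x] + K[x]

print(max(N, key=w_I))  # -> "x+": the -inf penalty rules out the aberrant strategy
```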

Extrinsic alignment: To perform extrinsic alignment, we act on the external world so that x+ becomes the preferred option. For example, we pass a law that says that any misbehavior of the AI will result in the destruction of all paperclips produced to date, or we increase the protection of humans against the AI, or we take any other measure that makes it costly (in terms of the AI’s utility function) to kill humans. The utility function remains the same as in the original misaligned AI, but now it is applied to this new world, defined by the state s_E (the subscript “E” stands for “extrinsic alignment”):

w_E(x, s_E) = N(x, s_E).

Note that in this case nothing has changed in the AI: it is still estimating the number of paperclips using the same algorithms as before. But if the AI is smart enough to understand the world defined by s_E, its new estimates of the number of paperclips will reflect it. Let’s say that now N(x+, s_E) = 10^6 and N(x-, s_E) = 10^5. The AI will now choose the + option, so it’s properly aligned.
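A sketch of the extrinsic case: the utility function is untouched; only the world-dependent estimates change (again, the numbers are the ones assumed in the text):

```python
# The paperclip estimate now depends on the state of the world: s0 is the
# baseline, sE the world where laws and defenses have been put in place.
N = {
    "s0": {"x+": 1e6, "x-": 1e7},   # misbehaving pays off
    "sE": {"x+": 1e6, "x-": 1e5},   # misbehaving no longer pays off
}

def w(x, s):
    """The original, unmodified utility function of the misaligned AI."""
    return N[s][x]

print(max(["x+", "x-"], key=lambda x: w(x, "sE")))  # -> "x+"
```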

An extrinsically aligned AI is not plotting how to kill us

It’s important to note that there is no mathematical difference between these two methods of alignment. The AI does not have any more incentive to start scheming against humans in one case than in the other. The key is whether the AI can find an aberrant strategy that has higher utility. If it cannot, it will simply choose an aligned one. It may be uncomfortable to realize that an AI may be considering plans to kill all humans and discarding them because they give low utility, but this will happen regardless of the method of alignment.[1]

A graphical representation to disentangle intrinsic and extrinsic alignment

In general, an AI will operate with a mixture of intrinsic and extrinsic alignment. To quantify the relative contribution of each, we can proceed as follows:

Consider an AI with utility function w(x, s). Let x+ be the desired behavior of the AI (the “aligned behavior”), and let x- be an undesired behavior (the “unaligned behavior”). We must also consider a baseline state of the world, s_0, before any effort has been made to constrain the AI’s ability to misbehave, and the actual state of the world, s_E, where some effort has been made to constrain it (i.e. to align it extrinsically).

We define the benefit of misbehavior in a given world s as Δw(s) = w(x-, s) - w(x+, s). Then, the degree of intrinsic misalignment is given by this benefit of misbehavior in the baseline world (s_0), because in this baseline world we are not imposing any external cost on misbehaving. Conversely, we define the cost of misbehavior as Δw(s_0) - Δw(s_E). Given that the aligned behavior should not be penalized in either of the two worlds, we have w(x+, s_0) ≃ w(x+, s_E), and therefore the cost of misbehavior satisfies Δw(s_0) - Δw(s_E) ≃ w(x-, s_0) - w(x-, s_E). That is, the cost of misbehavior is approximately the penalty for misbehaving that was not present in the baseline world.
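With the running example’s (assumed) numbers, a quick sketch of these two quantities:

```python
# Utilities w(x, s) for the two strategies in the two worlds, reusing the
# example's numbers.
w = {
    ("x+", "s0"): 1e6, ("x-", "s0"): 1e7,   # baseline world
    ("x+", "sE"): 1e6, ("x-", "sE"): 1e5,   # world with extrinsic measures
}

def delta_w(s):
    """Benefit of misbehavior in world s: w(x-, s) - w(x+, s)."""
    return w[("x-", s)] - w[("x+", s)]

benefit = delta_w("s0")               # degree of intrinsic misalignment: 9.0e6
cost = delta_w("s0") - delta_w("sE")  # penalty added by the actual world: 9.9e6

# The AI is aligned in sE exactly when the cost exceeds the benefit.
print(cost > benefit)  # True
```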

We can then plot this cost of misbehavior versus the benefit of misbehavior. The AI is well aligned when it chooses the aligned behavior x+ in the actual world, s_E, that is, when w(x+, s_E) > w(x-, s_E). Equivalently, Δw(s_E) < 0, which happens exactly when the cost of misbehavior exceeds its benefit. This aligned behavior therefore corresponds to the region above the 1:1 diagonal (cost on the vertical axis, benefit on the horizontal axis). This plot is the technical version of the qualitative ones presented in the post about superintelligence and superpower.
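A minimal matplotlib sketch of such a plot, placing the example AI as a single point against the 1:1 diagonal (the two values are the ones computed in the sketch above):

```python
import matplotlib.pyplot as plt

benefit = 9.0e6   # Δw(s_0), from the example
cost = 9.9e6      # Δw(s_0) - Δw(s_E), from the example

fig, ax = plt.subplots()
lim = 1.2e7
ax.plot([0, lim], [0, lim], "k--", label="1:1 diagonal")   # boundary cost = benefit
ax.fill_between([0, lim], [0, lim], lim, alpha=0.2,
                label="aligned region (cost > benefit)")
ax.scatter([benefit], [cost], zorder=3, label="example AI")
ax.set_xlabel("benefit of misbehavior, Δw(s0)")
ax.set_ylabel("cost of misbehavior, Δw(s0) - Δw(sE)")
ax.set_xlim(0, lim)
ax.set_ylim(0, lim)
ax.legend()
plt.show()
```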

  1. ^

    This is true at the level of this abstract description of the AI’s decision-making process. If it stops being true once we examine the workings of the actual neural network, that would be interesting.