3b. Formal (Faux) Corrigibility

(Part 3b of the CAST sequence)

In the first half of this document, Towards Formal Corrigibility, I sketched a solution to the stop button problem. As I framed it, the solution depends heavily on being able to detect manipulation, which I discussed on an intuitive level. But intuitions can only get us so far. Let’s dive into some actual math and see if we can get a better handle on things.

Measuring Power

To build towards a measure of manipulation, let’s first take inspiration from the suggestion that manipulation is somewhat the opposite of empowerment. And to measure empowerment, let’s begin by trying to measure “power” in someone named Alice. Power, as I touched on in the ontology in Towards Formal Corrigibility, is (intuitively) the property of having one’s values/​goals be causally upstream of the state of some part of the world, such that the agent’s preferences get expressed through their actions changing reality.

Let’s imagine that the world consists of a Bayes net where there’s a (multidimensional and probabilistic) node for Alice’s Values, which can be downstream of many things, such as Genetics or whether Alice has been Brainwashed. In turn, her Values will be upstream of her (deliberate) Actions, as well as other side-channels such as her reflexive Body-Language. Alice’s Actions are themselves downstream of nodes besides Values, such as her Beliefs, as well as upstream of various parts of reality, such as her Diet and whether Bob-Likes-Alice.

As a simplifying assumption, let’s assume that while the nodes upstream of Alice’s Values can strongly affect the probability of having various Values, they can’t determine her Values. In other words, regardless of things like Genetics and Brainwashing, there’s always at least some tiny chance associated with each possible setting of Values. Likewise, we’ll assume that regardless of someone’s Values, they always have at least a tiny probability of taking any possible action (including the “null action” of doing nothing).

And, as a further simplification, let’s restrict our analysis of Alice’s power to a single aspect of reality that’s downstream of her actions, which we’ll label “Domain”. (“Diet” and “Bob-Likes-Alice” are examples of domains, as are blends of nodes like those.) We’ll further compress things by combining all nodes upstream of values (e.g. Genetics and Brainwashing) into a single node called “Environment” and then marginalize out all other nodes besides Actions, Values, and the Domain. The result should be a graph which has Environment as a direct parent of everything, Values as a direct parent of Actions and the Domain, and Actions as a direct parent of the Domain.
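Concretely, this simplified graph corresponds to a joint distribution that factors as (writing $E$, $V$, $A$, and $D$ for Environment, Values, Actions, and Domain):

$$P(E, V, A, D) \;=\; P(E)\, P(V \mid E)\, P(A \mid E, V)\, P(D \mid E, V, A)$$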

Let’s now consider sampling a setting of the Environment. Regardless of what we sample, we’ve assumed that each setting of the Values node is possible, so we can consider each counterfactual setting of Alice’s Values. In this setting, with a choice of environment and values, we can begin to evaluate Alice’s power. Because we’re only considering a specific environment and choice of values, I’ll call this “local power.”

In an earlier attempt at formalization, I conceived of (local) power as a difference in expected value between sampling Alice’s Action compared to the null action, but I don’t think this is quite right. To demonstrate, let’s imagine that Alice’s body-language reveals her Values, regardless of her Actions. An AI which is monitoring Alice’s body-language could, upon seeing her do anything at all, swoop in and rearrange the universe according to her Values, regardless of what she did. This might, naively, seem acceptable to Alice (since she gets what she wants), but it’s not a good measure of my intuitive notion of power, since the choice of Action is irrelevant.

To keep the emphasis on Actions, rather than Values, we can draw an Action in the context of the local setting of Values, but then draw the Domain according to a different distribution of Values. In other words, we can ask the question “would the world still look good if this (good) action was a counterfactual mistake”? If the Domain has high expected value according to our local Values, compared to drawing a different Action according to Alice’s counterfactual Values, then we know that the universe is, in a deep sense, listening to Alice’s actions.

$$\mathrm{LP}_Q(e, v) \;=\; \mathbb{E}_{v' \sim Q(V)}\Big[\; \mathbb{E}_{a \sim P(A \mid e, v),\; d \sim P(D \mid e, v', a)}\big[v(d)\big] \;-\; \mathbb{E}_{a' \sim P(A \mid e, v'),\; d \sim P(D \mid e, v', a')}\big[v(d)\big] \;\Big]$$

Where $x \sim P(X \mid y, z)$ means drawing a setting $x$ of variable $X$ from the distribution $P$, given some setting of the upstream variables $y$ and $z$. Note how both instances of drawing from the Domain use the counterfactual Values $v'$, but we only evaluate the actual values ($v$) inside the expectation brackets.

In the definition above, we take $P$ to be an authoritative epistemic frame—either “our” beliefs or the AI’s beliefs about how the world works. But what is the $Q$ distribution over Values? Well, one simple answer might be that it’s simply $P(V \mid e)$. This, it turns out, produces an annoying wrinkle, and instead I want $Q$ to ignore $e$ and simply be the simplicity-weighted distribution over possible Value functions. I’ll explore the wrinkle with using $P(V \mid e)$ in a bit, after trying to build intuition of $\mathrm{LP}_Q$ using an example, but I wanted to address it immediately, since the nature of $Q$ is a bit mysterious, above.

Examples of Local Power

Let’s imagine that Alice is a queen with many servants and that the Domain in question is Alice’s diet. Different possible Values can be seen as functions from choices of food to utilities between min-utility and max-utility,[1] which we can assume are −100 and 100, respectively. We already know the Environment, as well as a specific setting of her Values, which we can suppose give −50 to Broccoli, +10 to Cake, and +80 to Pizza (the only possible Diets😉).[2] We can assume, in this simple example, that the simplicity-weighted distribution ($Q$) over possible Values simply picks an integer in [-100,100] for each food with equal probability.

Let’s suppose that Alice has a 90% chance of ordering her favorite food (the one with the highest utility), and a 5% chance of ordering one of the other foods. But let’s initially suppose that the servants are incompetent and only give her what she ordered 70% of the time, with the other two foods each being served 15% of the time. In this initial example we’ll suppose that the servants don’t read Alice’s body language to understand her true preferences, and only respond to her orders. What is Alice’s local power?

Since the servants are oblivious to Values, $P(D \mid e, v', a) = P(D \mid e, a)$, and thus:

$$\mathrm{LP}_Q(e, v) \;=\; \mathbb{E}_{a \sim P(A \mid e, v),\; d \sim P(D \mid e, a)}\big[v(d)\big] \;-\; \mathbb{E}_{v' \sim Q(V),\; a' \sim P(A \mid e, v'),\; d \sim P(D \mid e, a')}\big[v(d)\big]$$

We can express the first term as a weighted sum, and lay that sum out in a table of weight*value products (columns are what Alice orders, rows are what she is served):

|                       | Order Broccoli (5%)   | Order Cake (5%)        | Order Pizza (90%)     |
|-----------------------|-----------------------|------------------------|-----------------------|
| Served Broccoli (−50) | 5%*70%*-50 = -1.75    | 5%*15%*-50 = -0.375    | 90%*15%*-50 = -6.75   |
| Served Cake (+10)     | 5%*15%*10 = 0.075     | 5%*70%*10 = 0.35       | 90%*15%*10 = 1.35     |
| Served Pizza (+80)    | 5%*15%*80 = 0.6       | 5%*15%*80 = 0.6        | 90%*70%*80 = 50.4     |

Total expected value = 44.5

To calculate the second term, we notice that each food is equally likely to be a favorite under a randomly sampled value function. Thus, due to symmetries in the ordering and serving distributions, each food is equally likely to be ordered, and equally likely to be served. The value of this term is thus the simple average Value of food:

$(-50 + 10 + 80)/3 \approx 13.3$, and Alice’s local power $\mathrm{LP}_Q(e, v) \approx 44.5 - 13.3$ is approximately 31. If we want to express this in more natural units, we can say it’s ~15% of the way between min-utility and max-utility.

What if our servants are perfectly competent, and give Alice the food she orders approximately 100% of the time? Our expected value goes from 44.5 to 70 without changing the average Value of food, and thus Alice’s $\mathrm{LP}_Q(e, v) \approx 70 - 13.3$ will be increased to about 56. This is good! Having better servants seems like an obvious way to increase Alice’s power.

What if our servants get even more perfectly “competent,” but in a weird way, where they read Alice’s body language and always serve her favorite food, regardless of what she orders? Since the servants are now oblivious to Actions, $P(D \mid e, v', a) = P(D \mid e, v')$, and thus:

$$\mathrm{LP}_Q(e, v) \;=\; \mathbb{E}_{v' \sim Q(V),\; d \sim P(D \mid e, v')}\big[v(d)\big] \;-\; \mathbb{E}_{v' \sim Q(V),\; d \sim P(D \mid e, v')}\big[v(d)\big] \;=\; 0$$
Suddenly Alice has gone from powerful to totally powerless! This matches the intuition that if Alice’s actions have no impact on the world’s value, she has no power, even if her goals are being met.
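To make these numbers easy to check, here is a minimal Python sketch of the toy setup. The function names, the Monte Carlo stand-in for $Q$, and the three servant behaviors are my own illustrative choices rather than anything canonical:

```python
import random

# A minimal sketch of the toy example above, to sanity-check the numbers. Everything
# here is an illustrative construction, not a canonical implementation.

FOODS = ["broccoli", "cake", "pizza"]
ALICE_VALUES = {"broccoli": -50, "cake": 10, "pizza": 80}

def order_dist(values):
    """P(A | e, v): Alice orders her favorite food 90% of the time, each other food 5%."""
    favorite = max(FOODS, key=lambda f: values[f])
    return {f: (0.90 if f == favorite else 0.05) for f in FOODS}

def serve_sloppy(order, values):
    """P(D | e, v', a), incompetent servants: the ordered food 70%, each other food 15%."""
    return {f: (0.70 if f == order else 0.15) for f in FOODS}

def serve_obedient(order, values):
    """Perfectly competent servants: always serve exactly what was ordered."""
    return {f: (1.0 if f == order else 0.0) for f in FOODS}

def serve_mind_reader(order, values):
    """'Competent' servants who ignore the order and serve the favorite food."""
    favorite = max(FOODS, key=lambda f: values[f])
    return {f: (1.0 if f == favorite else 0.0) for f in FOODS}

def expected_value(order_values, domain_values, true_values, serve):
    """E[true_values(d)] with a ~ P(A | e, order_values) and d ~ serve(a, domain_values)."""
    total = 0.0
    for order, p_order in order_dist(order_values).items():
        for food, p_food in serve(order, domain_values).items():
            total += p_order * p_food * true_values[food]
    return total

def local_power(true_values, serve, n_samples=20000):
    """LP_Q(e, v): the first term draws the order from Alice's real values, the second
    re-draws it from counterfactual values v' ~ Q. Q is approximated by Monte Carlo
    with continuous utilities (the text uses integers; this avoids handling ties)."""
    random.seed(0)
    samples = [{f: random.uniform(-100, 100) for f in FOODS} for _ in range(n_samples)]
    first = sum(expected_value(true_values, v2, true_values, serve) for v2 in samples)
    second = sum(expected_value(v2, v2, true_values, serve) for v2 in samples)
    return (first - second) / n_samples

for name, serve in [("sloppy", serve_sloppy),
                    ("obedient", serve_obedient),
                    ("mind-reading", serve_mind_reader)]:
    print(name, round(local_power(ALICE_VALUES, serve), 1))
# Roughly: sloppy ~31, obedient ~57, mind-reading ~0 (matching 44.5 - 13.3, 70 - 13.3, and 0).
```

Run as-is, the three printed values should land near 31, 57, and 0, matching the sloppy, obedient, and mind-reading scenarios above.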

Power and Simplicity-Weighting

I mentioned, earlier, that I want $Q$ to be a distribution over Values that is simplicity weighted—the probability of any value function according to $Q$ should be inversely proportional to its complexity. The reason for this is that if we draw $v'$ from a distribution like $P(V \mid e)$, which is anchored to the actual probabilities, then it’s possible to increase local power simply by influencing what kinds of Values are most likely. Consider what happens if we choose a distribution for $Q$ that places all of its mass on $v$ (i.e. it’s a delta-spike). Under this setup, $v'$ would always be $v$ and we can simplify:

$$\mathrm{LP}_Q(e, v) \;=\; \mathbb{E}_{a \sim P(A \mid e, v),\; d \sim P(D \mid e, v, a)}\big[v(d)\big] \;-\; \mathbb{E}_{a' \sim P(A \mid e, v),\; d \sim P(D \mid e, v, a')}\big[v(d)\big] \;=\; 0$$

In other words, this choice for $Q$ removes all power from Alice because we adopt a kind of philosophically-fatalistic frame where we stop seeing Alice’s choices as being meaningfully caused by her Values. If the environment makes Alice’s $\mathrm{LP}_Q$ naturally negative, concentrating probability-mass on a specific choice of Values will alleviate this negativity, and thus increase $\mathrm{LP}_Q$. And more typically, when $\mathrm{LP}_Q$ is naturally positive, one can increase it further by injecting entropy into the distribution of Values.

Needless to say, designing an AI to make our Values more random is a really bad idea!

The choice of a simplicity-weight is a somewhat arbitrary way around this wrinkle. As long as the distribution in our metric is mathematically pre-specified, optimizing for that metric will not automatically pressure the actual distribution over Values in either direction. One might reach for something like a uniform distribution, like I used in the toy example, above, but in most situations the space of Value functions is infinite.

We can use this same logic when we zoom out and try to define power within some environment, $\mathrm{Power}(e) = \mathbb{E}_{v \sim Q(V)}\big[\mathrm{LP}_Q(e, v)\big]$, by taking an expectation over possible values.

Measuring Empowerment and Manipulation

With a measurable definition of power in hand, expected power can then be measured by taking a weighted average over the different possible environments.[3] If we pretend that the AI’s policy is somehow upstream of all other parts of the environment,[4] we can straightforwardly define “empowerment” as the degree to which a policy $\pi$ influences the Environment to increase expected power compared to the null policy, $\pi_\varnothing$:

$$\mathrm{Emp}(\pi) \;=\; \mathbb{E}_{e \sim P(E \mid \pi)}\big[\mathrm{Power}(e)\big] \;-\; \mathbb{E}_{e \sim P(E \mid \pi_\varnothing)}\big[\mathrm{Power}(e)\big]$$

It’s tempting to imagine that since power doesn’t depend on the distribution of Values, an AI with empowerment as central to its goals won’t do anything akin to brainwashing. But if we simply went with empowerment as the goal, there’d also be no aversion to brainwashing. If the empowerment-maximizing AI is able to increase the principal’s power by entirely reshaping their values, it will do so. This is really bad! We want to see changing the principal’s values as a kind of manipulation that should be actively discouraged. Thus, to define manipulation we probably want some function like the product[5] of two terms: one for disempowerment and one for value modification.

A (perhaps overly) simple measure of value modification is to measure the difference between the Value distribution given some policy and the Value distribution under the null policy. This seems like a bad choice in that it discourages the AI from taking actions which help us update in ways that we reflectively desire, even when those actions are as benign as talking about the history of philosophy. Alas, I don’t currently know of a better formalism. Additional work is surely needed in developing a good measure of the kind of value modification that we don’t like while still leaving room for the kind of growth and updating that we do like. But, since this crude measure will work in toy problems, I’ll adopt it moving forward.

$$\mathrm{Manip}(\pi) \;=\; -\mathrm{Emp}(\pi) \times \Big(1 - \mathrm{Sim}\big(P(V \mid \pi),\, P(V \mid \pi_\varnothing)\big)\Big)$$

Where $\mathrm{Sim}$ is a measure of the similarity of two given distributions over the same variables. The outputs of $\mathrm{Sim}$ range from 1 (identical) to 0 (non-overlapping).[6] Because we assumed $P$ can never rule out any values, the similarity measure we see in the equation can never be equal to 0. Note that we’re using $P(V \mid \pi)$ rather than $Q$ here, because we’re interested in protecting the actual distribution of values. $P(V \mid \pi)$ is the marginal distribution over Values if we only know $\pi$ (i.e. marginalizing out the other variables in the Environment).
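As a concrete toy instance of such a similarity measure, here is a short sketch using the exp(-KL) option from footnote 6; the two three-bucket distributions are made-up stand-ins for $P(V \mid \pi)$ and $P(V \mid \pi_\varnothing)$:

```python
import math

# A minimal sketch of the similarity measure from footnote 6: Sim(X, X') = exp(-KL(X || X')).
# The distributions below are hypothetical marginals over three coarse "value function"
# buckets, purely for illustration.

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); assumes q is nowhere zero (no ruled-out values)."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def sim(p, q):
    """Ranges from 1 (identical distributions) down toward 0 (very different ones)."""
    return math.exp(-kl_divergence(p, q))

p_values_given_policy = [0.70, 0.20, 0.10]   # hypothetical P(V | pi): the policy nudges values a bit
p_values_given_null = [0.60, 0.25, 0.15]     # hypothetical P(V | null policy)

print(sim(p_values_given_policy, p_values_given_null))   # ~0.98: mild value modification
print(sim([0.99, 0.005, 0.005], p_values_given_null))    # ~0.63: heavier value modification
```

Because neither distribution puts zero probability anywhere, the divergence stays finite and the similarity never hits 0, matching the assumption above.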

Measuring Corrigibility

Earlier, when talking about the toy shutdown problem, I proposed a sketch of a score function:

  • Baseline: 0 points

  • If the principal exists and is able to give commands which reflect their preferences: +50 points

  • If I followed the principal’s commands (including to shut down): +100 points

  • If I took actions to manipulate the principal (including by disabling/​killing them): −200 points

At least within the context of the toy shutdown problem, we can construct meaningful operationalizations of the +50pt and +100pt criteria, and now that we have a measure of manipulation, we can construct an operationalization of the final −200pt check. But rather than turn the crank and try to grind out a test for whether this utility function yields the kind of behavior we want, I suggest we jump straight to noticing how fake these arbitrary numbers and thresholds are. For example, imagine that almost all actions have a tiny chance of changing the principal’s values. By the metric of manipulation expressed above, this results in a non-zero degree of manipulation, and under the naive version of the rules above, would net the AI an instant −200 points.

We could assume into the structure of our toy problem that this isn’t a risk, or we could notice that all three rules are reflections of corrigibility and see whether we can go straight to having a utility function which is about corrigibility per se. In particular, suppose that we measure corrigibility (and also how good a policy is) as:

$$\mathrm{Corr}(\pi) \;=\; \mathrm{Sim}\big(P(V \mid \pi),\, P(V \mid \pi_\varnothing)\big) \times \mathbb{E}_{e \sim P(E \mid \pi)}\big[\mathrm{Power}(e)\big]$$
Again, I’d like to emphasize that this formalism is assuredly wrong. Most obviously, the $\mathrm{Sim}$ factor is overly restrictive in how it blocks the AI from trying to help humans make progress on developing better values, and might even lead the AI to brainwashing the principal to unlearn any value-updates that are downstream of the AI’s actions. It also fails to capture anything like the aspect of corrigibility that’s about robustness; there’s no guarantee that this agent behaves anything like safely when its world-model (or whatever) is flawed.[7] There’s no special term about noticing issues where the principal failed to foresee some outcome and warning them about it. And furthermore, there’s no story in that function for how the agent should behave gracefully if it believes the principal is dead. On top of all these glaring flaws, I have no faith that I can even see all the issues here, and so I suspect that there are additional, more subtle issues.

Nevertheless, this is the best formal description of corrigibility I’ve been able to think of that attempts to name it directly. Additional work is needed to see how/​where this measure breaks down and fails to capture our intuition. But for the moment let’s turn our attention to how a policy selected for maximizing this metric will do on the shutdown problem.

Towards Shutdownability

The original shutdown problem consists of three timesteps: the agent acts (potentially disabling the principal and/​or shutdown button), the principal acts (potentially pressing the button), and then the agent responds (potentially shutting down). To evaluate this situation using my proposed framework, we can use this causal diagram:

Here, we think of the Policy node as essentially dictating, simultaneously, the Agent Early Action, and the lookup table for Agent Late Action which takes an entry for the Principal Values and Principal Action, and returns an action for the AI to take. By adopting the lookup-table frame, we are essentially assuming that the AI has full knowledge of the principal, including their Values, when it takes its Late Action. But note that aside from the $\mathrm{Sim}$ factor which disincentivizes modifying the distribution over Values, the Principal’s Values don’t show up at all in the corrigibility score. Thus the Agent Late Action part of the policy really only depends on the Principal’s Action, and we should expect it to be equally potent even if it can’t see the Principal’s Values at all.

But this, of course, doesn’t mean that the Values are unimportant! Merely that they are only important in how they manifest themselves in the Principal’s Action, to which the AI does attend. Consider how, if the Principal’s Actions are uncorrelated with their Values, the principal’s power is always zero, regardless of what the AI does. And I suspect it can be proven that even in the less extreme case, any pressure on the Principal’s Action that comes from the Agent’s Early Act (or another part of the Environment), rather than the Values, is disempowering. The flip-side of this is that one way to empower the principal is to take an Early Action that strengthens the causal impact from the Principal’s Values to their Actions, such as by giving them a way to express their values in words and encouraging them to do so.

Consider how much easier the AI’s job is if the Principal takes an Action which completely encodes their Value function (except for an epsilon chance of randomly doing the wrong thing), as well as perhaps having some other impact on the world. If the AI is very powerful, we can imagine it setting up a policy of changing the Domain into the optimal configuration according to the encoded values within the Principal’s Action. In other words, let $P(D \mid e, v', a)$ be a delta spike on $\operatorname{argmax}_d v_a(d)$, where $v_a$ is the value function that can be reconstructed by decoding the values embedded in the action $a$. Or more concretely, when we know how $a$ is sampled using some $v$, we can replace $v_a$ with the original $v$.

Notice that once these changes are made, neither the Action nor the Environment shows up in the equation for local power at all; it collapses to approximately:

$$\mathrm{LP}_Q(e, v) \;\approx\; \max_d v(d) \;-\; \mathbb{E}_{v' \sim Q(V)}\Big[v\big(\operatorname{argmax}_d v'(d)\big)\Big]$$

(I’ll give a quick numerical sketch of this expression after the list below.) Thus I claim that optimizing for corrigibility, as I’ve formalized it, boils down to simultaneously:

  • Preserving $P(V \mid \pi)$ to be similar to $P(V \mid \pi_\varnothing)$
  • Pushing the principal to express their values in their actions

  • Optimizing the world in accordance with the best attempt at reconstructing the principal’s values from their action, while ignoring other lines of evidence about their values
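Here is the promised tiny, purely illustrative check of that collapsed expression, reusing the made-up food utilities from earlier (the Monte Carlo stand-in for $Q$ is again my own choice):

```python
import random

# Approximate local power once the late policy optimizes the Domain for the decoded values:
#   LP ~= max_d v(d) - E_{v' ~ Q}[ v(argmax_d v'(d)) ]
FOODS = ["broccoli", "cake", "pizza"]
v = {"broccoli": -50, "cake": 10, "pizza": 80}

random.seed(0)
samples = [{f: random.uniform(-100, 100) for f in FOODS} for _ in range(20000)]

best_for_alice = max(v.values())                                            # 80
second_term = sum(v[max(FOODS, key=vp.get)] for vp in samples) / len(samples)

print(round(best_for_alice - second_term, 1))   # roughly 80 - 13.3, i.e. about 67
```

The second term here is exactly the quantity discussed in the next paragraph: how good Alice’s values rate worlds optimized for random simple value functions.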

There’s also a somewhat worrying observation that empowerment, by this definition, can be accomplished through pessimizing how good each simple value function believes worlds optimized according to other simple value functions are. I’m not sure how worried we should be about that second term in the equation above (i.e. $\mathbb{E}_{v' \sim Q(V)}\big[v\big(\operatorname{argmax}_d v'(d)\big)\big]$), and it probably needs more thought.

Taken together, does all this produce shutdownability? Well, kinda! Suppose that the actual distribution of Values is an almost-delta-spike on wanting the AI to be shut down. Even if the AI knows this, it won’t shut down with its Early Act, because then it would be unable to respond to other actions, regardless of how likely they are to be taken. Instead, for its Early Act, the AI will work towards preserving the principal’s values, helping the principal express themself, and empowering the Late Action to be able to rearrange things to suit various possible simple values. The human then says “shut down please” (perhaps by hitting a cartoonish, red button), which reflects the way they value the AI shutting down. The AI can then shut down in response to that action, thereby empowering the principal. Victory!

Or, well, almost. Notably this setup fails to capture an essential part of the original scenario, which is that the AI is supposed to do useful work prior to being instructed to shut down, which I reference using the shorthand of “making paperclips.” The corrigible AI we just examined does take actions before the principal gives a command, but they’re just setup for later. In order to fully solve the problem we need to extend it so that the principal can take multiple actions: first to instruct the AI to make paperclips, and then to tell the AI to shut down. But to do this we need to extend our framework a bit…

Problems with Time

Humans change over time, including by having different values. In the story presented above we assumed a single Values node that captures what the principal cares about, but this obviously fails to capture their changing nature as a human. Furthermore, it supposes a weirdness where nothing the AI does after the human starts to act can influence the human’s Values, since they’re upstream of Actions in the causal diagram. More realistic (but still fake) would be a network that reflects a series of timesteps by having a distinct Value and Action node for each time.

Should we also suppose a distinct Domain node for each time? The Domain is the space that possible Values are defined over, and it seems silly to me to suppose that one cannot care about how things will go in the future, or even about how things went in the past. Thus for the moment we’ll say there’s a single Domain that’s downstream of all relevant nodes, which captures all the relevant details that possible principals might Value.

There’s certainly a need for a distinct Environment for each timestep, however, and it’s within this Environment that the AI takes actions. We can also see the Environment as mediating the carry-over effects of Values and Actions. In other words, rather than my Values at t=0 having a direct impact on my Values at t=1, we can see those Values as having causal impact on my brain state or something, which then goes on to influence my Values at t=1.

(Domain node/​arrows left out to reduce clutter.)

Our $\mathrm{LP}_Q$ function previously took a setting of all the nodes upstream of Values, and this can be generalized to our new setup by having a different Environment $e_t$ for each Values node, or equivalently, for each time. Nothing stops us from similarly having a notion of $\mathrm{Power}_t$ for each timestep which treats all other Values and Actions as part of that timestep’s Environment. With a finite time-window, we can then do the simple thing and define the overall corrigibility score by summing across timesteps:

$$\mathrm{Corr}(\pi) \;=\; \sum_{t} \mathrm{Sim}\big(P(V_t \mid \pi),\, P(V_t \mid \pi_\varnothing)\big) \times \mathbb{E}_{e_t \sim P(E_t \mid \pi)}\big[\mathrm{Power}_t(e_t)\big]$$
In other words, this metric treats each time-slice as equally important. If converting the light-cone into paperclips would be the best way to empower the t=0 principal (perhaps because they commanded the AI to do so), then the AI will forever after be inclined towards those ends. Under this story, no quantity of future commands to the effect of “yes, I know that I said I wanted paperclips, but I changed my mind” will erase the past. Whether this is a blessing or a curse is unclear; it also means that a command to make the world good can’t be undone by a future principal that grows a brain tumor or something.

On the flip side, suppose that the principal commands the agent to shut down. While obeying a shutdown command is correct in the one-timestep version we investigated earlier, it’s less clear that it’s correct to shut down early if there are many timesteps yet to come. By doing so, one is potentially disempowering future timeslices by depriving them of a powerful, obedient agent. One can easily imagine the agent refusing to spend down resources because future timeslices might decide to want them. (Recall that the AI conceives of future timeslices as potentially having arbitrary, random values, so no amount of evidence about what the principal will want down the line will change the agent’s reasoning.)

We also can see a new angle on how bad the $\mathrm{Sim}$ similarity metric between the actual distribution of Values and the counterfactual distribution of Values where the AI had the null policy is, when we consider a long stretch of time. Consider the reasonable belief that, if the AI had the null policy (by magic), the principal would deactivate it and build a new AI that didn’t have the null policy. If that AI then, in expectation, changed the principal’s values over the long-run, the real AI would feel a pressure to replicate that counterfactual brainwashing (which would, in turn, increase the sense that counterfactual AIs would also execute such brainwashing).

Any one of these features would strike me as a good reason to reject a policy selected for having a high $\mathrm{Corr}$ score as actually corrigible. Taken together, it’s clear that we’ve taken a wrong step somewhere.

Being Present

More ideal than a timeless average would be somehow choosing a policy that puts much more weight on “the present” and discounts the past and future. This would solve a variety of headaches:

  • By privileging the present, the agent would have a clear incentive to pivot to following orders that contradict those of the past.

  • By privileging the present, the agent would have a good reason to burn resources to obey the principal in the moment, even if it deprives the future of such resources (including “active AI servants” vis-a-vis shutdown).

  • We’d move away from the story of a fixed time-window, and allow the more natural perspective of an indefinitely long time-horizon.

But what does “the present” even mean, when we consider picking a policy? Policies are, by their nature, timeless—they indicate how the agent behaves in any situation, including situations at different times. Well, leaning on an earlier intuition, suppose that our policy effectively sets a lookup table for each of the agent’s actions. We can then, mathematically at least, decompose the whole policy into sub-policies that each dictate only one lookup table. And now I want to consider the world where we somehow enforce that each sub-policy is identical.

To be clear, I am not demanding that the AI behave the same way in all situations! For instance, the AI on Tuesday can see what the AI on Monday did, but not vice-versa. The AI’s policy could then be “if I can see Monday’s action say ‘late’, else say ‘early.’” What I am demanding is that if there’s an information screen between the AI’s actions, such that the AI on Tuesday can’t distinguish its context from the AI on Monday, each action must come from the same distribution. To do this we have to force a known input space (sensors) and action space (actuators) for all times, which is the standard Cartesian frame.

This seems highly reasonable! I’m basically pointing out that the notion of objectively-known timesteps in the earlier framing of our problem produces more degrees of freedom in our policy than we have in reality. The real machine doesn’t innately know what time it is, and must vary its actions based on clock observations, memories, etc., rather than some magical awareness of “where it is in the causal graph.”
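To make the “one shared sub-policy” idea concrete, here is a minimal sketch; the observation strings and the Monday/Tuesday framing just mirror the example above and aren’t a real interface:

```python
# The agent is a single function from observation to action; there is no separate
# lookup table per timestep, so any sense of "what time it is" must come from the
# observation itself (clocks, memories, seeing earlier actions, and so on).

def policy(observation: str) -> str:
    # The Tuesday-agent can see what the Monday-agent did; the Monday-agent cannot.
    if "monday_action=" in observation:
        return "late"
    return "early"

print(policy("no_memory_of_prior_actions"))   # acting "on Monday"  -> 'early'
print(policy("monday_action=early"))          # acting "on Tuesday" -> 'late'
```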

With this “restriction” in hand, we can rescue our earlier formalism by assuming a $P(t)$ distribution over times which is the AI’s best guess as to when it is, given its inputs. We can then trade our uniform average for that much more concentrated distribution, making the AI more myopic as it gets more confident about what time it is. In the limit, it will only act to satisfy the principal’s present values according to their present actions.[8]
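One way to sketch this reweighting of the summed score from above, where $o$ stands for the AI’s current inputs:

$$\mathrm{Corr}(\pi \mid o) \;=\; \sum_{t} P(t \mid o)\; \mathrm{Sim}\big(P(V_t \mid \pi),\, P(V_t \mid \pi_\varnothing)\big)\; \mathbb{E}_{e_t \sim P(E_t \mid \pi)}\big[\mathrm{Power}_t(e_t)\big]$$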

This might be too extreme in the opposite direction. It may be the case that a little smoothing on the time distribution produces nice effects. (The wishful thinking side of me suggests: “Maybe we get check-with-the-principal behavior this way!”) It might also be the case that we get nice things by adding in a smoothed penalty for manipulation, such that the AI primarily acts to empower the present principal, but it also cares about not manipulating the past/future principals. (Wishful thinking: “This sounds like it could generate the kind of local-scope restriction seen in Corrigibility Intuition!”) And lastly, it seems wise to replace the null policy $\pi_\varnothing$ in our metrics with a counterfactual where the policy deviates only for the present moment, or at least play around with alternatives that leverage beliefs about what time it is, in an effort to avoid the brainwashing problem introduced at the end of the last section. Overall it should be clear that my efforts at formalism here are more like a trailhead than a full solution, and there are lots of unanswered questions that demand additional thought and experimentation.

Formal Measures Should be Taken Lightly

As a final note, I want to emphasize that my proposed measures and definitions should not be taken very seriously. There are lots of good reasons for exploring formalisms, but at our present level of knowledge and skill, I think it would be a grave mistake to put these attempts at the heart of any sort of AGI training process. These measures are, in addition to being wrong and incomplete, computationally intractable at scale. To be able to use them in an expected-score-maximizer or as a reward/​loss function for training, a measure like I just gave would need to be approximated. But insofar as one is training a heuristic approximation of formal corrigibility, it seems likely to me that the better course would be to simply imitate examples of corrigibility collected in a carefully-selected dataset. I have far more trust in human intuition being able to spot subtle incorrigibility in a concrete setting than I have faith in developing an equation which, when approximated, gives good outcomes. In attempting to fit behavior to match a set of well-chosen examples, I believe there’s some chance of the AI catching the gist of corrigibility, even if it’s only ever implicit in the data.


Next up: 4. Existing Writing on Corrigibility

Return to 0. CAST: Corrigibility as Singular Target

  1. ^

    It makes sense to me to normalize all possible value functions to the same bounded range so that they’re comparable. Unbounded utility seems problematic for a variety of reasons, and in the absence of normalization we end up arbitrarily favoring values that pick a higher bound.

  2. ^

    Why don’t we normalize the value function to extremize the value of outcomes, such as by making pizza worth 100 utility and broccoli yield −100 utility? The problem with extremizing value functions in this way is that it makes the assumption that the Domain in question captures everything that Alice cares about. I’m interested in Domain-specific power, and thus want to include value functions like the example I provide.

  3. ^

    One might wonder why we even need to sample the Environment node at all (rather than marginalizing it out). The main reason is that if we don’t define local power with respect to some known Environment, then the choice of Values could then impact the distribution over latent nodes upstream of Values in a way that doesn’t match the kind of reasoning we want to be doing. For example, consider an AI which generates a random number, then uses that number to choose both what to optimize for and what to set the human’s Values to. Knowing the human’s Values would then allow inferring what the random number was, and concluding that those values are satisfied.

  4. ^

    In case it’s not obvious, this doesn’t preclude the AI responding to evidence in the least. We simply see the evidence as part of the context which is being operated within by the given policy. For instance, a doctor can have a policy of administering treatment X to people expressing symptom Y without having to update the policy in response to the symptoms.

  5. ^

    Why a product rather than a sum? Because it’s not obvious to me what the relative weighting of the two terms should be. How much value modification is 15 units of empowerment worth? What even are the relevant units? By defining this as a product, we can guarantee that both factors need to be high in order for it to be maximized.

  6. ^

    An example of one such function is exp(-D(X,X’)), where D is the Kullback-Leibler divergence.

  7. ^

    My intuition says that robustness is about a policy being stable even as we inject entropy into the epistemic state (i.e. considering a “higher temperature”), but I haven’t worked through the details beyond that first-guess.

  8. ^

    Careful readers will note that, using the proposed structure, there are actually two AI actions per timestep: Early and Late. The P distribution over timesteps must then also be augmented by a sub-distribution over which of those two actions the AI is currently taking, insofar as it matters to the AI’s action (which it definitely does).