My decomposition of the alignment problem

Epistemic status: Exploratory

Summary: In this post I will decompose the alignment problem into subproblems and frame existing approaches in terms of their relations to the subproblems. I will focus more on the epistemic process than on the results of this particular problem factorization, since the aim is to obtain an epistemic strategy that can be generalized to new problems.

The case for problem decomposition

Degrees of freedom

One way to frame the advantage of factoring a problem is that doing so allows degrees of freedom to add up instead of multiply. If the solution space of a problem P has n degrees of freedom, then without decomposing the problem, we need to search through all possible combinations of them to find a solution. However, if we can decompose P into two independent subproblems P1 and P2, where the degrees of freedom of each subproblem do not affect the other subproblem, then we get to search through the solutions of P1 and P2 independently, which means the sizes of the solution spaces of P1 and P2 add up instead of multiplying combinatorially. It’s important to note that:

  • The subproblems of our factoring need to be approximately independent. However, which degrees of freedom can be varied independently without affecting other subproblems is a feature of the problem space itself; we don’t get to choose the problem factorization
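
As a rough illustration of the counting argument above, the sketch below compares how many candidates a brute-force search evaluates with and without a decomposition. The toy objective, the split into two groups of three variables, and all function names are assumptions made for illustration only.

```python
from itertools import product

def joint_search(score, n_vars, values):
    """Brute-force search over every combination of all variables."""
    evaluated = 0
    best, best_score = None, float("-inf")
    for candidate in product(values, repeat=n_vars):
        evaluated += 1
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, evaluated

def factored_search(sub_scores, sizes, values):
    """Search each independent subproblem separately, then concatenate the results."""
    evaluated = 0
    solution = ()
    for sub_score, size in zip(sub_scores, sizes):
        best, n = joint_search(sub_score, size, values)
        solution += best
        evaluated += n
    return solution, evaluated

# Hypothetical toy problem: 6 variables, each taking one of 4 values.
values = range(4)
target = (3, 1, 2, 0, 2, 1)

def score_first(x):   # depends only on variables 0..2
    return -sum((a - b) ** 2 for a, b in zip(x, target[:3]))

def score_second(x):  # depends only on variables 3..5
    return -sum((a - b) ** 2 for a, b in zip(x, target[3:]))

def score_total(x):   # additive, so the two halves are independent
    return score_first(x[:3]) + score_second(x[3:])

joint_solution, joint_cost = joint_search(score_total, 6, values)
factored_solution, factored_cost = factored_search(
    [score_first, score_second], [3, 3], values
)
print(joint_cost, factored_cost)  # prints: 4096 128
```

Both searches find the same solution, but the joint search evaluates 4^6 candidates while the factored search evaluates 4^3 + 4^3. The saving only exists because the toy objective really is a sum of two independent parts, which is the condition the bullet above points at.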

Combining forward chaining with backward chaining

Problem factoring is a form of backchaining from desired end states. In addition to this approach, we can also forward-chain from the status quo to gain information about the problem domain, which may be helpful for finding new angles of attack. However, forward chaining is most effective when we have adequate heuristics that guide our search towards insights that are more useful and generalizable. One way to develop heuristics about which insights are generalizable is to keep a wide variety of problems to which we can apply new techniques, and to bias our search towards insights that are helpful for multiple problems.

  • We can do this by having a wide variety of problems from different fields, but we can also do this by having a wide variety of subproblems that come from factorizing the same problem

  • Searching for insights that are useful for multiple subproblems can help us identify robust bottlenecks to alignment

Concretely, decomposing the alignment problem into subproblems means that whenever we stumble upon a new insight that may be relevant to alignment, we can try to apply it to each of the subproblems, and gain a more concrete intuition about what sorts of insights are useful. In addition, we can frame existing approaches in terms of how they can help us address subproblems of alignment, so that when we consider similar approaches, we can direct our focus onto the same set of subproblems.

Scope

In this post we will focus on a narrow class of transformative AIs (TAIs) that can be roughly factored into three components:

  • A world model

  • A general purpose search (GPS) module which takes a goal/optimization target and returns a plan for achieving that goal

  • A targeting process which maps variables in the world model to the optimization target of general purpose search

While I do believe that it’s important to figure out how to align AIs with other possible architectures, we will not discuss them in this post. Nevertheless, the following are some justifications for focusing on TAIs that can be factored into a world model, a GPS module, and a targeting process:

  • A world model seems necessary for a TAI as it allows the AI to respond to unobserved parts of the world

  • General purpose search is instrumentally convergent:

    • An AI’s world model at any given point is likely incomplete: There are some causal relationships between the variables in its world model that the AI doesn’t know about yet, because the AI is smaller than the world

    • The AI may discover new instrumental subgoals when it learns about new causal relationships between variables in its world model

      • Concretely, the AI discovers a new instrumental subgoal upon learning a new causal pathway from some variable C in its world model to its terminal values

    • A powerful AI should be able to optimize for an instrumental subgoal upon discovering that subgoal as it is an effective way of achieving its terminal goals

    • For the AI to be able to optimize for a new instrumental subgoal upon discovering it, it must be capable of optimizing for a wide variety of goals beforehand, since many goals can turn out to be an instrumental subgoal

    • In addition, the AI should be able to flexibly set its optimization target so that it can optimize for new instrumental subgoals on the fly. This entails the existence of something like general purpose search.

    • If we can control the optimization target of general purpose search, we can sidestep the inner alignment problem by retargeting the search

  • We can leverage a TAI’s model of human values to decide the optimization target:

    • Powerful TAIs will likely have a mechanistic model of the behavior and goals of humans, as this is helpful for making accurate predictions about humans. We might want to use information from that mechanistic model to decide what goals the AI should optimize for, since our ideal target for alignment ultimately depends on human values, and a mechanistic model of humans contains information about those values. An example of this approach is for the AI to point to human goals inside its world model, and let the AI optimize for the pointer of those goals.

    • In order to accommodate this class of approaches, the optimization target should be able to depend on variables in the world model. We call this mapping from variables in the world model to the optimization target the targeting process

    • We might also want to consider AIs that optimize for a single fixed goal independent of variables in the world model. For this type of AI, we simply model its targeting process as a constant function

  • Additional Notes:

    • The world model is dual-use:

      • On one hand, a better world model advances capabilities as it is used by general purpose search to select more effective actions

      • On the other hand, a better world model can allow the AI to have a better model of “what humans want”, which can lead to a more accurate optimization target given an adequate targeting process

    • Alignment is mostly about designing the targeting process, but considerations about the targeting process may also influence design decisions for the world model and general purpose search
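
To make the three-component picture more concrete, here is a minimal sketch of how the pieces could fit together. Everything in it is a toy assumption: the world model is a dictionary, general purpose search is a one-step argmax over three hypothetical actions, and the variable names (`room_temp`, `human_wants_temp`) are invented for illustration. It is not meant as a claim about how a real TAI would be structured.

```python
from typing import Callable, Dict

WorldModel = Dict[str, float]                # variable name -> current estimate
Goal = Callable[[WorldModel], float]         # scores a predicted world state
Action = Callable[[WorldModel], WorldModel]  # predicted effect of acting

def general_purpose_search(model: WorldModel, goal: Goal,
                           actions: Dict[str, Action]) -> str:
    """Return the action whose predicted outcome scores best under `goal`."""
    return max(actions, key=lambda name: goal(actions[name](model)))

def pointer_targeting(model: WorldModel) -> Goal:
    """Targeting process that reads the goal off a variable in the world model."""
    wanted = model["human_wants_temp"]
    return lambda state: -abs(state["room_temp"] - wanted)

def constant_targeting(model: WorldModel) -> Goal:
    """Targeting process that ignores the world model: a single fixed goal."""
    return lambda state: state["room_temp"]

def shift_temp(delta: float) -> Action:
    def act(model: WorldModel) -> WorldModel:
        return {**model, "room_temp": model["room_temp"] + delta}
    return act

actions = {"heat": shift_temp(3.0), "wait": shift_temp(0.0), "cool": shift_temp(-3.0)}
model = {"room_temp": 21.0, "human_wants_temp": 21.0}

def choose(targeting, model):
    # The targeting process maps the world model to an optimization target,
    # which is then handed to general purpose search.
    return general_purpose_search(model, targeting(model), actions)

pointer_plan = choose(pointer_targeting, model)       # "wait"
constant_plan = choose(constant_targeting, model)     # "heat"

# When the world model's estimate of what the human wants changes,
# the pointer-based target changes with it; the search code is untouched.
updated = {**model, "human_wants_temp": 18.0}
retargeted_plan = choose(pointer_targeting, updated)  # "cool"
```

The same search routine serves both targeting processes, which is the sense in which controlling the optimization target is separable from the search itself in this toy setup.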

Where does the information come from, and how do we plan on using it?

Not all problems should be framed as optimization problems

There’s a tempting style of thinking which tries to frame all problems as optimization problems (see here and here). This style of thinking seems to make sense for dualistic agents: After all, the dualistic agent has preferences over the environment, it has well-defined input and output channels, and it can hold an entire model of the environment inside its mind. All that’s left to do is to optimize the environment against its preferences using the output channel.

However, we run into issues when we try to translate this style of thinking to embedded agents: The embedded agent has some degree of introspective uncertainty, including over its own preferences, which means it doesn’t always know what objective function to optimize for; moreover, the goals of the embedded agent may depend on information in the environment that isn’t fully accessible to the agent. For instance, an embedded agent might try to satisfy the preferences of another agent, and because the agent is logically non-omniscient and smaller than the environment, it’s not straightforward to simply calculate expected utilities over all possible worlds. As a result, embedded agents can face many problems where most of the difficulty stems from finding an adequate set of criteria to optimize against, as opposed to finding out how to optimize against known criteria. The agent cannot just optimize against an arbitrary proxy for its objectives either, as that can lead to Goodhart failures.
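
As a small illustration of the last point, the sketch below shows a proxy that agrees with the true objective for small values but diverges under strong optimization. Both functions are hypothetical and chosen only to make the divergence easy to see.

```python
def true_value(x):
    # Hypothetical true objective: more is better up to a point, then harmful.
    return x - x * x / 10

def proxy(x):
    # Hypothetical proxy: tracks the true objective only for small x.
    return x

candidates = range(0, 21)
proxy_choice = max(candidates, key=proxy)      # 20
true_choice = max(candidates, key=true_value)  # 5

print(true_value(proxy_choice), true_value(true_choice))  # prints: -20.0 2.5
```

Optimizing the proxy as hard as possible picks the candidate that is worst under the true objective, even though the two agree on every comparison between candidates below 5.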

The alignment problem is a central example where the main bottleneck is defining an objective as opposed to optimizing against it. And because framing all problems as optimization problems assumes that we already know the objectives, we need to find an alternative framework which helps us think about the task of formulating the problem itself.

Desiderata, sources and bridges

One way to think about alignment which I find helpful is that we have human values on one hand, and the goals or optimization targets of the AI on the other, and we want to establish a bridge which allows information to flow from the former to the latter. We might need to formulate properties that we want this bridge to have, drawing inspiration from many different places, or try to implement properties that we already think are desirable. The following are some important features of this picture that are different from the dualistic optimization viewpoint:

  • Desiderata vs objective functions: In this picture, we want to come up with desiderata which tell us things like “how can I recognize an adequate solution if I see one?” or “how can I recognize an adequate formalization of the problem if I see one?”. Although it seems like both desiderata and objective functions are to be optimized against, there are some important differences:

    • Defeasibility: For a dualistic agent, the objective function which it optimizes against can never be ‘wrong’. However, as embedded agents, we have introspective uncertainty over our own values, which means our proposed desiderata are subject to revision. Desiderata can be used to narrow our search space, but we should also test them by searching for counterexamples

    • Meta-ness: Desiderata don’t have to specify what constitutes a good solution; they can also specify what constitutes a good formulation of the problem, or specify a way to specify a good formulation of the problem, and so on

      • For alignment, we can picture this as establishing a sequence of bridges, where bridge B_0 allows information to flow from humans to the goals of the AI, and each bridge B_(n+1) allows information to flow from humans to bridge B_n

      • Allowing our desiderata to be “meta” allows us to consider approaches such as indirect normativity, which may be important when it’s infeasible to formulate the object-level problems ourselves

  • Sources and bridges: For an optimization problem in the dualistic context, we have input variables which we get to vary, and our only job is to find the input which maximizes our objective function. For embedded agents, however, the problem definition itself may depend on variables in the environment which we don’t get to directly perceive, which means we not only need to consider the degrees of freedom which we get to control, but also the sources of information about where to find good solutions and desiderata. Problem solving can be thought of as establishing a bridge which flows from the sources of information and the degrees of freedom we get to control to the desiderata

    • Note that this “bridge” of information flow doesn’t have to route through us: For instance, we might design an auction with the intention of achieving efficiency, and this objective depends on the preferences of the participants. However, when we run the auction, we never observe the full preferences of any participant; we merely establish a mechanism through which that information can be used to satisfy our desiderata

    • Focusing on sources of information also allows us to create more realistic bounds for an embedded agent’s performance, where we consider the best we can do given not just what we can control but also what we know.

  • Main Benefits of this framing:

    • For problems without an adequate formalization yet, this framing highlights that desiderata can be defeasible, and that we might want to use indirect approaches which operate at a meta-level

    • For high-dimensional optimization problems, this framing places the focus on identifying information about where to find good solutions

    • For problems whose definitions depend on unknown variables, this framing puts emphasis on identifying those variables using sources of information that are available to us

    • In certain cases, the problem definition depends on variables that are unobservable to us; this framing nevertheless allows us to consider solutions which route through those unobservables but not through us
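
The auction example above can be sketched in code. The version below is an ascending-price auction, which is my choice of mechanism for illustration (the text above does not specify one). Bidders are modelled as callbacks that only answer whether they would stay in at a given price, and the assumption that each bidder stays in exactly while the price does not exceed their private value is also mine.

```python
from typing import Callable, Dict, Tuple

Bidder = Callable[[int], bool]  # answers: "are you still in at this price?"

def truthful_bidder(private_value: int) -> Bidder:
    # Assumed strategy: stay in while the price does not exceed the private value.
    return lambda price: price <= private_value

def ascending_auction(bidders: Dict[str, Bidder], step: int = 1) -> Tuple[str, int]:
    """Raise the price until at most one bidder is still willing to pay it.

    The auctioneer only sees yes/no answers, never a bidder's private value.
    """
    price = 0
    active = list(bidders)
    while len(active) > 1:
        raised = price + step
        staying = [name for name in active if bidders[name](raised)]
        if not staying:
            # Everyone remaining dropped out at once: break the tie by order.
            return active[0], price
        price, active = raised, staying
    return active[0], price

bidders = {
    "alice": truthful_bidder(10),
    "bob": truthful_bidder(7),
    "carol": truthful_bidder(4),
}
winner, price = ascending_auction(bidders)
print(winner, price)  # prints: alice 8
```

The item goes to the participant who values it most, yet the auctioneer only learns that the winner's value is at least 8. The information about preferences flows into the outcome without ever being fully observed by the designer.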

Decomposing the AI alignment problem

Our main objective is to find optimization targets that lead to desirable outcomes when optimized against, and there are different sources of information which tell us what properties we want our optimization targets to have. To factorize the alignment problem, a natural place to start is to factorize these sources of information which can help us narrow our search space for our optimization targets.

One axis of factorization is the information that we have a priori vs a posteriori, that is, what information do we have before the AI starts developing a world model, vs after we have access to its world model? These two cases seem to be mostly independent because gaining access to an AI’s world model gives us new information that isn’t accessible to us a priori.

A priori

When we haven’t started training an AI and we don’t have access to the AI’s world model, there are two constraints that limit the information we have about what optimization targets are desirable:

  • We don’t have access to the AI’s ontology of the world, which means that if we have certain preferences over real world objects, we can’t make assumptions about how those objects will be represented by the AI’s world model

  • The AI hasn’t developed a world model, which means it doesn’t have a mechanistic model of humans yet. As a result, we cannot leverage the AI’s model of human values to determine properties of the optimization target

When we don’t have access to certain types of information, we want to seek considerations which don’t make assumptions about them. As a result, when we don’t have access to the AI’s ontology and its model of human values, we should seek ontology-invariant and value-free considerations:

Value-free considerations

The main benefit of allowing the optimization target of an AI to depend on variables in the world model is that we can potentially “point” to human values inside the world model and set them as the optimization target. However, that information isn’t available when the AI hasn’t developed a world model yet, and our introspective uncertainty bars us from directly specifying our own values in the AI’s ontology, which means at this stage we should seek desirable properties of the optimization target that don’t depend on contingent features of human values. We call such considerations “value-free”.

Since value-free considerations don’t make assumptions about contingent properties of human values, they must be universal across a wide variety of agents. In other words, to search for value-free properties, we should focus on properties of the optimization target which are instrumentally convergent for agents with diverse values.

Examples of value-free considerations

  • Natural latents are features of the environment which a wide variety of agents would convergently model as latent variables in their ontologies, and having something as a latent variable in your ontology is a prerequisite for caring about that thing. As a result, figuring out what properties of the environment are natural latents can help us narrow down the space of things that we might want our AIs to care about, which would give us a better prior over the space of possible optimization targets. Since natural latents are instrumentally convergent, we don’t need to make assumptions about contingent properties of human values to discover them.

  • Corrigibility/impact measures/mild optimization: For agents with introspective uncertainty over their own values, there may be features of the environment that they “unconsciously” care about and have optimized for without being fully aware of it. This means that techniques such as corrigibility, impact measures and mild optimization that systematically avoid side effects can be convergently useful for agents with introspective uncertainty, as they can help preserve the features that the agent is unaware that it cares about. Insofar as these properties are value-free, we can imbue them in the optimization target before we have a specification of human values
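
To illustrate why avoiding side effects can protect features an agent doesn’t know it cares about, here is a deliberately crude sketch. The penalty below simply counts changed features relative to a baseline; it is a hypothetical stand-in and not an implementation of any published impact measure. The feature names and numbers are likewise invented.

```python
# State the world would be in if the agent did nothing.
baseline = {"task_done": 0, "vase_intact": 1, "door_closed": 1}

# Hypothetical predicted outcomes of two plans.
outcomes = {
    "fast":    {"task_done": 10, "vase_intact": 0, "door_closed": 0},
    "careful": {"task_done": 8,  "vase_intact": 1, "door_closed": 1},
}

def known_utility(state):
    # The only value the agent is explicitly aware of.
    return state["task_done"]

def side_effects(state, intended=("task_done",)):
    # Count features, other than the intended one, that differ from the baseline.
    return sum(1 for k in baseline if k not in intended and state[k] != baseline[k])

def penalized_utility(state, weight=2.0):
    return known_utility(state) - weight * side_effects(state)

naive_choice = max(outcomes, key=lambda a: known_utility(outcomes[a]))     # "fast"
mild_choice = max(outcomes, key=lambda a: penalized_utility(outcomes[a]))  # "careful"
```

The penalty never mentions the vase specifically, so it does not need to know which features matter. That is the sense in which this kind of consideration can be applied before we have a specification of values.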

Ontology-invariant considerations

Not having access to the AI’s world model means that we don’t know how the internal representations of the AI correspond to physical things in the real world. This means that when we have preferences about real world objects, we don’t know how that preference should be expressed in relation to the AI’s internal representations. In other words, we don’t know how to make the AI care about apples and dogs when we don’t know which parts of the AI’s mind point to apples and dogs.

When we face such limitations, we should seek properties of the optimization target that are desirable regardless of what ontology the AI might end up developing; when we don’t know how the AI will describe the world, we can still implement the parts of our preferences which don’t depend on which description of the world will end up being used.

Examples of ontology-invariant considerations

  • Staying in distribution: We have a preference for AIs to operate within contexts that they have been in before so that they can avoid out-of-distribution failures. The hope is that the concept of “out of distribution” is expressible over a wide range of world models. Insofar as this is true, we can implement this preference without knowing beforehand what specific ontology the AI will end up using.

  • Optimizing worst case performance: When we’re especially concerned about the worst-case outcome, we can design our AIs to optimize for their performance in the worst possible world. This preference can be implemented in most world models which are capable of representing uncertainties, which means we don’t need to know what specific representation our AI will use in order to implement it.
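
The worst-case criterion can be written down without reference to what the possible worlds actually contain, which is what makes it a candidate for ontology invariance. The sketch below contrasts it with expected-value maximization; the actions, worlds, payoffs and credences are all made-up numbers.

```python
# Hypothetical payoffs of each action in each world the model considers possible.
payoffs = {
    "risky": {"world_a": 10.0, "world_b": -20.0},
    "safe":  {"world_a": 4.0,  "world_b": 3.0},
}
beliefs = {"world_a": 0.9, "world_b": 0.1}  # assumed credences over worlds

def expected_value(action):
    return sum(beliefs[w] * payoffs[action][w] for w in beliefs)

def worst_case(action):
    # Only needs the set of worlds considered possible, not their internals.
    return min(payoffs[action].values())

expected_choice = max(payoffs, key=expected_value)  # "risky"
maximin_choice = max(payoffs, key=worst_case)       # "safe"
```

The maximin rule treats worlds as opaque labels, so the same rule applies whatever representation the world model uses for them, as long as it can enumerate the worlds it considers possible.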

Advantage

The main benefit of a priori properties of the optimization target is that they can be deployed before the AI starts developing a sophisticated world model. In other words, they are more robust to scaling down.

A posteriori

We’ve discussed two limitations in the a priori stage when we don’t have access to the AI’s world model. The main question we should ask in the a posteriori stage is therefore: what opportunities are unlocked once those limitations are lifted? What new sources of information do we gain access to which we previously didn’t have?

The AI gets to observe us

Once the AI develops a sophisticated world model, that world model will likely contain information about human values. This means that a key consideration in the a posteriori stage is how we can leverage that information to determine properties of the optimization target.

Examples

  • The Pointers Problem: We want the AI to optimize for the real world things that we care about, not just our evaluation of outcomes. In order to achieve that, we need to figure out the correspondence between latent variables of our world models and the real world variables that they represent. Formalizing this correspondence in the AI’s ontology is a prerequisite for translating our preferences into criteria about which real world outcomes are desirable

  • Ontology identification: The variables or even the structure of the world model might change as the AI receives new observations, which means that if our optimization target is expressed in terms of variables of the AI’s world model, then it needs to be robust against possible changes in the way that the AI represents the world. In other words, we need to figure out how to robustly “point” to things in the territory even when our maps can change over time

  • Simulated long reflection: In addition to translating our current values to the AI’s optimization target, we might also want to use AIs to help us find our ideal values which we would endorse upon reflection. This will become more feasible if we can isolate a mechanistic model of humans from the AI’s world model and use that to simulate our reflection process

  • Active value learning: Science isn’t just about building models using existing observations, we also take actions or conduct experiments to gain new information about the domain we’re interested in. Given that the AI’s model of us can be imperfect, how can the AI ask the right questions or choose the right actions to gain information about our values?

  • Type signatures and true names: In order to leverage information about human values using variables in the AI’s world model, we need to be able to locate them inside the world model and interpret them in the right ways. In other words, we need to understand the type signatures of concepts such as “values” or “agents”, so that we can look for structures in the AI’s world model which match that type signature, and decode those structures correctly
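
One way to make the “ask the right questions” part of active value learning concrete is to pick the question with the highest expected information gain. This criterion is my own choice for the sketch and not something the examples above commit to. The hypotheses, credences and the assumption that the human answers deterministically according to their true ranking are all hypothetical.

```python
import math
from itertools import combinations

# Hypothetical hypotheses about a human's ranking of three items (best first),
# with the AI's current credence in each.
prior = {
    ("apple", "banana", "cherry"): 0.4,
    ("apple", "cherry", "banana"): 0.4,
    ("banana", "apple", "cherry"): 0.2,
}

def prefers(ranking, question):
    """Answer to 'do you prefer x to y?' if `ranking` were the true one."""
    x, y = question
    return ranking.index(x) < ranking.index(y)

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(belief, question, answer):
    """Condition on the answer by keeping only the consistent hypotheses."""
    consistent = {h: p for h, p in belief.items() if prefers(h, question) == answer}
    total = sum(consistent.values())
    return {h: p / total for h, p in consistent.items()}

def expected_information_gain(belief, question):
    """Prior entropy minus the expected entropy after hearing the answer."""
    gain = entropy(belief)
    for answer in (True, False):
        p_answer = sum(p for h, p in belief.items() if prefers(h, question) == answer)
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(belief, question, answer))
    return gain

questions = list(combinations(["apple", "banana", "cherry"], 2))
best_question = max(questions, key=lambda q: expected_information_gain(prior, q))
print(best_question)  # prints: ('banana', 'cherry')
```

Asking about apple versus cherry is worthless here because every hypothesis gives the same answer, while banana versus cherry splits the credence most evenly. A real system would also have to handle noisy answers and a model of the human that may itself be wrong, which this sketch ignores.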

We get to observe the AI’s world model

The second limitation that’s lifted when the AI starts developing a world model is that we get to inspect the world model and gain information about the AI’s ontology. This means that in addition to the AI gaining a better understanding of our values, we can also become better at designing the targeting process ourselves by understanding the AI’s world model

Examples

  • Interpretability: If we can figure out the relationship between variables in the AI’s world model and the real-world things they correspond to, we can manually design the optimization target to point to the real world things we care about

    • This can be viewed as the dual of the pointers problem, where for interpretability we are figuring out the relationship between the AI’s world model and the real world, while in the pointers problem we want the AI to understand the relationship between latent variables in humans’ world models and real world variables

  • Accelerated reflection: Although simulated long reflection should be faster than our actual reflection process, it relies on a human model which may be inaccurate, causing possible deviations from our actual reflection process. This suggests that there are comparative advantages both in using the AI’s model of our minds for reflection and in using our actual minds for reflection. We might want to combine the benefits of both using techniques such as debate, market making and cyborgism

Backpropagation

Our discussion mainly focused on considerations about the targeting process, but the targeting process is entangled with the world model and the general purpose search module. This means that we should backpropagate our desiderata for the targeting process to inform design decisions about the rest of the components. For instance, if we want our optimization target to be robust to ontology shifts, we should try to design world models which are capable of modeling the world at multiple levels of abstraction and explicitly representing the relationships between different levels.