Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

A Sharp Left Turn (SLT) is a possible rapid increase in AI system capabilities (such as planning and world modeling). This post will outline our current understanding of the most promising plan for getting through an SLT and how it could fail (conditional on an SLT occurring).

In a previous post, we broke down the SLT threat model into 3 claims:

  1. Capabilities will generalize far (i.e. to many domains)

  2. Alignment techniques that worked previously will fail during this transition

  3. Humans can’t intervene to prevent or align this transition

In that post, we proposed some possible mechanisms for Claim 1. This post investigates possible arguments and mechanisms for Claim 2.

Plan: we use alignment techniques to find a goal-aligned model before SLT occurs, and the model preserves its goals during the SLT.

We can try to learn a goal-aligned model before SLT occurs: a model that has beneficial goals and is able to reason about its own goals. This requires the model to have two properties: goal-directedness towards beneficial goals, and situational awareness (which enables the model to reason about its goals). Here we use the term “goal-directedness” in a weak sense (that includes humans and allows incoherent preferences) rather than a strong sense (that implies expected utility maximization).

One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don’t imply that the plan fails (e.g. it might be fine if interpretability or ELK techniques no longer work reliably during the transition if we can trust the model to manage the transition).

Step 1: Finding a goal-aligned model before SLT

We want to ensure that, before SLT, the model is goal-directed with a beneficial goal and has situational awareness. It’s important that the model acquires situational awareness at the right time: after it acquires beneficial goals. If situational awareness arises while the model has undesirable goals, this leads to deceptive alignment.

Thus, our model search process would follow a decision tree along these lines:

  • If situational awareness is detected without goal-directedness, restart the search.

  • If undesirable goal-directedness or early signs of deceptive alignment are detected, restart the search.

  • If an upcoming phase transition in capabilities is detected, and the model is not goal-aligned, restart the search.

  • If beneficial goal-directedness is detected without situational awareness, train the model for situational awareness.
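
The decision tree above can be sketched as a small function. This is only an illustration under strong assumptions: the `Observations` flags are hypothetical stand-ins for detectors that the techniques listed in this post would have to provide, and the `PROCEED` and `CONTINUE_SEARCH` outcomes are our own additions for cases the four branches do not cover.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RESTART_SEARCH = "restart the search"
    TRAIN_SITUATIONAL_AWARENESS = "train the model for situational awareness"
    PROCEED = "goal-aligned model found"        # assumption: not a branch in the text
    CONTINUE_SEARCH = "continue the search"     # assumption: not a branch in the text


@dataclass(frozen=True)
class Observations:
    """Hypothetical detector outputs; no such reliable detectors exist today."""
    situational_awareness: bool
    goal_directed: bool
    goals_beneficial: bool       # only meaningful when goal_directed is True
    deception_signs: bool
    transition_upcoming: bool


def is_goal_aligned(obs: Observations) -> bool:
    # Goal-aligned: beneficial goals plus the ability to reason about them.
    return (obs.goal_directed and obs.goals_beneficial
            and obs.situational_awareness and not obs.deception_signs)


def next_action(obs: Observations) -> Action:
    # The branches are checked in the order they appear in the list above.
    if obs.situational_awareness and not obs.goal_directed:
        return Action.RESTART_SEARCH
    if (obs.goal_directed and not obs.goals_beneficial) or obs.deception_signs:
        return Action.RESTART_SEARCH
    if obs.transition_upcoming and not is_goal_aligned(obs):
        return Action.RESTART_SEARCH
    if obs.goal_directed and obs.goals_beneficial and not obs.situational_awareness:
        return Action.TRAIN_SITUATIONAL_AWARENESS
    if is_goal_aligned(obs):
        return Action.PROCEED
    return Action.CONTINUE_SEARCH


obs = Observations(situational_awareness=False, goal_directed=True,
                   goals_beneficial=True, deception_signs=False,
                   transition_upcoming=False)
print(next_action(obs).value)  # train the model for situational awareness
```

Because the branches are checked in order, a model with beneficial goals but no situational awareness is discarded if a capability transition is imminent, and is trained for situational awareness only when it is not.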

Here are some ways alignment techniques could be applied in this plan:

  • Use ELK techniques to identify whether the model has situational awareness (i.e. whether it believes it’s inside a training process) and what goals it believes itself to have.

  • Use interpretability techniques to pick up on early signs of deceptive alignment and restart and improve the search. (If we develop more advanced interpretability techniques that make it possible to retarget the internal search of the model, we can retarget a goal-directed model to a beneficial goal instead of restarting the search.)

  • Use foundational progress on abstraction to determine whether the goals learned by the system are beneficial.

  • Use foundational research insights (e.g. selection theorems) to make the search process more effective at avoiding models with undesirable goals and finding goal-aligned models.

  • Use insights from studying capability transitions to predict sharp increases in problematic capabilities, and make sure we only cross that threshold after we have found a goal-aligned model using the other techniques on this list.

Why alignment techniques could fail to find a goal-aligned model before SLT occurs:

  • The space of deceptively aligned models is much bigger than the space of goal-aligned models, so the latter is very hard to find.

  • The inductive biases of the search favor misaligned models (see How likely is deceptive alignment? and Will Capabilities Generalise More? for arguments). Note that these arguments indicate that the alignment techniques have to fight an uphill battle to find a goal-aligned model, rather than providing a mechanism for why they would fail.

  • Predicting that capability transitions will occur does not tell us when they will happen or whether the model undergoing the transition would be aligned. If we use these predictions to restart the search, we could rule out goal-aligned models as well.

  • Interpretability is very hard: current methods are low-level, and it’s unclear how to get them to work at a high level (e.g. identifying complex goals and beliefs of a system rather than how it stores basic facts about the world).

It’s also possible that this model search process would impose an alignment tax too high for it to be followed in practice (see Claim 3).

Step 2: The goal-aligned model preserves its goals during SLT (with some help from us)

This step of the plan relies on the goal-aligned model both wanting to preserve its goals during an SLT and being capable of preserving its goals. The general argument for a model wanting to preserve its own goals is that it has a convergent instrumental incentive to ensure that future versions of itself act in ways that satisfy its current preferences.

One argument for a pre-SLT system being capable of preserving its goals is by analogy to human values being preserved when humans went through the capability phase transition of developing language and writing. However, it is not clear how much evidence the human analogy provides for the ability to preserve goals through a phase transition, since there are other possible explanations for human values being preserved (e.g. that values like “relationships” and “social status” continued to be selected for during the capability transition) that don’t rely on humans being able to deliberately preserve their values.

A key question is whether the model could reflect on its goals without already having generalized capabilities. This is related to Claim 1a in our previous post: that the model’s capabilities will all generalize at the same time. If this claim holds, in the sense that reflection on goals is part of this set of capabilities that all generalize simultaneously, then the ability to reflect on goals would only arise during the SLT and so this plan doesn’t work. However, it seems more likely that some form of goal-reflection will arise before generalized planning ability, because making predictions about how your goals could change is easier than achieving them.

  • Humans are a proof of concept for this: many people can intuitively predict how some experiences can change their preferences (e.g. “going to Vegas might make me a gambler”) without being generally good at planning or world modeling (which require abstract thinking).

  • Large language models may be an example of this as well, since they have some capacity to reflect on goals (if prompted accordingly) without generalized planning ability.

If the model can predict that an SLT is likely to modify its goals in an undesirable way, we expect that it would choose not to undergo an SLT in the first place (e.g. by avoiding new internal algorithms or ways of thinking that could lead to an SLT).

Paul Christiano outlined some specific ways for a goal-aligned model to preserve its goals during SLT depending on the mechanism for SLT:

“Aligning the internal search [or natural selection inside the model] seems very similar to aligning SGD on the outside. [...] Because the search is on the inside, we can’t directly apply our alignment insights to align it. Instead we need to [use ELK to] ensure that SGD learns to align the search.”

“If our model is selecting cognitive actions, or designing new algorithms, then our core hope is that an aligned model will try to think in an aligned way. So if we’ve been succeeding at alignment so far then the model will be trying to stay aligned.”

“One way this can go wrong is if our model wants to stay aligned but fails, e.g. because it identifies new techniques for thinking that themselves pose new alignment difficulties (just as we desire human flourishing but may instead implement AI systems that want paperclips). [...] If you’ve succeeded at alignment so far, then your AI will also consider this a problem and will be trying to solve it. I think we should relate to our AI, discovering new ways to think that might pose new alignment difficulties, in the same way that we relate to future humans who may encounter alignment difficulties. The AI may solve the problem, or may implement policy solutions, or etc., and our role is to set them up for success just like we are trying to set up future humans for success.”

We also consider how important it would be for the goal-preservation process to go exactly right. If the SLT produces a strongly goal-directed model that is an expected utility maximizer, then the process has to land in the small set of utility functions that are human-compatible when maximized. However, it is not clear whether an SLT would produce a utility maximizer. Returning to the example of humans undergoing an SLT, getting better at planning and world modeling made us more goal-directed but not maximally so (even with our advanced concepts and planning, we still have lots of inconsistent preferences and other limitations). It seems plausible that coherence is really hard and that an SLT would not produce a completely coherent system.
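
To make “inconsistent preferences” concrete, here is a minimal sketch of one simple kind of incoherence: a cycle of strict preferences. If an agent strictly prefers a to b, b to c, and c to a, then no utility function can assign numbers that respect all three preferences, so the agent cannot be described as maximizing a utility over those options. The function name `find_preference_cycle` and the toy options are hypothetical, and this is only one narrow notion of coherence, not a test for being an expected utility maximizer.

```python
def find_preference_cycle(prefers):
    """Return a list of options forming a strict-preference cycle, or None.

    `prefers` maps each option to the set of options it is strictly preferred to.
    A cycle means no utility function can respect every stated preference.
    """
    options = set(prefers) | {w for worse in prefers.values() for w in worse}
    state = {option: "unvisited" for option in options}
    path = []  # options on the current depth-first search path

    def visit(option):
        state[option] = "in_progress"
        path.append(option)
        for worse in sorted(prefers.get(option, ())):
            if state[worse] == "in_progress":
                # We came back to an option already on the path: a cycle.
                return path[path.index(worse):] + [worse]
            if state[worse] == "unvisited":
                cycle = visit(worse)
                if cycle:
                    return cycle
        path.pop()
        state[option] = "done"
        return None

    for option in sorted(options):
        if state[option] == "unvisited":
            cycle = visit(option)
            if cycle:
                return cycle
    return None


coherent = {"a": {"b", "c"}, "b": {"c"}}          # a > b > c
incoherent = {"a": {"b"}, "b": {"c"}, "c": {"a"}}  # a > b > c > a

print(find_preference_cycle(coherent))    # None
print(find_preference_cycle(incoherent))  # ['a', 'b', 'c', 'a']
```

A system in the weak sense of goal-directedness used in this post can contain such cycles, while a utility maximizer cannot.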

Some ways a goal-aligned model could fail to preserve its goals:

  • The model is bad at introspection and does not detect subtle changes to its goals.

  • The model is bad at planning and doesn’t foresee how some modifications will eventually lead to changes in its goals. A special case of this is if the model becomes an expected utility maximizer during the SLT and the goal-preservation process does not produce a utility function that is safe to maximize.

  • Effectively preserving beneficial goals might require the model to have a deep understanding of what humans want (rather than just a good enough proxy for that), which the model would likely not have before the capability transition.

Some ways that humans could fail to help the model to preserve its goals:

  • We can’t reliably tell whether the model learned to align the internal search.

  • We can’t communicate with the model well enough to help it overcome internal alignment difficulties.

Takeaways

The above is our current model of the most promising plan for managing an SLT and how it could fail. The overall takeaways are:

  • To evaluate how well any given alignment proposal addresses the SLT threat model, it’s useful to consider whether it aims to align a model before or during SLT.

  • It seems more promising to look for a goal-aligned model before SLT and help it stay aligned, rather than trying to align a possibly misaligned model during SLT.

The core reasons to be skeptical of this plan are:

  • (Step 1) Goal-aligned models are hard to find (and we may not develop good enough alignment techniques to find one).

  • (Step 2) Generalized planning may be required for goal-reflection (in which case goal-reflection would only arise during an SLT).

  • (Step 2) We may not be able to tell what’s going on inside the model to help it stay aligned.

If we missed any important components of this plan or ways it could fail, please let us know!