The case for a negative alignment tax

TL;DR:

Alignment researchers have historically predicted that building safe advanced AI would necessarily incur a significant alignment tax compared to an equally capable but unaligned counterfactual AI.
We put forward a case here that this prediction looks increasingly unlikely given the current ‘state of the board,’ as well as some possibilities for updating alignment strategies accordingly.
Introduction
We recently found that over one hundred grant-funded alignment researchers generally disagree with statements like:
alignment research that has some probability of also advancing capabilities should not be done (~70% somewhat or strongly disagreed)
advancing AI capabilities and doing alignment research are mutually exclusive goals (~65% somewhat or strongly disagreed)
Notably, this sample also predicted that the distribution would be significantly more skewed in the ‘hostile-to-capabilities’ direction.
See ground truth vs. predicted distributions for these statements
These results—as well as recent events and related discussions—caused us to think more about our views on the relationship between capabilities and alignment work given the ‘current state of the board,’[1] which ultimately became the content of this post. Though we expect some to disagree with these takes, we have been pleasantly surprised by the positive feedback we’ve received from discussing these ideas in person and are excited to further stress-test them here.
Is a negative alignment tax plausible (or desirable)?
Often, capabilities and alignment are framed with reference to the alignment tax, defined as ‘the extra cost [practical, developmental, research, etc.] of ensuring that an AI system is aligned, relative to the cost of building an unaligned alternative.’
The AF/LW wiki entry on alignment taxes notably includes the following claim:
The best case scenario is No Tax: This means we lose no performance by aligning the system, so there is no reason to deploy an AI that is not aligned, i.e., we might as well align it.
The worst case scenario is Max Tax: This means that we lose all performance by aligning the system, so alignment is functionally impossible.
We speculate in this post about a different best case scenario: a negative alignment tax—namely, a state of affairs where an AI system is actually rendered more competent/performant/capable by virtue of its alignment properties.
Why would this be even better than ‘No Tax?’ Given the clear existence of a trillion-dollar attractor state towards ever-more-powerful AI, we suspect that the most pragmatic and desirable outcome would involve humanity finding a path forward that both (1) eventually satisfies the constraints of this attractor (i.e., is in fact highly capable, gets us AGI, etc.) and (2) does not pose existential risk to humanity.
Ignoring the inevitability of (1) seems practically unrealistic as an action plan at this point—and ignoring (2) could be collectively suicidal.
Therefore, if the safety properties of such a system were also explicitly contributing to what renders it capable—thereby functionally steering us away from possible futures where we build systems that are capable but unsafe—then these ‘negative alignment tax’ properties seem more like a feature than a bug.
It is also worth noting as an empirical datapoint that virtually all frontier models’ alignment properties have rendered them more rather than less capable (e.g., gpt-4 is far more useful and far more aligned than gpt-4-base), which is the opposite of what the ‘alignment tax’ model would have predicted.
This idea is somewhat reminiscent of differential technological development, in which Bostrom suggests “[slowing] the development of dangerous and harmful technologies, especially ones that raise the level of existential risk; and accelerating the development of beneficial technologies, especially those that reduce the existential risks posed by nature or by other technologies.” If alignment techniques were developed that could positively ‘accelerate the development of beneficial technologies’ rather than act as a functional ‘tax’ on them, we think that this would be a good thing on balance.
Of course, we certainly still do not think it is wise to plow ahead with capabilities work given the current practical absence of robust ‘negative alignment tax’ techniques—and we recognize that safetywashing capabilities gains without any true alignment benefit is a real and important ongoing concern.
However, we do think that if such alignment techniques were discovered—techniques that simultaneously steered models away from dangerous behavior while also rendering them more generally capable in the process—this would probably be preferable in the status quo to alignment techniques that steered models away from dangerous behavior with no effect on capabilities (i.e., techniques with no alignment tax), given the fairly-obviously-inescapable strength of the more-capable-AI attractor state.
In the limit (what might be considered the ‘best imaginable case’), we might imagine researchers discovering an alignment technique that (A) is guaranteed to eliminate x-risk and (B) improves capabilities so clearly that it becomes competitively necessary for anyone attempting to build AGI.
Early examples of negative alignment taxes
We want to emphasize that the examples we provide here are almost certainly not the best possible examples of negative alignment taxes—but they at least provide some basic proof of concept that there already exist alignment properties that can actually bolster capabilities, if only weakly compared to the (potentially ideal) limit case.
The elephant in the lab: RLHF
RLHF is clearly not a perfect alignment technique, and it probably won’t scale. However, the fact that it both (1) clearly renders frontier[2] models’ outputs less toxic and dangerous and (2) has also been widely adopted given the substantial associated improvements in task performance and naturalistic conversation ability seems to us like a clear (albeit nascent) example of a ‘negative alignment tax’ in action.
It also serves as a practical example of point (B) above: the key labs pushing the envelope on capabilities have all embraced RLHF to some degree, likely not out of a heartfelt concern for AI x-risk, but rather because doing so is actively competitively necessary in the status quo.

Consider the following from Anthropic’s 2022 Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback:
Our alignment interventions actually enhance the capabilities of large models, and can easily be combined with training for specialized skills (such as coding or summarization) without any degradation in alignment or performance. Models with less than about 10B parameters behave differently, paying an ‘alignment tax’ on their capabilities. This provides an example where models near the state-of-the-art may have been necessary to derive the right lessons from alignment research.
The overall picture we seem to find – that large models can learn a wide variety of skills, including alignment, in a mutually compatible way – does not seem very surprising. Behaving in an aligned fashion is just another capability, and many works have shown that larger models are more capable [Kaplan et al., 2020, Rosenfeld et al., 2019, Brown et al., 2020], finetune with greater sample efficiency [Henighan et al., 2020, Askell et al., 2021], and do not suffer significantly from forgetting [Ramasesh et al., 2022].
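For readers less familiar with what RLHF involves mechanically, here is a minimal, purely illustrative sketch (in Python/PyTorch, with random stand-in data and a hypothetical ‘RewardModel’ of our own naming) of the preference-based reward-modeling step that RLHF builds on. It is not any lab’s actual pipeline, and the subsequent policy-optimization stage is only gestured at in the closing comment.

```python
# A minimal, self-contained sketch of the preference-based reward modeling step
# that underlies RLHF. Everything here is illustrative: the "features" stand in
# for a language model's representation of a response, and the data is random.

import torch
import torch.nn as nn

torch.manual_seed(0)

class RewardModel(nn.Module):
    """Scores a response representation with a single scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

dim = 16
reward_model = RewardModel(dim)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Hypothetical preference data: each pair holds the representation of the
# response a human preferred and the representation of the one they rejected.
chosen = torch.randn(256, dim) + 0.5    # toy stand-in for "preferred" responses
rejected = torch.randn(256, dim) - 0.5  # toy stand-in for "rejected" responses

for step in range(200):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Bradley-Terry / pairwise logistic loss: push the preferred response's
    # reward above the rejected one's.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final pairwise loss: {loss.item():.3f}")
# In a full RLHF pipeline this learned reward would then be maximized by the
# policy (e.g., with PPO), typically alongside a KL penalty toward the base model.
```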
Cooperative/prosocial AI systems
We suspect there may be core prosocial algorithms (already running in human brains[3]) that, if implemented into AI systems in the right ways, would also exhibit a negative alignment tax. To the degree that humans actively prefer to interface with AI that they can trust and cooperate with, embedding prosocial algorithms into AI could confer both new capabilities and favorable alignment properties.
The operative cluster of examples we are personally most excited about—things like attention schema theory, theory of mind, empathy, and self-other overlap—all basically relate to figuring out how to robustly integrate the utility of other relevant agents into an agent’s own utility function(s). If the right subset of these algorithms could be successfully integrated into an agentic AI system—causing it to effectively and automatically predict and reason about the effects of its decisions on other agents—we would expect that, by default, it would not want to kill everyone,[4] and that this might even scale to superintelligence given orthogonality.
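To make this more concrete, here is a deliberately toy sketch (in Python, with made-up numbers and a hypothetical weighting parameter ‘lam’ of our own choosing) of what ‘integrating the utility of other relevant agents into an agent’s own utility function’ could look like in its simplest form; actual proposals in this cluster, such as self-other overlap, operate on learned internal representations rather than hand-written utility tables.

```python
# A toy formalization of "integrating other agents' utilities into an agent's
# own utility function." Agents, actions, and the weight `lam` are hypothetical;
# this only illustrates the shape of the idea, not a concrete proposal.

from typing import Dict, List

def combined_utility(
    action: str,
    self_utility: Dict[str, float],
    others_utilities: List[Dict[str, float]],
    lam: float = 1.0,  # how strongly the agent weighs other agents' welfare
) -> float:
    """Self utility plus `lam` times the average utility of the other agents."""
    others_avg = sum(u[action] for u in others_utilities) / len(others_utilities)
    return self_utility[action] + lam * others_avg

def choose_action(
    actions: List[str],
    self_utility: Dict[str, float],
    others_utilities: List[Dict[str, float]],
    lam: float,
) -> str:
    """Pick the action that maximizes the combined utility."""
    return max(actions, key=lambda a: combined_utility(a, self_utility, others_utilities, lam))

# Toy numbers: an action that benefits the agent but harms everyone else loses
# out once others' utilities are actually weighed in.
actions = ["seize_resources", "cooperate"]
self_utility = {"seize_resources": 10.0, "cooperate": 6.0}
others = [
    {"seize_resources": -20.0, "cooperate": 5.0},
    {"seize_resources": -15.0, "cooperate": 4.0},
]

print(choose_action(actions, self_utility, others, lam=1.0))  # -> "cooperate"
print(choose_action(actions, self_utility, others, lam=0.0))  # -> "seize_resources"
```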
In a world where the negative alignment tax model is correct, prosocial algorithms could also potentially avoid value lock-in by enabling models to continue reasoning about and updating their own values in the ‘right’ direction long after we humans are capable of evaluating this ourselves. Given that leading models’ capabilities now seem to scale almost linearly with compute along two dimensions—not only during training but also during inference—getting this right may be fairly urgent.
There are some indications that LLMs already exhibit theory-of-mind-like abilities, albeit more implicitly than what we are imagining here. We suspect that discovering architectures that implement these sorts of prosocial algorithms in the right ways would represent both a capabilities gain and tangible alignment progress.
As an aside from our main argument here, we currently feel more excited about systems whose core functionality is inherently aligned/alignable (reasonable examples include: prosocial AI, safeguarded AI, agent foundations) as compared to corrigibility-style approaches that seemingly aim to optimize more for oversight, intervention, and control (as a proxy for alignment) rather than for ‘alignedness’ directly. In the face of sharp left turns or inevitable jailbreaking,[5] it is plausible that the safest long-term solution might look something like an AI whose architecture explicitly and inextricably encodes acting in light of the utility of other agents, rather than merely ‘bolting on’ security or containment measures to an ambiguously-aligned system.
The question isn’t will your security be breached? but when? and how bad will it be?
-Bruce Schneier
Process supervision and other LLM-based interventions
OpenAI’s recent release of the o1 model series serves as the strongest evidence to date that rewarding each step of an LLM’s chain of thought rather than only the final outcome improves both capabilities and alignment in multi-step problems, notably including 4x improved safety performance on the challenging Goodness@0.1 StrongREJECT jailbreak eval. This same finding was also reported in earlier, more constrained task settings.
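As a purely schematic illustration of the difference between outcome and process supervision (with hand-labeled toy verdicts standing in for a learned process reward model; this is not a description of OpenAI’s actual training setup):

```python
# An illustrative contrast between outcome supervision (reward only the final
# answer) and process supervision (reward every step of the reasoning chain).
# The per-step verdicts below are hand-labeled stand-ins for what a learned
# process reward model would produce.

from typing import List

def outcome_reward(final_answer_correct: bool) -> float:
    """Outcome supervision: one sparse signal for the whole chain."""
    return 1.0 if final_answer_correct else 0.0

def process_reward(step_verdicts: List[bool]) -> float:
    """Process supervision: score each intermediate step, then aggregate."""
    return sum(1.0 for ok in step_verdicts if ok) / len(step_verdicts)

# Toy chain of thought for 17 * 24 with two errors that happen to cancel out,
# so the final answer (408) is right even though the reasoning is not.
chain = [
    ("17 * 24 = 17 * 20 + 17 * 4", True),
    ("17 * 20 = 340",              True),
    ("17 * 4 = 58",                False),  # should be 68
    ("340 + 58 = 408",             False),  # 340 + 58 is 398, not 408
]
verdicts = [ok for _, ok in chain]

print(outcome_reward(final_answer_correct=True))  # 1.0: flawed reasoning fully rewarded
print(process_reward(verdicts))                   # 0.5: flawed steps penalized directly
```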
We suspect there are other similar interventions for LLMs that would constitute good news for both alignment and capabilities (e.g., Paul Christiano’s take that LLM agents might be net-good for alignment; the finding that more persuasive LLM debaters enable non-expert models to identify truth better; and so on).
Concluding thoughts
Ultimately, the fact that a significant majority of the alignment researchers we surveyed don’t think capabilities and alignment are mutually exclusive indicates to us that the nature of the relationship between these two domains is itself a neglected area of research and discussion.
While there are certainly good reasons to be concerned about any capabilities improvements whatsoever in the name of safety, we think there are also good reasons to be concerned that capabilities taboos in the name of safety may backfire in actually navigating towards a future in which AGI is aligned.
While we argue for the possibility of a negative alignment tax, it’s important to note that this doesn’t eliminate all tradeoffs between performance and alignment. Even in systems benefiting from alignment-driven capabilities improvements, there may still be decisions that pit marginal gains in performance against marginal gains in alignment (see ‘Visualizing a Negative Alignment Tax’ plot above).
However, what we’re proposing is that certain alignment techniques can shift the entire tradeoff curve, resulting in systems that are both more capable and more aligned than their unaligned counterparts. This view implies that rather than viewing alignment as a pure cost to be minimized, we should seek out techniques that fundamentally improve the baseline performance-alignment tradeoff.
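As a purely illustrative picture of what ‘shifting the entire tradeoff curve’ means, here is a short plotting sketch with entirely hypothetical numbers (it is not the ‘Visualizing a Negative Alignment Tax’ plot referenced above, just a toy rendering of the same idea):

```python
# Illustrative only: moving along a fixed capability-alignment frontier versus
# an alignment technique that shifts the whole frontier outward ("negative
# alignment tax"). All numbers are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

alignment = np.linspace(0.0, 1.0, 100)

# Baseline frontier: more alignment can only be bought by giving up capability.
baseline_capability = 1.0 - 0.4 * alignment

# Hypothetical shifted frontier: the technique raises capability at every level
# of alignment, so some points dominate the unaligned baseline on both axes.
shifted_capability = 1.15 - 0.3 * alignment

plt.plot(alignment, baseline_capability, label="baseline tradeoff curve")
plt.plot(alignment, shifted_capability, label="curve shifted by alignment technique")
plt.scatter([0.0], [1.0], label="unaligned baseline system")
plt.xlabel("alignment")
plt.ylabel("capability")
plt.legend()
plt.title("Illustrative (hypothetical) negative alignment tax")
plt.show()
```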
While the notion of a negative alignment tax is fairly speculative and optimistic, we think the theoretical case for it being a desirable and pragmatic outcome is straightforward given the current state of the board.[6] Whether we like it or not, humanity’s current-and-seemingly-highly-stable incentive structures have us hurtling towards ever-more-capable AI without any corresponding guarantees regarding safety. We think that an underrated general strategy for contending with this reality is for researchers—and alignment startup founders—to further explore neglected alignment approaches with negative alignment taxes.
[1] Monte Carlo Tree Search is a surprisingly powerful decision-making algorithm that teaches us an important lesson about the relationship between plans and the current state of the board: recompute often. Just as MCTS builds a tree of possibilities, simulating many ‘playouts’ from the current state to estimate the value of each move, and then focuses its search on the most promising branches before selecting a move and effectively starting anew from the resulting state, so too might we adopt a similar attitude for alignment research. It does not make much sense to attempt to navigate towards aligned AGI by leaning heavily on conceptual frameworks (‘past tree expansions’) generated in an earlier board state that is now increasingly unrecognizable—given the exponential increase in resources and attention being deployed towards advancing AI capabilities and the obvious advances that have already accompanied this investment. Alignment plans that do not meaningfully contend with this reality might be considered ‘outdated branches’ in the sense described above.
[2] In general, there is some evidence that RLHF seems to work better on larger and more sophisticated models, though it is unclear to what extent this trend can be extrapolated.
[3] “I’ve been surprised, in the past, by how many people vehemently resist the idea that they might not actually be selfish, deep down. I’ve seen some people do some incredible contortions in attempts to convince themselves that their ability to care about others is actually completely selfish. (Because iterated game theory says that if you’re in a repeated game it pays to be nice, you see!) These people seem to resist the idea that they could have selfless values on general principles, and consistently struggle to come up with selfish explanations for their altruistic behavior.”—Nate Soares, Replacing Guilt.
[4] To the same degree, at least, that we would not expect other people—i.e., intelligent entities with brains running core prosocial algorithms (mostly)—to want to kill everyone.

[5] …or overdependence on the seemingly-innocuous software of questionably competent actors.
[6] If the current state of the board changes, this may also change. It is important to be sensitive to how negative alignment taxes may become more or less feasible/generalizable over time.