In <@part 1@>(@Reframing Impact—Part 1@) of this sequence, we saw that an event is _impactful_ if it _changes our ability to get what we want_. This part applies that understanding to AI alignment.
In the real world, there are many events that cause _objective_ negative impacts: they reduce your ability to pursue nearly any goal. An asteroid impact that destroys the Earth is going to be pretty bad for you, whether you want to promote human flourishing or to make paperclips. Conversely, there are many plans that produce objective positive impacts: for many potential goals, it’s probably a good idea to earn a bunch of money, or to learn a lot about the world, or to command a perfectly loyal army. These effects are exacerbated when the environment contains multiple agents: for goals that benefit from having more resources, it is objectively bad for you if a different agent seizes your resources, and objectively good for you if you seize other agents’ resources.
Based on this intuitive (but certainly not ironclad) argument, we get the **Catastrophic Convergence Conjecture (CCC)**: “Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives”.
Let’s now consider a _conceptual_ version of <@Attainable Utility Preservation (AUP)@>(@Towards a New Impact Measure@): the agent optimizes a primary (possibly unaligned) goal, but is penalized for changing its “power” (in the intuitive sense). Intuitively, such an agent no longer has power-seeking incentives, and so (by the contrapositive of the CCC) it will not have a catastrophe-inducing optimal policy—exactly what we want! This conceptual version of AUP also avoids thorny problems such as ontology identification and butterfly effects, because the agent need only reason about its own beliefs, rather than having to reason directly about the external world.
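To make the conceptual version more concrete, here is a minimal sketch of an AUP-style penalized reward: the primary reward minus a scaled penalty on how much an action changes the agent's attainable utility for a set of auxiliary goals, relative to doing nothing. This is only an illustration under assumed inputs; the names (`primary_reward`, `q_aux`, `noop`, `lam`) are placeholders, not from the post, and the actual AUP formulation differs in details.

```python
def aup_reward(s, a, primary_reward, q_aux, noop, lam=0.1):
    """Primary reward minus a penalty for changing attainable utility.

    q_aux is a list of Q-value functions, one per auxiliary goal. The
    penalty compares how well the agent could pursue each auxiliary goal
    after taking action `a` versus after the no-op `noop`, approximating
    the intuitive notion of a change in "power".
    """
    penalty = sum(abs(q(s, a) - q(s, noop)) for q in q_aux) / len(q_aux)
    return primary_reward(s, a) - lam * penalty
```

Here `lam` trades off pursuit of the primary goal against avoiding changes in power: a larger value makes the agent more conservative.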
Opinion:
This was my favorite part of the sequence, as it explains the conceptual case for AUP clearly and concisely. I especially liked the CCC: I believe that we should be primarily aiming to prevent an AI system from “intentionally” causing catastrophe, while not attempting to guarantee an absence of “accidental” mistakes (<@1@>(@Clarifying “AI Alignment”@), <@2@>(@Techniques for optimizing worst-case performance@)), and the CCC is one way of cashing out this intuition. It’s a crisper version of the idea that convergent instrumental subgoals are in some sense the “source” of AI accident risk, and that if we can avoid instrumental subgoals, we will probably have solved AI safety.