The final section of the sequence turns to an actual implementation of AUP, and deals with the ways in which the implementation deviates from the conceptual version of AUP. We measure power by considering a set of auxiliary rewards, treating the change in attainable utilities over this auxiliary set as impact, and penalizing the agent for it. The first post presents some empirical results, many of which <@we’ve covered before@>(@Penalizing Impact via Attainable Utility Preservation@), but I wanted to note the new results on SafeLife. In the high-dimensional world of SafeLife, the authors train a VAE to find a good latent representation, and choose a single linear reward function on the latent representation as their auxiliary reward function: it turns out this is enough to avoid side effects in at least some cases of SafeLife.
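To make the penalty concrete, here is a minimal sketch of how I picture it, not the authors' code. It assumes we already have, for each auxiliary reward, an estimate of the attainable utility after the candidate action and after some baseline (I use a no-op baseline here, which is my assumption). The names `aup_penalty`, `shaped_reward`, and the weight `lam` are hypothetical.

```python
from typing import Sequence


def aup_penalty(q_action: Sequence[float], q_baseline: Sequence[float]) -> float:
    """Mean absolute change in attainable utility over the auxiliary set.

    q_action[i]   : estimated attainable utility for auxiliary reward i
                    after taking the candidate action (assumed given).
    q_baseline[i] : the same estimate after the baseline, here assumed
                    to be a no-op.
    """
    if len(q_action) != len(q_baseline) or len(q_action) == 0:
        raise ValueError("need one value per auxiliary reward, and at least one")
    total = sum(abs(a - b) for a, b in zip(q_action, q_baseline))
    return total / len(q_action)


def shaped_reward(primary_reward: float,
                  q_action: Sequence[float],
                  q_baseline: Sequence[float],
                  lam: float = 0.1) -> float:
    """Primary reward minus the penalty, scaled by a hypothetical weight lam."""
    return primary_reward - lam * aup_penalty(q_action, q_baseline)
```

With a single auxiliary reward, as in the SafeLife experiments described above, the two sequences would each have length one.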
We then look at some improvements that can be made to the original AUP implementation. First, according to CCC, we only need to penalize _power_, not _impact_: as a result we can just penalize _increases_ in attainable utilities, rather than both increases and decreases as in the original version. Second, the auxiliary set of rewards only provides a _proxy_ for impact / power, which an optimal agent could game (for example, by creating subagents, summarized below). So instead, we can penalize increases in attainable utility for the _primary_ goal, rather than using auxiliary rewards. There are some other improvements that I won’t go into here.
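As a rough illustration of the first two changes (again a sketch under my own assumptions, not the implementation from the posts): the penalty below only counts _increases_ in attainable utility relative to a baseline, and it can be applied either to an auxiliary set or to a single value for the primary goal. The function name and the no-op baseline are hypothetical.

```python
from typing import Sequence


def increase_only_penalty(q_action: Sequence[float],
                          q_baseline: Sequence[float]) -> float:
    """Mean increase in attainable utility; decreases contribute zero.

    Both arguments hold estimated attainable utilities, one entry per
    reward function being tracked. The baseline is assumed to be a no-op.
    """
    if len(q_action) != len(q_baseline) or len(q_action) == 0:
        raise ValueError("need one value per reward function, and at least one")
    gains = (max(a - b, 0.0) for a, b in zip(q_action, q_baseline))
    return sum(gains) / len(q_action)


# Auxiliary-set version: only the second reward's utility went up.
aux_penalty = increase_only_penalty([1.0, 3.0], [2.0, 1.0])

# Primary-goal version: pass a single attainable-utility estimate
# for the primary reward instead of an auxiliary set.
primary_penalty = increase_only_penalty([5.0], [2.0])
```

The point of the second call is that no auxiliary set appears at all, so there is no proxy left for the agent to game.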
I think the plan “ensure that the AI systems we build don’t seek power” is pretty reasonable and plausibly will be an important part of AI alignment. However, the implementation of AUP is trying to do this under the threat model of optimal agents with potentially unaligned primary goals. I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent’s beliefs _change_, which doesn’t happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining “dumber” beliefs that power is measured relative to (but this fails to leverage the AI system’s understanding of the world). See this comment for more details.
Note that the author himself is more excited about AUP as deconfusion, rather than as a solution to AI alignment, though he is more optimistic about the implementation of AUP than I am.
I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent’s beliefs change, which doesn’t happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining “dumber” beliefs that power is measured relative to (but this fails to leverage the AI system’s understanding of the world).
Summary for the Alignment Newsletter:
For the benefit of future readers, I replied to this in the newsletter’s comments.