<@Previously@>(@Seeking Power is Provably Instrumentally Convergent in MDPs@) we’ve seen that if we take an MDP and have a distribution over state-based reward functions, such that the rewards for different states are drawn iid, then farsighted (i.e. no discount) optimal agents tend to seek “power”. This post relaxes some of these requirements, giving sufficient (but not necessary) criteria for determining instrumental convergence.
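As a toy illustration of the iid-reward setup (my own sketch, not the paper's formalism), consider a tiny deterministic MDP where action A leads to a single reachable state while action B leads to two. Sampling iid rewards shows B is optimal roughly two-thirds of the time, which is the flavor of "more reachable options, so more often optimal" behind the power-seeking results. The MDP structure and variable names below are all hypothetical:

```python
# Monte Carlo sketch: with iid state rewards and a farsighted agent,
# the action that can reach more states is optimal more often.
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: action A can only ever reach state 1; action B can reach states 3 or 4.
SUCCESSORS = {"A": [1], "B": [3, 4]}
n_states = 5

wins = {"A": 0, "B": 0}
for _ in range(10_000):
    r = rng.uniform(size=n_states)  # iid reward for each state
    # In this toy, a farsighted agent values an action by the best reward
    # among the states it can eventually settle in.
    values = {a: max(r[s] for s in reach) for a, reach in SUCCESSORS.items()}
    wins[max(values, key=values.get)] += 1

print(wins)  # B wins about 2/3 of the time
```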
Some of these use a new kind of argument. Suppose that action A leads you to a part of the MDP modeled by a graph G1, and action B leads you to a part of the MDP modeled by a graph G2. If there is a subgraph of G2 that is isomorphic to G1, then whatever choices the agent would have after taking action A, it would also have after taking action B, and so B is at least as likely to be optimal as A. This matches our intuitive reasoning: collecting resources is instrumentally convergent because you can still do everything you could have done without collecting resources, as well as some additional things enabled by your new resources.
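A rough sketch of how one might check this embedding condition computationally, using networkx's VF2 matcher (this is my own illustration; the post argues at the level of graph structure and doesn't prescribe an algorithm):

```python
# Check whether the options reachable after action B (graph G2) contain a copy
# of the options reachable after action A (graph G1). If so, B offers at least
# as many choices as A.
import networkx as nx
from networkx.algorithms import isomorphism

def b_offers_at_least_as_much(G1: nx.DiGraph, G2: nx.DiGraph) -> bool:
    """True if some (node-induced) subgraph of G2 is isomorphic to G1."""
    return isomorphism.DiGraphMatcher(G2, G1).subgraph_is_isomorphic()

# Toy example: after A the agent can only loop between two states; after B it
# can do that *and* reach an extra absorbing state (e.g. extra resources).
G1 = nx.DiGraph([(0, 1), (1, 0)])
G2 = nx.DiGraph([(0, 1), (1, 0), (1, 2), (2, 2)])
print(b_offers_at_least_as_much(G1, G2))  # True
```

Note that the VF2 check above uses node-induced subgraphs, which is a slightly stricter notion than the informal "B can do everything A can" embedding described in the post.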
Planned summary for the Alignment Newsletter: