We have <@previously seen@>(@Seeking Power is Provably Instrumentally Convergent in MDPs@) that if you are given an optimal policy for some reward function, but are very uncertain about that reward function (specifically, your belief assigns reward to states in an iid manner), then in some but not all situations you should expect the optimal policy to navigate towards states with higher power. This post generalizes that result to non-iid reward distributions: specifically, in particular circumstances, optimal policies for “at least half” of reward distributions will seek power.
The new results depend on the notion of _environment symmetries_, which arise in states where an action a2 leads to “more options” than another action a1 (we’ll assume that a1 and a2 lead to disjoint parts of the state space). Specifically, a1 leads to a part of the state space that is isomorphic to a subgraph of the part of the state space that a2 leads to. For example, a1 might be going to a local store where you can buy books or video games, and a2 might be going to a supermarket where you can buy food, plants, cleaning supplies, tools, etc. Then, one subgraph isomorphism would map “local store” to “supermarket”, “books” to “food”, and “video games” to “plants”. Another such isomorphism would instead map “video games” to “tools”, while keeping the rest the same.
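To make this concrete, here is a minimal sketch (the state names, the adjacency dicts, and the brute-force `subgraph_embeddings` helper are my own illustration, not anything from the post) that enumerates the subgraph isomorphisms from the part of the state space reachable via a1 into the part reachable via a2 in the store/supermarket example:

```python
from itertools import permutations

# Toy "options" graphs: what is reachable after a1 (local store) and after a2 (supermarket).
after_a1 = {
    "store": {"books", "video games"},
    "books": set(),
    "video games": set(),
}
after_a2 = {
    "supermarket": {"food", "plants", "cleaning supplies", "tools"},
    "food": set(),
    "plants": set(),
    "cleaning supplies": set(),
    "tools": set(),
}

def subgraph_embeddings(small, big):
    """Yield injective maps phi: small -> big that preserve every edge of `small`."""
    small_nodes, big_nodes = list(small), list(big)
    for image in permutations(big_nodes, len(small_nodes)):
        phi = dict(zip(small_nodes, image))
        if all(phi[v] in big[phi[u]] for u in small for v in small[u]):
            yield phi

embeddings = list(subgraph_embeddings(after_a1, after_a2))
print(len(embeddings), "embeddings, e.g.", embeddings[0])
```

Running this finds 12 embeddings; the first one printed is the “books to food, video games to plants” mapping described above.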
Now, this alone doesn’t mean that an optimal policy will definitely take a2. Maybe you really want to buy books, so a1 is the optimal choice! But for every reward function for which a1 is optimal, we can construct another reward function for which a2 is optimal, by mapping it through the isomorphism. So, if your original reward function highly valued books, the construction yields a new reward function that highly values food, under which a2 is now optimal. Thus, at least half of the possible reward functions (or distributions over reward functions) prefer a2 over a1, and so in cases where these isomorphisms exist, optimal policies will tend to seek more options (which in turn means they are seeking power).
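As a rough sanity check of this counting argument (not a statement of the actual theorem), the sketch below pushes a book-loving reward through one of the embeddings found above, and then estimates, under the simplifying assumption that a farsighted agent just moves to its favourite reachable state and stays there, how often random reward functions prefer a2. The `pushforward` helper, the reward values, and the iid uniform draws are my own choices:

```python
import random

# Continuing the toy example above: terminal "options" reachable after each action.
options_a1 = ["books", "video games"]
options_a2 = ["food", "plants", "cleaning supplies", "tools"]
phi = {"books": "food", "video games": "plants"}  # one embedding from the sketch above

def pushforward(reward, phi):
    """Build a new reward that values phi(s) the way `reward` valued s
    (states outside the image of phi are simply omitted)."""
    return {phi[s]: r for s, r in reward.items()}

book_lover = {"books": 1.0, "video games": 0.2}
print(pushforward(book_lover, phi))  # {'food': 1.0, 'plants': 0.2}: now food is what's valued

# Crude counting check: assume a farsighted agent just moves to its favourite
# reachable state and stays there. How often do iid uniform rewards prefer a2?
states = options_a1 + options_a2
trials, prefers_a2 = 100_000, 0
for _ in range(trials):
    reward = {s: random.random() for s in states}
    if max(reward[s] for s in options_a2) > max(reward[s] for s in options_a1):
        prefers_a2 += 1
print(prefers_a2 / trials)  # ~0.67 with these iid draws; the post's general claim is ">= 1/2"
```

With these iid draws the fraction is about 2/3; the post’s contribution is the general “at least half” guarantee, which also covers non-iid reward distributions.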
If the agent optimizes average reward (i.e. the discount equals 1), then we can extend this analysis out in time, to the final cycle of states that the agent ends up in. (It must end up in a cycle because, by assumption, the state space is finite.) Any given cycle only counts as one “option”, so ending up in any particular cycle is not very likely (by a similar argument of constructing other reward functions). If shutdown is modeled as a state with a single self-loop and no other actions, this implies that optimal policies will tend to avoid entering the shutdown state.
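As a stylized illustration of the shutdown point (the number of cycles, the iid uniform rewards, and the code are my own choices, not the post’s setup): if under average reward only the final cycle matters, and the candidate final cycles are k single-state self-loops with one of them being shutdown, then shutdown is the best destination for only about 1/k of reward draws:

```python
import random

# Stylized check: with average reward (discount = 1), only the final cycle matters.
# Suppose there are k single-state self-loops the agent could park in forever,
# and index 0 plays the role of the shutdown state.
k, trials = 10, 100_000
shutdown_best = 0
for _ in range(trials):
    cycle_rewards = [random.random() for _ in range(k)]     # reward of each self-loop
    if max(range(k), key=lambda i: cycle_rewards[i]) == 0:  # shutdown has the best loop
        shutdown_best += 1
print(shutdown_best / trials)  # ~1/k = 0.1: optimal policies rarely end in shutdown
```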
We’ve been saying “we can construct this other reward function under which the power-seeking action is optimal”. An important caveat is that maybe we know that this other reward function is very unlikely. For example, maybe we really do just know that we’re going to like books and not care much about food, and so the argument “well, we can map the book-loving reward to a food-loving reward” isn’t that interesting, because we assign high probability to the first and low probability to the second. We can’t rule this out for what humans actually do in practice, but it isn’t as simple as “a simplicity prior would do the right thing”—for any non-power-seeking reward function, we can create a power-seeking reward function with only slightly higher complexity by having a program that searches for a subgraph isomorphism and then applies it to the non-power-seeking reward function to create a power-seeking version.
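The “only slightly higher complexity” point can be made concrete with a sketch like the following (it reuses the hypothetical `subgraph_embeddings` helper from the earlier sketch, and `power_seeking_variant` is a name I made up): the transformation is a short, fixed program wrapped around the original reward function, so informally it adds roughly a constant to the description length.

```python
def power_seeking_variant(reward, after_a1, after_a2):
    """Take any reward function over the a1-side states, search for a subgraph
    embedding phi of the a1 part into the a2 part, and push the reward through it.
    The search and relabeling are a short fixed program, so (informally) the
    resulting power-seeking reward is only slightly more complex than the original."""
    phi = next(subgraph_embeddings(after_a1, after_a2))  # helper sketched earlier
    return {phi.get(s, s): r for s, r in reward.items()}
```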
Another major caveat is that this all relies on the existence of these isomorphisms / symmetries in the environment. It is still a matter of debate whether good models of the environment will exhibit such isomorphisms.
> Planned summary for the Alignment Newsletter:
I’d change this to “optimizes average reward (i.e. the discount equals 1)”. Otherwise looks good!
Done :)
> (we’re going to ignore cases where a1 or a2 is a self-loop)

I think that a more general class of things should be ignored here. For example, if a2 is part of a 2-cycle, we get the same problem as when a2 is a self-loop: most reward functions then have optimal policies that take the action a1 over a2 (when the discount rate is sufficiently close to 1), which contradicts the claim being made.
Thanks for the correction.
That one in particular isn’t a counterexample as stated, because you can’t construct a subgraph isomorphism for it. When writing this I thought that actually meant I didn’t need more of a caveat (contrary to what I said to you earlier), but now thinking about it a third time I really do need the “no cycles” caveat. The counterexample is:
Z <--> S --> A --> B
With every state also having a self-loop.
In this case, the involution {S:B, Z:A} would suggest that S --> Z has more options than S --> A, but optimal policies will take S --> A more often than S --> Z.
(The theorem effectively says “no cycles” by conditioning on the policy being S --> Z or S --> A, in which case the S --> Z --> S --> S --> … possibility is not actually possible, and the involution doesn’t actually go through.)
EDIT: I’ve changed to say that the actions lead to disjoint parts of the state space.
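This counterexample is easy to check numerically. Here is a minimal value-iteration sketch (the discount of 0.99, the iid uniform rewards, and the code are my own choices, not anything from the thread) showing that optimal policies pick S --> A roughly twice as often as S --> Z in this environment:

```python
import random

# Deterministic counterexample environment: Z <--> S --> A --> B, plus a self-loop
# at every state.
succ = {"Z": ["Z", "S"], "S": ["S", "Z", "A"], "A": ["A", "B"], "B": ["B"]}
gamma = 0.99

def optimal_successor_of_S(reward, iters=1000):
    V = {s: 0.0 for s in succ}
    for _ in range(iters):  # value iteration for this deterministic MDP
        V = {s: reward[s] + gamma * max(V[t] for t in succ[s]) for s in succ}
    return max(succ["S"], key=lambda t: V[t])  # where the optimal policy goes from S

counts = {"S": 0, "Z": 0, "A": 0}
for _ in range(2000):
    reward = {s: random.random() for s in succ}
    counts[optimal_successor_of_S(reward)] += 1
print(counts)  # roughly A: 1/2, S: 1/4, Z: 1/4 of trials, so S --> A beats S --> Z
```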
My take on it has been that the theorem’s bottleneck assumption implies you can’t reach S again after taking action a1 or a2, which rules out cycles.
Yeah actually that works too
Probably not an important point, but I don’t see why we can’t use the identity isomorphism (over the part of the state space that a1 leads to).
This particular argument is not talking about farsightedness: when we talk about having more options, each option refers to the entire journey and its exact timesteps, rather than just the destination. Since all the “journeys” starting with the S --> Z action go to Z first, and all the “journeys” starting with the S --> A action go to A first, the isomorphism has to map A to Z and vice versa, so that ϕ(T(S, a1)) = T(S, a2).

(What assumption does this correspond to in the theorem? In the theorem, the involution has to map F_a to a subset of F_a′; every possibility in F_a1 starts with A, and every possibility in F_a2 starts with Z, so you need to map A to Z.)
I don’t understand what you mean. Nothing contradicts the claim, if the claim is made properly, because the claim is a theorem and always holds when its preconditions do. (EDIT: I think you meant Rohin’s claim in the summary?)
I’d say that we can just remove the quoted portion and instead explain “a1 and a2 lead to disjoint sets of future options”, which automatically rules out the self-loop case. (But maybe this is what you meant, ofer?)
I was referring to the claim being made in Rohin’s summary. (I no longer see counterexamples after adding the assumption that “a1 and a2 lead to disjoint sets of future options”.)