Great question. One thing you could say is that an action is power-seeking compared to another if your expected power (computed in the non-dominated subgraph; see Figure 19) is greater for that action than for the other.
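Written out, that comparison might look like the following. This is only a sketch: the symbols $T$ (transition function), $s$ (current state), $a, a'$ (the two actions), and $\gamma$ (discount rate) are notation assumed here for illustration, and POWER is meant to be evaluated in the non-dominated subgraph.

```latex
% Action a is power-seeking compared to a' at state s when the expected
% POWER of the next state is greater under a than under a'.
\mathbb{E}_{s' \sim T(s,a)}\big[\mathrm{POWER}(s', \gamma)\big]
  \;>\;
\mathbb{E}_{s' \sim T(s,a')}\big[\mathrm{POWER}(s', \gamma)\big]
```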
My understanding of figure 7 of your paper indicates that cycle reachability cannot be a sufficient condition.
Shortly after Theorem 19, the paper says: “In appendix C.6.2, we extend this reasoning to k-cycles (k > 1) via theorem 53 and explain how theorem 19 correctly handles fig. 7”. In particular, see Figure 19.
The key insight is that Theorem 19 talks about how many agents end up in a set of terminal states, not how many go through a state to get there. If you have two states with disjoint reachable terminal state sets, you can reason about the phenomenon pretty easily. Practically speaking, this should often suffice: for example, the off-switch state is disjoint from everything else.
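To make the disjointness check concrete, here is a minimal sketch, not taken from the paper. It assumes a deterministic environment given as an adjacency dict and treats a terminal state as one whose only transition is a self-loop; `reachable_terminal_states` and `toy_graph` are hypothetical names made up for this illustration.

```python
from collections import deque


def reachable_terminal_states(graph, start):
    """Return the set of terminal states reachable from `start`.

    Assumption: a terminal state is one whose only transition is a self-loop.
    """
    seen = {start}
    queue = deque([start])
    terminals = set()
    while queue:
        state = queue.popleft()
        successors = graph[state]
        if set(successors) == {state}:
            terminals.add(state)
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return terminals


# Hypothetical toy environment: from "start" the agent either hits the
# off-switch or moves to a hub that can still reach two different goals.
toy_graph = {
    "start": ["off_switch", "hub"],
    "off_switch": ["off_switch"],
    "hub": ["goal_a", "goal_b"],
    "goal_a": ["goal_a"],
    "goal_b": ["goal_b"],
}

shutdown_side = reachable_terminal_states(toy_graph, "off_switch")
other_side = reachable_terminal_states(toy_graph, "hub")
are_disjoint = shutdown_side.isdisjoint(other_side)
```

In this toy graph the two successor states of `start` have disjoint reachable terminal sets, which is the easy case described above; when the sets overlap, this check alone doesn't settle anything.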
If not, you can sometimes consider the non-dominated subgraph in order to regain disjointness. This isn’t in the main part of the paper, but basically you toss out transitions which aren’t part of a trajectory which is strictly optimal for some reward function. Figure 19 gives an example of this.
The main idea, though, is that you’re reasoning about what the agent’s end goals tend to be, and then saying “it’s going to pursue some way of getting there with much higher probability, compared to this small set of terminal states (i.e., shutdown)”. Theorem 17 tells us that in the limit, cycle reachability totally controls POWER.
I think I still haven’t clearly communicated all my mental models here, but I figured I’d write a reply now while I update the paper.
Thank you for these comments, by the way. You’re pointing out important underspecifications. :)
My philosophy is that aligned/general is OK based on a shared (?) premise that,
I think one problem is that power-seeking agents are generally not that corrigible, which means outcomes are extremely sensitive to the initial specification.
The freshly updated paper answers this question in great detail; see section 6 and also appendix B.
Power is kinda weird when defined for optimal agents, as you say—when γ=1, POWER can only decrease. See Power as Easily Exploitable Opportunities for more on this.