I find myself agreeing with the idea that an agent unaware of its task will seek power, but also conclude that an agent aware of its task will give up power.
I think this is a slight misunderstanding of the theory in the paper. I’d translate the theory of the paper to English as:
If we do not know an agent’s goal, but we know that the agent knows its goal and is optimal w.r.t. it, then from our perspective the agent is more likely to go to higher-power states. (From the agent’s perspective, there is no probability: it always executes the deterministic perfect policy for its reward function.)
Any time the paper talks about “distributions” over reward functions, it’s talking from our perspective. The way the theory does this is by saying that first a reward function is drawn from the distribution, then it is given to the agent, then the agent thinks really hard, and then the agent executes the optimal policy. All of the theoretical analysis in the paper is done “before” the reward function is drawn, but there is no step where the agent is doing optimization but doesn’t know its reward.
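To make the "draw a reward, hand it to the agent, watch the optimal policy" picture concrete, here is a minimal sketch. The five-state environment, the uniform IID reward distribution, the discount of 0.9, and every name in the code are my own illustrative assumptions, not anything taken from the paper. The agent never sees a distribution; only the outer loop, which plays our role, does.

```python
import random

# Hypothetical toy environment (not from the paper): deterministic transitions,
# one reward per state. "left", "a" and "b" are absorbing 1-cycles.
SUCCESSORS = {
    "start": ["left", "right"],
    "left": ["left"],
    "right": ["a", "b"],
    "a": ["a"],
    "b": ["b"],
}
GAMMA = 0.9  # assumed discount rate


def optimal_values(reward, gamma=GAMMA, sweeps=100):
    """Value iteration for a deterministic MDP with state-based reward."""
    values = {s: 0.0 for s in SUCCESSORS}
    for _ in range(sweeps):
        values = {
            s: reward[s] + gamma * max(values[t] for t in SUCCESSORS[s])
            for s in SUCCESSORS
        }
    return values


def first_move(reward):
    """Successor of "start" chosen by the optimal policy for this reward."""
    values = optimal_values(reward)
    return max(SUCCESSORS["start"], key=lambda t: values[t])


def fraction_going_right(num_samples=2000, seed=0):
    """Our perspective: how often does an optimal agent head for "right"?"""
    rng = random.Random(seed)
    count = 0
    for _ in range(num_samples):
        # Step 1 (our uncertainty): draw a reward function, IID uniform per state.
        reward = {s: rng.random() for s in SUCCESSORS}
        # Steps 2-3: the agent is handed this exact reward and acts optimally.
        if first_move(reward) == "right":
            count += 1
    return count / num_samples


if __name__ == "__main__":
    print(fraction_going_right())
```

In this toy setup "right" keeps two absorbing states reachable while "left" keeps one, so the printed fraction should come out above one half, even though each individual agent is just deterministically following its own reward.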
In your paper, theorem 19 suggests that given a choice between two sets of 1-cycles C1 and C2 the agent is more likely to select the larger set.
I’d rewrite this as:
Theorem 19 suggests that, if an agent that knows its reward is about to choose between C1 and C2, but we don’t know the reward and our prior is that it is uniformly distributed, then we will assign higher probability to the agent going to the larger set.
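One way to see where the larger-set claim comes from, as a sketch under simplifying assumptions that are mine rather than the paper's: suppose C1 and C2 are disjoint, the agent can move directly into any absorbing state in either set, and the rewards R(s) are IID from a continuous distribution, so ties have probability zero. The optimal policy then enters whichever state has the largest reward, and by symmetry every state is equally likely to be that one:

```latex
\Pr\Bigl[\operatorname*{arg\,max}_{s \in C_1 \cup C_2} R(s) \in C_2\Bigr]
  = \frac{|C_2|}{|C_1| + |C_2|}
  > \frac{1}{2}
  \quad \text{whenever } |C_2| > |C_1| .
```

For instance, with |C1| = 1 and |C2| = 3 we would assign probability 3/4 to the agent ending up in C2. The same counting says a single state among n absorbing states is optimal with probability 1/n: low, but not zero.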
I do not see how the agent ‘seeks’ out powerful states because, as you say, the agent is fixed.
I do think this is mostly a matter of translation of math to English being hard. Like, when Alex says “optimal agents seek power”, I think you should translate it as “when we don’t know what goal an optimal agent has, we should assign higher probability that it will go to states that have higher power”, even though the agent itself is not thinking “ah, this state is powerful, I’ll go there”.
Great observation. Similarly, a hypothesis called “Maximum Causal Entropy” once claimed that physical systems involving intelligent actors tended towards states where the future could be specialized towards many different final states, and that maybe this was even part of what intelligence was. However, people objected: (monogamous) individuals don’t perpetually maximize their potential partners; they actually pick a partner, eventually.
My position on the issue is: most agents steer towards states which afford them greater power, and sometimes most agents give up that power to achieve their specialized goals. The point, however, is that they end up in the high-power states at some point in time along their optimal trajectory. I imagine that this is sufficient for the catastrophic power-stealing incentives: the AI only has to disempower us once for things to go irreversibly wrong.
If there’s a collection of ‘turned-off’ terminal states where the agent receives no further reward for all time then every optimized policy will try to avoid such a state.
To clarify, I don’t assume that. The terminal states, even those representing the off-switch, also have their reward drawn from the same distribution. When you distribute reward IID over states, the off-state is in fact optimal for some low-measure subset of reward functions.
But, maybe you’re saying “for realistic distributions, the agent won’t get any reward for being shut off and therefore π∗ won’t ever let itself be shut off”. I agree, and this kind of reasoning is captured by Theorem 3 of Generalizing the Power-Seeking Theorems. The problem is that this is just a narrow example of the more general phenomenon. What if we add transient “obedience” rewards, what then? For some level of farsightedness (γ close enough to 1), the agent will still disobey, and simultaneously disobedience gives it more control over the future.
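A back-of-the-envelope version of the obedience example, as a sketch only. The two-option setup and all names are hypothetical: obeying pays a one-time reward `r_obey` and then the agent is shut off with zero reward forever (the "realistic distribution" assumption from the previous paragraph), while disobeying pays nothing now and `r_goal` on every later step.

```python
def obey_return(r_obey):
    """Discounted return of obeying: one transient reward, then shutdown (0 forever)."""
    return r_obey


def disobey_return(r_goal, gamma):
    """Discounted return of disobeying: 0 now, then r_goal on every later step."""
    return gamma * r_goal / (1.0 - gamma)


def critical_gamma(r_obey, r_goal):
    """Discount rate above which disobeying is strictly better.

    Solving gamma * r_goal / (1 - gamma) > r_obey for gamma
    (with both rewards positive) gives gamma > r_obey / (r_obey + r_goal).
    """
    return r_obey / (r_obey + r_goal)


def prefers_disobedience(r_obey, r_goal, gamma):
    return disobey_return(r_goal, gamma) > obey_return(r_obey)


if __name__ == "__main__":
    r_obey, r_goal = 1.0, 0.1  # obedience pays 10x the per-step goal reward
    print("critical gamma:", critical_gamma(r_obey, r_goal))
    for gamma in (0.5, 0.9, 0.95, 0.99):
        print(gamma, prefers_disobedience(r_obey, r_goal, gamma))
```

However large the transient obedience reward is relative to the per-step goal reward, the critical discount rate in this toy model stays strictly below 1, so a sufficiently farsighted agent still disobeys.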
The paper doesn’t draw the causal diagram “Power → instrumental convergence”; rather, it gives sufficient conditions for power-seeking being instrumentally convergent. Cycle reachability preservation is one of those conditions.
In general, I’d suspect that there are goals we could give the agent that significantly reduce our gain. However, I’d also suspect the opposite.
Yes, right. The point isn’t that alignment is impossible, but that you have to hit a low-measure set of goals which will give you aligned or non-power-seeking behavior. The paper helps motivate why alignment is generically hard and catastrophic if you fail.
It seems reasonable to argue that we would if we could guarantee r=h.
Yes, if r=h, introduce the agent. You can formalize a kind of “alignment capability” by introducing a joint distribution over the human’s goals and the induced agent goals (preliminary Overleaf notes). So, if we had goal X, we’d implement an agent with goal X’, and so on. You then take our expected optimal value under this distribution and find whether you’re good at alignment, or whether you’re bad and you’ll build agents whose optimal policies tend to obstruct you.
There might be a way to argue over randomness and say this would double our gain.
The doubling depends on the environment structure. There are game trees and reward functions where this holds, and some where it doesn’t.
More speculatively, what if |r−h|<ϵ?
If the rewards are ϵ-close in sup-norm, then you can get nice regret bounds, sure.
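For reference, the standard argument runs as follows. The notation is my own labeling, not the paper's: V^π_r is the discounted value of policy π under reward r, π_r and π_h are optimal policies for r and h, and γ < 1 is the discount rate. Because the two rewards differ by less than ϵ at every step, the value of any fixed policy changes by at most ϵ/(1−γ) when you swap one reward for the other, and the regret of deploying π_r while caring about h follows from a three-term split:

```latex
\begin{align*}
\bigl| V^{\pi}_{r}(s) - V^{\pi}_{h}(s) \bigr|
  &\le \sum_{t=0}^{\infty} \gamma^{t} \epsilon = \frac{\epsilon}{1-\gamma}
  \quad \text{for every policy } \pi \text{ and state } s, \\
V^{\pi_h}_{h}(s) - V^{\pi_r}_{h}(s)
  &= \bigl( V^{\pi_h}_{h}(s) - V^{\pi_h}_{r}(s) \bigr)
   + \bigl( V^{\pi_h}_{r}(s) - V^{\pi_r}_{r}(s) \bigr)
   + \bigl( V^{\pi_r}_{r}(s) - V^{\pi_r}_{h}(s) \bigr) \\
  &\le \frac{\epsilon}{1-\gamma} + 0 + \frac{\epsilon}{1-\gamma}
   = \frac{2\epsilon}{1-\gamma}.
\end{align*}
```

The middle term is at most zero because π_r is optimal for r. Note that the bound grows without limit as γ approaches 1.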
Great question. One thing you could say is that an action is power-seeking compared to another if your expected power (computed in the non-dominated subgraph; see Figure 19) is greater for that action than for the other.
My understanding of figure 7 of your paper indicates that cycle reachability cannot be a sufficient condition.
Shortly after Theorem 19, the paper says: “In appendix C.6.2, we extend this reasoning to k-cycles (k > 1) via theorem 53 and explain how theorem 19 correctly handles fig. 7”. In particular, see Figure 19.
The key insight is that Theorem 19 talks about how many agents end up in a set of terminal states, not how many go through a state to get there. If you have two states with disjoint reachable terminal state sets, you can reason about the phenomenon pretty easily. Practically speaking, this should often suffice: for example, the off-switch state is disjoint from everything else.
If not, you can sometimes consider the non-dominated subgraph in order to regain disjointness. This isn’t in the main part of the paper, but basically you toss out transitions which aren’t part of a trajectory which is strictly optimal for some reward function. Figure 19 gives an example of this.
The main idea, though, is that you’re reasoning about what the agent’s end goals tend to be, and then saying “it’s going to pursue some way of getting there with much higher probability, compared to this small set of terminal states (i.e., shutdown)”. Theorem 17 tells us that in the limit, cycle reachability totally controls POWER.
I think I still haven’t clearly communicated all my mental models here, but I figured I’d write a reply now while I update the paper.
Thank you for these comments, by the way. You’re pointing out important underspecifications. :)
My philosophy is that aligned/general is OK based on a shared (?) premise that,
I think one problem is that power-seeking agents are generally not that corrigible, which means outcomes are extremely sensitive to the initial specification.
The freshly updated paper answers this question in great detail; see section 6 and also appendix B.
Power is kinda weird when defined for optimal agents, as you say—when γ=1, POWER can only decrease. See Power as Easily Exploitable Opportunities for more on this.