Sorry for the awkwardness (this comment was difficult to write). But I think it is important that people in the AI alignment community publish these sorts of thoughts. Obviously, I can be wrong about all of this.
Despite disagreeing with you, I’m glad that you published this comment, and I agree that airing disagreements is really important for the research community.
In particular, I don’t think the paper provides a simple description for the set of MDPs that the main claim in the abstract applies to (“We prove that for most prior beliefs one might have about the agent’s reward function […], one should expect optimal policies to seek power in these environments.”). Nor do I think that the paper justifies the relevance of that set of MDPs. (Why is it useful to prove things about it?)
There’s a sense in which I agree with you: AFAIK, there is no formal statement of the set of MDPs with the structural properties that Alex studies here. That doesn’t mean it isn’t relatively easy to state:
Proposition 6.9 requires that there is a state with two actions a1 and a2 such that (let’s say) a1 leads to a subMDP that can be injected/strictly injected into the subMDP that a2 leads to.
Theorems 6.12 and 6.13 require that there is a state with two actions a1 and a2 such that (let’s say) a1 leads to a set of RSDs (final cycles that are strictly optimal for some reward function) that can be injected/strictly injected into the set of RSDs from a2.
The first set of MDPs is quite restrictive (because you need an exact injection), which is why IIRC Alex extends the results to sets of RSDs, which capture a far larger class of MDPs. Intuitively, this is the class of MDPs in which, from some state, one action leads to more infinite-horizon behaviors than another. I personally find this class quite intuitive, and I feel it captures many real-world situations where we worry about power and instrumental convergence.
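To make that intuition concrete, here is a minimal toy sketch (my own illustration, not code or notation from the paper): a small deterministic MDP in which, from one state, action a1 can only reach a single terminal 1-cycle while action a2 can reach two, so the recurrent behaviors available via a1 inject strictly into those available via a2. The state names and helper functions are made up for the example.

```python
# Toy illustration (my own sketch, not from the paper): from state s0,
# action a1 reaches one terminal 1-cycle while action a2 reaches two.
# The recurrent behaviors reachable via a1 thus inject strictly into
# those reachable via a2 -- the kind of asymmetry the RSD results need.

# transitions[state][action] = next state; states whose only transition
# is a self-loop are the terminal 1-cycles (the only recurrent behaviors here).
transitions = {
    "s0": {"a1": "left", "a2": "right"},
    "left": {"stay": "left"},             # the single cycle reachable via a1
    "right": {"up": "top", "down": "bottom"},
    "top": {"stay": "top"},               # one cycle reachable via a2
    "bottom": {"stay": "bottom"},         # another cycle reachable via a2
}

def reachable(state):
    """All states reachable from `state` by following any action sequence."""
    seen, stack = set(), [state]
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        stack.extend(transitions[s].values())
    return seen

def terminal_cycles(state):
    """Terminal 1-cycles (self-looping states) reachable from `state`."""
    return {s for s in reachable(state)
            if set(transitions[s].values()) == {s}}

via_a1 = terminal_cycles(transitions["s0"]["a1"])
via_a2 = terminal_cycles(transitions["s0"]["a2"])

print(via_a1, via_a2)              # {'left'} vs {'top', 'bottom'}
print(len(via_a1) < len(via_a2))   # True: a2 keeps strictly more options open
```

In this toy example the "injection" is just a comparison of small sets of self-looping states; the actual theorems state the condition in terms of the MDP's structure (and not only counts), so take this purely as an intuition pump.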
Also, there may be a misconception that this paper formalizes the instrumental convergence thesis. That seems wrong, i.e. the paper does not seem to claim that several convergent instrumental values can be identified. The only convergent instrumental value that the paper attempts to address AFAICT is self-preservation (avoiding terminal states).
Once again, I partly agree: IIRC the paper doesn’t explicitly discuss the different convergent instrumental goals. On the other hand, the paper explicitly says that it focuses on a special case of the instrumental convergence thesis:
An action is instrumental to an objective when it helps achieve that objective. Some actions are instrumental to many objectives, making them robustly instrumental. The claim that power-seeking is robustly instrumental is a specific instance of the instrumental convergence thesis:
Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents [Bostrom, 2014].
That being said, you just made me want to look more into how well power-seeking captures different convergent instrumental goals from Omohundro’s paper, so thanks for that. :)