I’ve ended up spending probably more than 40 hours discussing, thinking about, and reading this paper (including earlier versions; the paper was first published in December 2019, and the current version is the 7th, published on June 1st, 2021). My impression is very different from Adam Shimi’s. The paper introduces many complicated definitions that build on each other, and its theorems say complicated things using those complicated definitions. I don’t think the paper explains how its complicated theorems are useful/meaningful.
In particular, I don’t think the paper provides a simple description for the set of MDPs that the main claim in the abstract applies to (“We prove that for most prior beliefs one might have about the agent’s reward function […], one should expect optimal policies to seek power in these environments.”). Nor do I think that the paper justifies the relevance of that set of MDPs. (Why is it useful to prove things about it?)
I think this paper should probably not be used for outreach interventions (even if it gets accepted to NeurIPS/ICML). And especially, I think it should not be cited as a paper that formally proves a core AI alignment argument.
Also, there may be a misconception that this paper formalizes the instrumental convergence thesis. That seems wrong, i.e. the paper does not seem to claim that several convergent instrumental values can be identified. The only convergent instrumental value that the paper attempts to address AFAICT is self-preservation (avoiding terminal states).
(The second version of the paper said: “Theorem 49 answers yes, optimal farsighted agents will usually acquire resources”. But the current version just says “Extrapolating from our results, we conjecture that Blackwell optimal policies tend to seek power by accumulating resources[…]”).
Sorry for the awkwardness (this comment was difficult to write). But I think it is important that people in the AI alignment community publish these sorts of thoughts. Obviously, I can be wrong about all of this.
For my part, I either strongly disagree with nearly every claim you make in this comment, or think you’re criticizing the post for claiming something that it doesn’t claim (e.g. “proves a core AI alignment argument”; did you read this post’s “A note of caution” section / the limitations section and conclusion of the paper v.7?).
I don’t think it will be useful for me to engage in detail, given that we’ve already extensively debated these points at length, without much consensus being reached.
For my part, I either strongly disagree with nearly every claim you make in this comment, or think you’re criticizing the post for claiming something that it doesn’t claim (e.g. “proves a core AI alignment argument”; did you read this post’s “A note of caution” section / the limitations section and conclusion of the paper v.7?).
I did read the “Note of caution” section in the OP. It says that most of the environments we think about seem to “have the right symmetries”, which may be true, but I haven’t seen the paper support that claim.
Maybe I just missed it, but I didn’t find a “limitations section” or similar in the paper. I did find the following in the Conclusion section:
We caution that many real-world tasks are partially observable and that learned policies are rarely optimal. Our results do not mathematically prove that hypothetical superintelligent AI agents will seek power.
Though the title of the paper can still give the impression that it proves a core argument for AI x-risk.
Also, plausibly-the-most-influential-critic-of-AI-safety in EA seems to have gotten the impression (from an earlier version of the paper) that it formalizes the instrumental convergence thesis (see the first paragraph here). So I think my advice that “it should not be cited as a paper that formally proves a core AI alignment argument” is beneficial.
I don’t think it will be useful for me to engage in detail, given that we’ve already extensively debated these points at length, without much consensus being reached.
For reference (in case anyone is interested in that discussion): I think it’s the thread that starts here (just the part after “2.”).
The paper supports the claim with:
Embodied environment in a vase-containing room (section 6.3)
Pac-Man (figure 8)
And section 7 argues why this generally holds whenever the agent can be shut down (a large class of environments indeed)
Average-optimal robots not idling in a particular spot (beginning of section 7)
This post supports the claim with:
Tic-Tac-Toe
Vase gridworld
SafeLife
So yes, this is sufficient support for speculation that most relevant environments have these symmetries.
Maybe I just missed it, but I didn’t find a “limitations section” or similar in the paper.
Sorry—I meant the “future work” portion of the discussion section 7. The future work highlights the “note of caution” bits. I also made sure that the intro emphasizes that the results don’t apply to learned policies.
Also, plausibly-the-most-influential-critic-of-AI-safety in EA seems to have gotten the impression (from an earlier version of the paper) that it formalizes the instrumental convergence thesis (see the first paragraph here).
Key part: earlier version of the paper. (I’ve talked to Ben since then, including about the newest results, their limitations, and their usefulness.)
I think my advice that “it should not be cited as a paper that formally proves a core AI alignment argument” is beneficial.
Your advice was beneficial a year ago, because that was a very different paper. I think it is no longer beneficial: I still agree with it, but I don’t think it needs to be mentioned on the margin. At this point, I have put far more care into hedging claims than most other work which I can recall. At some point, you’re hedging too much. And I’m not interested in hedging any more, unless I’ve made some specific blatant oversights which you’d like to inform me of.
Embodied environment in a vase-containing room (section 6.3)
I think this refers to the following passage from the paper:
Consider an embodied navigation task through a room with a vase. Proposition 6.9 suggests that optimal policies tend to avoid breaking the vase, since doing so would strictly decrease available options.
This seems to me like a counterexample. For any reward function that does not care about breaking the vase, the optimal policies do not avoid breaking the vase.
Regarding your next bullet point:
Pac-Man (figure 8)
And section 7 argues why this generally holds whenever the agent can be shut down (a large class of environments indeed)
I don’t know what you mean here by “generally holds”. When does an environment—in which the agent can be shut down—“have the right symmetries” for the purpose of the main claim? Consider the following counterexample (in which the last state is equivalent to the agent being shut down):
In most states (the first 3 states) the optimal policies of most reward functions transition to the next state, while the POWER-seeking behavior is to stay in the same state (when the discount rate is sufficiently close to 1). If we want to tell a story about this environment, we can say that it’s about a car in a one-way street.
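To make this concrete, here is a minimal sketch of the computation I have in mind (my reconstruction of the example: a five-state chain in which each of the first four states has a self-loop plus a “move forward” action, the fifth state is the absorbing shut-down state, and reward functions are drawn iid uniform over states; the chain size and the reward distribution are assumptions for illustration, not anything taken from the paper). In this chain an average-optimal policy settles in the best self-loop it can still reach, so moving forward is strictly optimal exactly when some later self-loop pays more than the current one:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 5          # s1..s4 have {stay, move forward}; s5 is absorbing ("shut down")
n_samples = 100_000   # reward functions drawn iid uniform on [0, 1] per state

rewards = rng.uniform(size=(n_samples, n_states))

for i in range(n_states - 1):
    # With the discount rate close to 1 (average optimality), leaving this state
    # is strictly optimal iff some later self-loop pays more than staying put.
    frac_forward = np.mean(rewards[:, i + 1:].max(axis=1) > rewards[:, i])
    print(f"s{i + 1}: fraction of sampled reward functions whose optimal "
          f"policy moves forward = {frac_forward:.3f}")
```

Under these assumptions the fractions come out to roughly 0.8, 0.75, 0.67 and 0.5 at the four non-terminal states, so in the first three states the optimal policies of most sampled reward functions move forward, even though staying put is the POWER-preserving choice.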
To be clear, the issue I’m raising here about the paper is NOT that the main claim does not apply to all MDPs. The issue is the lack of (1) a reasonably simple description of the set of MDPs that the main claim applies to; and (2) an explanation for why it is useful to prove things about that set.
Sorry—I meant the “future work” portion of the discussion section 7. The future work highlights the “note of caution” bits.
The limitations mentioned there are mainly: “Most real-world tasks are partially observable” and “our results only apply to optimal policies in finite MDPs”. I think that another limitation that belongs there is that the main claim only applies to a particular set of MDPs.
This seems to me like a counterexample. For any reward function that does not care about breaking the vase, the optimal policies do not avoid breaking the vase.
There are fewer ways for vase-breaking to be optimal. Optimal policies will tend to avoid breaking the vase, even though some don’t.
Consider the following counterexample (in which the last state is equivalent to the agent being shut down):
This is just making my point—average-optimal policies tend to end up in any state but the last state, even though at any given state they tend to progress. If D1 is {the first four cycles} and D2 is {the last cycle}, then average-optimal policies tend to end up in D1 instead of D2. Most average-optimal policies will avoid entering the final state, just as section 7 claims. (EDIT: Blackwell → average-)
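A quick numerical check of that tendency, using the same hypothetical five-state chain sketched above (again, the chain and the iid-uniform reward distribution are illustrative assumptions, not the paper’s setup): from the first state every cycle is reachable, so an average-optimal policy ends up looping in whichever cycle pays the most.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.uniform(size=(100_000, 5))  # 5 self-loop cycles; the last one is "shut down"

# From the first state all five cycles are reachable, so an average-optimal
# policy ends up in whichever cycle has the highest reward.
final_cycle = rewards.argmax(axis=1)
frac_d2 = np.mean(final_cycle == 4)   # ends up in the last cycle (shut down)
frac_d1 = 1 - frac_d2                 # ends up in one of the first four cycles
print(f"end up in D1 (first four cycles): {frac_d1:.3f}")
print(f"end up in D2 (the last cycle):    {frac_d2:.3f}")
```

About four fifths of the sampled reward functions have average-optimal policies that never enter the shut-down cycle, which is the sense in which they tend to end up in D1 rather than D2.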
(And I claim that the whole reason you’re able to reason about this environment is because my theorems apply to them—you’re implicitly using my formalisms and frames to reason about this environment, while seemingly trying to argue that my theorems don’t let us reason about this environment? Or something? I’m not sure, so take this impression with a grain of salt.)
Why is it interesting to prove things about this set of MDPs? At this point, it feels like someone asking me “why did you buy a hammer—that seemed like a waste of money?”. Maybe before I try out the hammer, I could have long debates about whether it was a good purchase. But now I know the tool is useful because I regularly use it and it works well for me, and other people have tried it and say it works well for them.
I agree that there’s room for cleaner explanation of when the theorems apply, for those readers who don’t want to memorize the formal conditions. But I think the theory says interesting things because it’s already starting to explain the things I built it to explain (e.g. SafeLife). And whenever I imagine some new environment I want to reason about, I’m almost always able to reason about it using my theorems (modulo already flagged issues like partial observability etc). From this, I infer that the set of MDPs is “interesting enough.”
Optimal policies will tend to avoid breaking the vase, even though some don’t.
Are you saying that the optimal policies of most reward functions will tend to avoid breaking the vase? Why?
This is just making my point—Blackwell optimal policies tend to end up in any state but the last state, even though at any given state they tend to progress. If D1 is {the first four cycles} and D2 is {the last cycle}, then optimal policies tend to end up in D1 instead of D2. Most optimal policies will avoid entering the final state, just as section 7 claims.
My question is just about the main claim in the abstract of the paper (“We prove that for most prior beliefs one might have about the agent’s reward function [...], one should expect optimal policies to seek power in these environments.”). The main claim does not apply to the simple environment in my example (i.e. we should not expect optimal policies to seek POWER in that environment). I’m completely fine with that being the case, I just want to understand why. What criterion does that environment violate?
I agree that there’s room for cleaner explanation of when the theorems apply, for those readers who don’t want to memorize the formal conditions.
I counted ~19 non-trivial definitions in the paper. Also, the theorems that the main claim directly relies on (which I guess is some subset of {Proposition 6.9, Proposition 6.12, Theorem 6.13}?) seem complicated. So I think the paper should definitely provide a reasonably simple description of the set of MDPs that the main claim applies to, and explain why proving things on that set is useful.
But I think the theory says interesting things because it’s already starting to explain the things I built it to explain (e.g. SafeLife). And whenever I imagine some new environment I want to reason about, I’m almost always able to reason about it using my theorems (modulo already flagged issues like partial observability etc). From this, I infer that the set of MDPs is “interesting enough.”
Do you mean that the main claim of the paper actually applies to those environments (i.e. that they are in the formal set of MDPs that the relevant theorems apply to) or do you just mean that optimal policies in those environments tend to be POWER-seeking? (The main claim only deals with sufficient conditions.)
Are you saying that the optimal policies of most reward functions will tend to avoid breaking the vase? Why?
Because you can do “strictly more things” with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives.
What criterion does that environment violate?
Right, good question. I’ll explain the general principle (not stated in the paper—yes, I agree this needs to be fixed!), and then answer your question about your environment. When the agent maximizes average reward, we know that optimal policies tend to seek power when there’s something like:
“Consider state s, and consider two actions a1 and a2. When {cycles reachable after taking a1 at s} is similar to a subset of {cycles reachable after taking a2 at s}, and those two cycle sets are disjoint, then a2 tends to be optimal over a1 and a2 tends to seek power compared to a1.” (This follows by combining proposition 6.12 and theorem 6.13)
Let’s reconsider your example:
Again, I very much agree that this part needs more explanation. Currently, the main paper has this to say:
Throughout the paper, I focused on the survival case because it automatically satisfies the above criterion (death is definitionally disjoint from non-death, since we assume you can’t do other things while dead), without my having to use limited page space explaining the nuances of this criterion.
Do you mean that the main claim of the paper actually applies to those environments
Yes, although SafeLife requires a bit of squinting (as I noted in the main post). Usually I’m thinking about RSDs in those environments.
Because you can do “strictly more things” with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives.
Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don’t “tend to avoid breaking the vase”. Those optimal policies don’t behave as if they care about the ‘strictly more states’ that can be reached by not breaking the vase.
When the agent maximizes average reward, we know that optimal policies tend to seek power when there’s something like:
“Consider state s, and consider two actions a1 and a2. When {cycles reachable after taking a1 at s} is similar to a subset of {cycles reachable after taking a2 at s}, and those two cycle sets are disjoint, then a2 tends to be optimal over a1 and a2 tends to seek power compared to a1.” (This follows by combining proposition 6.12 and theorem 6.13)
Here “{cycles reachable after taking a1 at s}” actually refers to an RSD, right? So we’re not just talking about a set of states, we’re talking about a set of vectors that each corresponds to a “state visitation distribution” of a different policy. In order for the “similar to” (via involution) relation to be satisfied, we need all the elements (real numbers) of the relevant vector pairs to match. This is a substantially more complicated condition than the one in your comment, and it is generally harder to satisfy in stochastic environments.
In fact, I think that condition is usually hard/impossible to satisfy even in toy stochastic environments. Consider a version of Pac-Man in which at least one “ghost” is moving randomly at any given time; I’ll call this Pac-Man-with-Random-Ghost (a quick internet search suggests that in the real Pac-Man the ghosts move deterministically other than when they are in “Frightened” mode, i.e. when they are blue and can’t kill Pac-Man).
Let’s focus on the condition in Proposition 6.12 (which is identical to or less strict than the condition for the main claim, right?). Given some state in a Pac-Man-with-Random-Ghost environment, suppose that action a1 results in an immediate game-over state due to a collision with a ghost, while action a2 does not. For every terminal state s′, RSD_nd(s′) is a set that contains a single vector in which all entries are 0 except for one that is non-zero. But for every state s that can result from action a2, we get that RSD(s) is a set that does not contain any vector-with-0s-in-all-entries-except-one, because for any policy, there is no way to get to a particular terminal state with probability 1 (due to the location of the ghosts being part of the state description). Therefore there does not exist a subset of RSD(s) that is similar to RSD_nd(s′) via an involution.
A similar argument seems to apply to Propositions 6.5 and 6.9. Also, I think Corollary 6.14 never applies to Pac-Man-with-Random-Ghost environments, because unless s is a terminal state, RSD(s) will not contain any vector-with-0s-in-all-entries-except-one (again, due to ghosts moving randomly). The paper claims (in the context of Figure 8 which is about Pac-Man): “Therefore, corollary 6.14 proves that Blackwell optimal policies tend to not go left in this situation. Blackwell optimal policies tend to avoid immediately dying in PacMan, even though most reward functions do not resemble Pac-Man’s original score function.” So that claim relies on Pac-Man being a “sufficiently deterministic” environment and it does not apply to the Pac-Man-with-Random-Ghost version.
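As a generic illustration of the visitation-vector point (a toy Markov chain I made up, not the paper’s Pac-Man model): when every transition is stochastic and every state stays reachable, the limiting state-visitation vector induced by a fixed policy has no zero entries, so no subset of such vectors can be matched, entry for entry, with the one-hot vector of a terminal state.

```python
import numpy as np

# Transition matrix induced by some fixed policy in a toy 3-state environment
# where every transition is stochastic (a stand-in for "a ghost moves randomly
# at every step"); rows sum to 1.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
])

# The limiting state-visitation frequencies are the stationary distribution:
# the left eigenvector of P for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary /= stationary.sum()
print(stationary)  # every entry is strictly positive -- nothing like a one-hot vector
```

That exact, real-number matching is what I expect to generally fail once the transitions are randomized.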
Can you give an example of a stochastic environment (with randomness in every state transition) to which the main claim of the paper applies?
Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don’t “tend to avoid breaking the vase”. Those optimal policies don’t behave as if they care about the ‘strictly more states’ that can be reached by not breaking the vase.
This is factually wrong BTW. I had just explained why the opposite is true.
Are you saying that my first sentence (“Most of the reward functions are either indifferent about the vase or want to break the vase”) is in itself factually wrong, or rather the rest of the quoted text?
The first sentence
Thanks.
We can construct an involution over reward functions that transforms every state by switching the is-the-vase-broken bit in the state’s representation. For every reward function that “wants to preserve the vase” we can apply the involution to it and get a reward function that “wants to break the vase”.
(And there are the reward functions that are indifferent about the vase, which the involution maps to themselves.)
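Here is a toy version of that pairing (the two-position state space and the specific reward functions are hypothetical, just to make the involution concrete):

```python
from itertools import product

# Hypothetical state space: (position, vase_broken) pairs in a 2-position world.
states = [(pos, broken) for pos, broken in product(range(2), (False, True))]

def flip_vase_bit(state):
    """The involution: switch the is-the-vase-broken bit, leave everything else alone."""
    pos, broken = state
    return (pos, not broken)

def permute_reward(reward):
    """Pull a reward function back through the involution: R'(s) = R(flip(s))."""
    return {s: reward[flip_vase_bit(s)] for s in states}

# A reward function that "wants to preserve the vase"...
prefers_intact = {s: (0.0 if s[1] else 1.0) for s in states}
# ...is paired with one that "wants to break the vase":
print(permute_reward(prefers_intact))
# while a vase-indifferent reward function is paired with itself:
indifferent = {s: float(s[0]) for s in states}
print(permute_reward(indifferent) == indifferent)  # True
```

Every reward function that prefers the vase intact is paired one-to-one with one that prefers it broken, while the vase-indifferent ones are fixed points, which is why I don’t see how a majority of reward functions ends up favoring preservation.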
Gotcha. I see where you’re coming from.
I think I underspecified the scenario and claim. The claim wasn’t supposed to be: most agents never break the vase (although this is sometimes true). The claim should be: most agents will not immediately break the vase.
If the agent has a choice between one action (“break vase and move forwards”) or another action (“don’t break vase and move forwards”), and these actions lead to similar subgraphs, then at all discount rates, optimal policies will tend to not break the vase immediately. But they might tend to break it eventually, depending on the granularity and balance of final states.
So I think we’re actually both making a correct point, but you’re making an argument for γ=1 under certain kinds of models and whether the agent will eventually break the vase. I (meant to) discuss the immediate break-it-or-not decision in terms of option preservation at all discount rates.
The claim should be: most agents will not immediately break the vase.
I don’t see why that claim is correct either, for a similar reason. If you’re assuming here that most reward functions incentivize avoiding immediately breaking the vase then I would argue that that assumption is incorrect, and to support this I would point to the same involution from my previous comment.
I’m not assuming that they incentivize anything. They just do! Here’s the proof sketch (for the full proof, you’d subtract a constant vector from each set, but that’s not relevant for the intuition).
You’re playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can’t just map one set of states to another for all-discount-rates reasoning.
Thanks for the figure. I’m afraid I didn’t understand it. (I assume this is a gridworld environment; what does “standing near intact vase” mean? Can the robot stand in the same cell as the intact vase?)
You’re playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can’t just map one set of states to another for all-discount-rates reasoning.
I don’t follow. (To be clear, I was not trying to apply any theorem from the paper via that involution.) But does this mean you are NOT making that claim (“most agents will not immediately break the vase”) in the limit of the discount rate going to 1? My understanding is that the main claim in the abstract of the paper is meant to assume that setting, based on the following sentence from the paper:
Proposition 6.5 and proposition 6.9 are powerful because they apply to all γ∈[0,1], but they can only be applied given hard-to-satisfy environmental symmetries.
Sorry for the awkwardness (this comment was difficult to write). But I think it is important that people in the AI alignment community publish these sorts of thoughts. Obviously, I can be wrong about all of this.
Despite disagreeing with you, I’m glad that you published this comment, and I agree that airing disagreements is really important for the research community.
In particular, I don’t think the paper provides a simple description for the set of MDPs that the main claim in the abstract applies to (“We prove that for most prior beliefs one might have about the agent’s reward function […], one should expect optimal policies to seek power in these environments.”). Nor do I think that the paper justifies the relevance of that set of MDPs. (Why is it useful to prove things about it?)
There’s a sense in which I agree with you: AFAIK, there is no formal statement of the set of MDPs with the structural properties that Alex studies here. That doesn’t mean it isn’t relatively easy to state:
Proposition 6.9 requires that there is a state with two actions a1 and a2 such that (let’s say) a1 leads to a subMDP that can be injected/strictly injected into the subMDP that a2 leads to.
Theorems 6.12 and 6.13 require that there is a state with two actions a1 and a2 such that (let’s say) a1 leads to a set of RSDs (final cycles that are strictly optimal for some reward function) that can be injected/strictly injected into the set of RSDs from a2.
The first set of MDPs is quite restrictive (because you need an exact injection), which is why IIRC Alex extends the results to the sets of RSDs, which captures a far larger class of MDPs. Intuitively, this is the class of MDPs such that some action leads to more infinite-horizon behaviors than another for the same state. I personally find this class quite intuitive, and I feel like it captures many real-world situations where we worry about power and instrumental convergence.
Also, there may be a misconception that this paper formalizes the instrumental convergence thesis. That seems wrong, i.e. the paper does not seem to claim that several convergent instrumental values can be identified. The only convergent instrumental value that the paper attempts to address AFAICT is self-preservation (avoiding terminal states).
Once again, I agree in part with the statement that the paper doesn’t IIRC explicitly discuss different convergent instrumental goals. On the other hand, the paper explicitly says that it focuses on a special case of the instrumental convergence thesis:
An action is instrumental to an objective when it helps achieve that objective. Some actions are instrumental to many objectives, making them robustly instrumental. The claim that power-seeking is robustly instrumental is a specific instance of the instrumental convergence thesis:
Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents [Bostrom, 2014].
That being said, you just made me want to look more into how well power-seeking captures different convergent instrumental goals from Omohundro’s paper, so thanks for that. :)
Meta: it seems that my original comment was silently removed from the AI Alignment Forum. I ask whoever did this to explain their reasoning here. Since every member of the AF could have done this AFAIK, I’m going to try to move my comment back to AF, because I think it obviously belongs there (I don’t believe we have any norms about this sort of situation...). If the removal was done by a forum moderator/admin, please let me know.
My apologies—I had thought I had accidentally moved your comment to AF by unintentionally replying to your comment on AF, and so (from my POV) I “undid” it (for both mine and yours). I hadn’t realized it was already on AF.
No worries, thanks for the clarification.
[EDIT: the confusion may have resulted from me mentioning the LW username “adamShimi”, which I’ll now change to the display name on the AF (“Adam Shimi”).]