As a quick summary (read the paper and sequence if you want more details), they show that for any distribution over reward functions, if there are more “options” available after action 1 than after action 2, then most of the orbit of the distribution (the set of distributions induced by applying any permutation on the MDP, which thus permutes the initial distribution) has optimal policies that do action 1.
That is not what the theorems in the paper show at all (it’s not just a matter of details). The relevant theorems require a much stronger and more complicated condition than having more “options” after action 1 than after action 2. They require the existence of an involution between two sets of real vectors where each vector corresponds to a “state visitation distribution” of a different policy.
To demonstrate that this is not just a matter of “details”: your description suggests that there is generally no problem applying the theorems in stochastic environments (the paper deals with stochastic MDPs). But since the actual condition is much stronger than what you described here, the theorems almost never apply in stochastic environments!
It’s usually impossible to construct a useful involution, as required by the theorems, in stochastic environments. The paper (and the accompanying posts) use the Pac-Man environment as an example, which is a stochastic environment. But the reason the theorems can often be applied there is that the environment usually behaves deterministically. The ghosts always move deterministically unless they are in “blue mode” (i.e. when they can’t kill Pac-Man), in which they sometimes move randomly. This arbitrary quirk of the Pac-Man environment is what allows the theorems to show that “Blackwell optimal policies tend to avoid immediately dying in Pac-Man”, as the paper claims. (Whenever Pac-Man can immediately die, the ghosts are not in “blue mode” and thus the environment behaves deterministically.)
The point of using scare quotes is to abstract away that part. So I think it is an accurate description, in that it flags that “options” is not just the normal intuitive version of options.
I think the quoted description is not at all what the theorems in the paper show, no matter what concept the word “options” (in scare quotes) refers to. In order to apply the theorems we need to show that an involution with certain properties exists; not that <some set of things after action 1> is larger than <some set of things after action 2>.
To be more specific, the concept that the word “options” refers to here is recurrent state distributions. If the quoted description were roughly correct, there would not be a problem with applying the theorems in stochastic environments. But in fact the theorems can almost never be applied in stochastic environments. For example, suppose action 1 leads to more available “options”, and action 2 causes “immediate death” with probability 0.7515746, and that precise probability does not appear in any transition that follows action 1. We cannot apply the theorems because no involution with the necessary properties exists.
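To make the example concrete, here is a rough sketch (not the paper's formal condition, which has further requirements). It assumes each “option” is a recurrent state distribution written as a vector over states, and that the involution is a relabeling of states that must map every distribution reachable after action 2 onto some distribution reachable after action 1. The 5-state toy environment and all names below are hypothetical.

```python
from itertools import permutations

def apply_perm(perm, vec):
    """Relabel states: the mass on state i moves to state perm[i]."""
    out = [0.0] * len(vec)
    for i, mass in enumerate(vec):
        out[perm[i]] = mass
    return tuple(out)

def is_involution(perm):
    """A permutation is an involution if applying it twice is the identity."""
    return all(perm[perm[i]] == i for i in range(len(perm)))

def close(u, v, tol=1e-12):
    return all(abs(x - y) <= tol for x, y in zip(u, v))

def find_involution(set_a, set_b):
    """Brute-force search for a state involution phi such that phi(b) is in
    set_a for every b in set_b. Returns the permutation as a tuple, or None."""
    n_states = len(set_a[0])
    for perm in permutations(range(n_states)):
        if not is_involution(perm):
            continue
        if all(any(close(apply_perm(perm, b), a) for a in set_a) for b in set_b):
            return perm
    return None

# Hypothetical toy environment. State 0 is "death"; states 1-3 are the absorbing
# "options" after action 1; state 4 is where action 2 lands if it does not kill.
options_after_action_1 = [(0, 1, 0, 0, 0), (0, 0, 1, 0, 0), (0, 0, 0, 1, 0)]
deterministic_death = [(1, 0, 0, 0, 0)]
stochastic_death = [(0.7515746, 0, 0, 0, 0.2484254)]

det_result = find_involution(options_after_action_1, deterministic_death)
stoch_result = find_involution(options_after_action_1, stochastic_death)
print(det_result)    # some involution, e.g. one swapping states 0 and 1
print(stoch_result)  # None
```

A relabeling of states only reorders the entries of a vector, so a distribution containing the entry 0.7515746 can never be mapped onto distributions whose entries are all 0 or 1. That is the sense in which the search fails in the stochastic case, even though action 1 still has three “options” and action 2 has one.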
You’re being unhelpfully pedantic. The quoted portion even includes the phrase “As a quick summary (read the paper and sequence if you want more details)”! This reads to me as an attempted pre-emption of “gotcha” comments.
The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads. But this post isn’t about the stochastic sensitivity issue, and I don’t think it should have to talk about the sensitivity issue.
I noticed that after my previous comment you’ve edited your comment to include the page number and the link. Thanks.
I still couldn’t find in the paper (top of page 9) an explanation for the “stochastic sensitivity issue”. Perhaps you were referring to the following:
randomly generated MDPs are unlikely to satisfy our sufficient conditions for POWER-seeking tendencies
But the issue is with stochastic MDPs, not randomly generated MDPs.
Re the linked post section, I couldn’t find anything there about stochastic MDPs.
For (3), environments which “almost” have the right symmetries should also “almost” obey the theorems. To give a quick, non-legible sketch of my reasoning:
For the uniform distribution over reward functions on the unit hypercube ($[0,1]^{|S|}$), optimality probability should be Lipschitz continuous on the available state visit distributions (in some appropriate sense). Then if the theorems are “almost” obeyed, instrumentally convergent actions still should have extremely high probability, and so most of the orbits still have to agree.
So I don’t currently view (3) as a huge deal. I’ll probably talk more about that another time.
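As a rough numerical illustration of the quantity in that sketch (not a proof, and not code from the paper), one can estimate optimality probability by Monte Carlo. The assumptions here are mine: rewards are drawn uniformly from the unit hypercube, each action leads to a set of recurrent state distributions, and action 1 counts as optimal when its best distribution has average reward at least as high as the best one after action 2. The toy distributions are hypothetical.

```python
import random

def prob_action_1_optimal(set_a, set_b, n_samples=20000, seed=0):
    """Estimate P(best average reward after action 1 >= best after action 2)
    for reward vectors drawn uniformly from [0, 1]^{|S|}."""
    rng = random.Random(seed)
    n_states = len(set_a[0])

    def best_value(dists, reward):
        # Average reward of a distribution d is the dot product d . reward.
        return max(sum(p * r for p, r in zip(d, reward)) for d in dists)

    wins = 0
    for _ in range(n_samples):
        reward = [rng.random() for _ in range(n_states)]
        if best_value(set_a, reward) >= best_value(set_b, reward):
            wins += 1
    return wins / n_samples

# Hypothetical toy: state 0 is "death", states 1-3 follow action 1,
# state 4 is where action 2 lands when it does not kill.
options_after_action_1 = [(0, 1, 0, 0, 0), (0, 0, 1, 0, 0), (0, 0, 0, 1, 0)]
deterministic_death = [(1, 0, 0, 0, 0)]
stochastic_death = [(0.7515746, 0, 0, 0, 0.2484254)]

p_det = prob_action_1_optimal(options_after_action_1, deterministic_death)
p_stoch = prob_action_1_optimal(options_after_action_1, stochastic_death)
print(p_det, p_stoch)
```

In the deterministic case the exact value is 3/4 (the chance that one of three i.i.d. uniform rewards beats a fourth), and in this particular toy the stochastic variant comes out at about 0.8. This only shows how the probability behaves in one small example; it says nothing about whether the theorems themselves apply.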
That quote does not seem to mention the “stochastic sensitivity issue”. In the post that you linked to, “(3)” refers to:
Not all environments have the right symmetries
But most ones we think about seem to
So I’m still not sure what you meant when you wrote “The phenomena you discuss are explained in the paper (EDIT: top of page 9), and in other posts, and discussed at length in other comment threads.”
(Again, I’m not aware of any previous mention of the “stochastic sensitivity issue” other than in my comment here.)
The phenomena you discuss are explained in the paper, and in other posts, and discussed at length in other comment threads.
I haven’t found an explanation of the “stochastic sensitivity issue” in the paper. Can you please point me to a specific section/page/quote? All that I found about this in the paper was the sentence:
Our theorems apply to stochastic environments, but we present a deterministic case study for clarity.
(I’m also not aware of previous posts/threads that discuss this, other than my comment here.)
I brought up this issue as a demonstration of the implications of incorrectly assuming that the theorems in the paper apply when there are more “options” available after action 1 than after action 2.
(I argue that this issue shows that the informal description in the OP does not correctly describe the theorems in the paper, and it’s not just a matter of omitting details.)
As a quick summary (read the paper and sequence if you want more details), they show that for any distribution over reward functions, if there are more “options” available after action 1 than after action 2, then most of the orbit of the distribution (the set of distributions induced by applying any permutation on the MDP, which thus permutes the initial distribution) has optimal policies that do action 1.
Also, this claim is missing the “disjoint requirement” and so it is incorrect even without the “they show that” part (i.e. it’s not just that the theorems in the paper don’t show the thing that is being claimed, but rather the thing that is being claimed is incorrect). Consider the following example where action 1 leads to more “options” but most optimal policies choose action 2:
(I elaborated more on this here).