Are you saying that the optimal policies of most reward functions will tend to avoid breaking the vase? Why?
Because you can do “strictly more things” with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives.
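To make “strictly more things” concrete, here is a minimal reachability sketch on a hypothetical four-state vase graph (my own toy illustration, not an example from the paper; the state names are made up): the intact-vase states can still reach every broken-vase state by breaking later, but not the other way around.

```python
# Toy reachability check; the four states and edges are hypothetical.
from collections import deque

edges = {
    "intact_home": ["intact_away", "broken_home"],  # may wander, or break the vase
    "intact_away": ["intact_home", "broken_away"],
    "broken_home": ["broken_away"],                 # breaking is irreversible
    "broken_away": ["broken_home"],
}

def reachable(start):
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        for t in edges[s]:
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

print(reachable("intact_home"))  # all four states, including both broken ones
print(reachable("broken_home"))  # only the two broken states
```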
What criterion does that environment violate?
Right, good question. I’ll explain the general principle (not stated in the paper—yes, I agree this needs to be fixed!), and then answer your question about your environment. When the agent maximizes average reward, we know that optimal policies tend to seek power when there’s something like:
“Consider state s, and consider two actions a1 and a2. When {cycles reachable after taking a1 at s} is similar to a subset of {cycles reachable after taking a2 at s}, and those two cycle sets are disjoint, then a2 tends to be optimal over a1 and a2 tends to seek power compared to a1.” (This follows by combining proposition 6.12 and theorem 6.13)
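As a rough numerical illustration of the flavor of that criterion (my own toy example; the loop names x, y, z and the i.i.d. uniform reward distribution are assumptions, not anything from the paper): after a1 the agent can only settle into loop x, after a2 it can settle into loop y or loop z, and {x} maps into {y, z} via the involution x ↔ y.

```python
# Toy check of the criterion above; loop names and rewards are hypothetical.
import random

random.seed(0)
trials = 100_000
a2_wins = 0
for _ in range(trials):
    r = {s: random.random() for s in ("x", "y", "z")}  # i.i.d. uniform rewards
    gain_a1 = r["x"]               # best average reward reachable after a1
    gain_a2 = max(r["y"], r["z"])  # best average reward reachable after a2
    a2_wins += gain_a2 >= gain_a1
print(a2_wins / trials)  # roughly 2/3 of sampled reward functions prefer a2
```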
Let’s reconsider your example:
Again, I very much agree that this part needs more explanation. Currently, the main paper has this to say:
Throughout the paper, I focused on the survival case because it automatically satisfies the above criterion (death is definitionally disjoint from non-death, since we assume you can’t do other things while dead), without my having to use limited page space explaining the nuances of this criterion.
Do you mean that the main claim of the paper actually applies to those environments?
Yes, although SafeLife requires a bit of squinting (as I noted in the main post). Usually I’m thinking about RSDs in those environments.
Because you can do “strictly more things” with the vase (including later breaking it) than you can do after you break it, in the sense of proposition 6.9 / lemma D.49. This means that you can permute breaking-vase-is-optimal objectives into breaking-vase-is-suboptimal objectives.
Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don’t “tend to avoid breaking the vase”. Those optimal policies don’t behave as if they care about the ‘strictly more states’ that can be reached by not breaking the vase.
When the agent maximizes average reward, we know that optimal policies tend to seek power when there’s something like:
“Consider state s, and consider two actions a1 and a2. When {cycles reachable after taking a1 at s} is similar to a subset of {cycles reachable after taking a2 at s}, and those two cycle sets are disjoint, then a2 tends to be optimal over a1 and a2 tends to seek power compared to a1.” (This follows by combining proposition 6.12 and theorem 6.13)
Here “{cycles reachable after taking a1 at s}” actually refers to an RSD, right? So we’re not just talking about a set of states; we’re talking about a set of vectors, each of which corresponds to the “state visitation distribution” of a different policy. In order for the “similar to” (via involution) relation to be satisfied, we need all the elements (real numbers) of the relevant vector pairs to match. This is a substantially more complicated condition than the one in your comment, and it is generally harder to satisfy in stochastic environments.
In fact, I think that condition is usually hard/impossible to satisfy even in toy stochastic environments. Consider a version of Pac-Man in which at least one “ghost” is moving randomly at any given time; I’ll call this Pac-Man-with-Random-Ghost (a quick internet search suggests that in the real Pac-Man the ghosts move deterministically other than when they are in “Frightened” mode, i.e. when they are blue and can’t kill Pac-Man).
Let’s focus on the condition in Proposition 6.12 (which is identical to or less strict than the condition for the main claim, right?). Given some state in a Pac-Man-with-Random-Ghost environment, suppose that action a1 results in an immediate game-over state due to a collision with a ghost, while action a2 does not. For every terminal state s′, RSD_nd(s′) is a set that contains a single vector in which all entries are 0 except for one that is non-zero. But for every state s that can result from action a2, we get that RSD(s) is a set that does not contain any vector-with-0s-in-all-entries-except-one, because for any policy, there is no way to get to a particular terminal state with probability 1 (due to the location of the ghosts being part of the state description). Therefore there does not exist a subset of RSD(s) that is similar to RSD_nd(s′) via an involution.
A similar argument seems to apply to Propositions 6.5 and 6.9. Also, I think Corollary 6.14 never applies to Pac-Man-with-Random-Ghost environments, because unless s is a terminal state, RSD(s) will not contain any vector-with-0s-in-all-entries-except-one (again, due to ghosts moving randomly). The paper claims (in the context of Figure 8 which is about Pac-Man): “Therefore, corollary 6.14 proves that Blackwell optimal policies tend to not go left in this situation. Blackwell optimal policies tend to avoid immediately dying in PacMan, even though most reward functions do not resemble Pac-Man’s original score function.” So that claim relies on Pac-Man being a “sufficiently deterministic” environment and it does not apply to the Pac-Man-with-Random-Ghost version.
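To illustrate the one-hot point numerically, here is a minimal sketch on a hypothetical three-state chain of my own (not the paper’s Pac-Man model): the limiting state-visitation distribution started from the terminal state is one-hot, while the one started from a live state whose description includes a randomly moving ghost has fractional entries, so no involution can match it element-wise to RSD_nd(s′).

```python
# Toy chain (my own construction): one absorbing game_over state, and two
# "alive" states that differ only in the ghost's position, which randomizes.
import numpy as np

states = ["alive_ghost_left", "alive_ghost_right", "game_over"]

# transition matrix under one fixed policy ("stand still"); game_over absorbs
P = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
])

def limiting_visit_distribution(P, start, steps=10_000):
    d = np.zeros(len(P)); d[start] = 1.0
    avg = np.zeros(len(P))
    for _ in range(steps):
        avg += d
        d = d @ P
    return avg / steps

print(limiting_visit_distribution(P, start=2))  # [0, 0, 1]: one-hot at game_over
print(limiting_visit_distribution(P, start=0))  # ~[0.5, 0.5, 0]: no one-hot entry
```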
Can you give an example of a stochastic environment (with randomness in every state transition) to which the main claim of the paper applies?
Most of the reward functions are either indifferent about the vase or want to break the vase. The optimal policies of all those reward functions don’t “tend to avoid breaking the vase”. Those optimal policies don’t behave as if they care about the ‘strictly more states’ that can be reached by not breaking the vase.
This is factually wrong BTW. I had just explained why the opposite is true.
Are you saying that my first sentence (“Most of the reward functions are either indifferent about the vase or want to break the vase”) is in itself factually wrong, or rather the rest of the quoted text?

The first sentence.

Thanks.
We can construct an involution over reward functions that transforms every state by switching the is-the-vase-broken bit in the state’s representation. For every reward function that “wants to preserve the vase” we can apply on it the involution and get a reward function that “wants to break the vase”.
(And the reward functions that are indifferent about the vase are mapped to themselves by the involution.)
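Here is a minimal sketch of that construction (the toy state space, location × is-the-vase-broken, and the example reward functions are mine, just for illustration):

```python
# The involution flips the is-the-vase-broken bit of each state, and it lifts
# to reward functions by precomposition: (phi . R)(s) = R(phi(s)).
from itertools import product

states = list(product(["home", "away"], [False, True]))  # (location, vase_broken)

def phi(state):
    loc, broken = state
    return (loc, not broken)  # flip the is-the-vase-broken bit

def lift(reward):
    return {s: reward[phi(s)] for s in states}

# a reward function that "wants to preserve the vase" ...
prefers_intact = {s: (0.0 if s[1] else 1.0) for s in states}
# ... maps to one that "wants to break the vase"
print(lift(prefers_intact))

# a vase-indifferent reward function is a fixed point of the lifted involution
indifferent = {s: (1.0 if s[0] == "home" else 0.0) for s in states}
print(lift(indifferent) == indifferent)  # True
```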
Gotcha. I see where you’re coming from.

I think I underspecified the scenario and claim. The claim wasn’t supposed to be: most agents never break the vase (although this is sometimes true). The claim should be: most agents will not immediately break the vase.
If the agent has a choice between one action (“break vase and move forwards”) or another action (“don’t break vase and move forwards”), and these actions lead to similar subgraphs, then at all discount rates, optimal policies will tend to not break the vase immediately. But they might tend to break it eventually, depending on the granularity and balance of final states.
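One way to sanity-check this numerically, on a toy MDP of my own rather than a worked example from the paper: the “keep” subtree contains a copy of the “break” subtree plus one extra terminal option (the “similar subgraphs” condition), and the script below samples reward functions uniformly and estimates, at a few discount rates, the fraction whose optimal policy keeps the vase at the first step.

```python
# Toy deterministic MDP; all state names, rewards, and discount rates are
# hypothetical choices made for this sketch.
import random

random.seed(0)
successors = {
    ("s0", "break"): "b",  ("s0", "keep"): "k",
    ("b", "t1"): "b_t1",   ("b", "t2"): "b_t2",
    ("k", "t1"): "k_t1",   ("k", "t2"): "k_t2",  ("k", "t3"): "k_t3",
}
terminals = ["b_t1", "b_t2", "k_t1", "k_t2", "k_t3"]

def q_value(state_action, reward, gamma):
    nxt = successors[state_action]
    if nxt in terminals:
        return reward[nxt] / (1.0 - gamma)  # absorbing terminal: geometric sum
    best = max(q_value((nxt, a), reward, gamma)
               for (s, a) in successors if s == nxt)
    return reward[nxt] + gamma * best

def fraction_keeping(gamma, trials=20_000):
    keep = 0
    for _ in range(trials):
        reward = {s: random.random() for s in ["b", "k"] + terminals}
        keep += q_value(("s0", "keep"), reward, gamma) >= \
                q_value(("s0", "break"), reward, gamma)
    return keep / trials

for gamma in (0.1, 0.5, 0.9):
    print(gamma, fraction_keeping(gamma))
```

In this toy I would expect the fraction to sit at or above one half at every discount rate I try, rising as γ grows and the extra option matters more, but the script just reports whatever it finds.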
So I think we’re actually both making a correct point, but you’re making an argument for γ=1 under certain kinds of models, about whether the agent will eventually break the vase. I (meant to) discuss the immediate break-it-or-not decision in terms of option preservation at all discount rates.
The claim should be: most agents will not immediately break the vase.
I don’t see why that claim is correct either, for a similar reason. If you’re assuming here that most reward functions incentivize avoiding immediately breaking the vase then I would argue that that assumption is incorrect, and to support this I would point to the same involution from my previous comment.
I’m not assuming that they incentivize anything. They just do! Here’s the proof sketch (for the full proof, you’d subtract a constant vector from each set, but that’s not relevant for the intuition).
You’re playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can’t just map one set of states to another for all-discount-rates reasoning.
Thanks for the figure. I’m afraid I didn’t understand it. (I assume this is a gridworld environment; what does “standing near intact vase” mean? Can the robot stand in the same cell as the intact vase?)
You’re playing a tad fast and loose with your involution argument. Unlike the average-optimal case, you can’t just map one set of states to another for all-discount-rates reasoning.
I don’t follow. (To be clear, I was not trying to apply any theorem from the paper via that involution.) But does this mean you are NOT making that claim (“most agents will not immediately break the vase”) in the limit of the discount rate going to 1? My understanding is that the main claim in the abstract of the paper is meant to assume that setting, based on the following sentence from the paper:
Proposition 6.5 and proposition 6.9 are powerful because they apply to all γ∈[0,1], but they can only be applied given hard-to-satisfy environmental symmetries.