That one in particular isn’t a counterexample as stated, because you can’t construct a subgraph isomorphism for it. When writing this I thought that actually meant I didn’t need more of a caveat (contrary to what I said to you earlier), but now thinking about it a third time I really do need the “no cycles” caveat. The counterexample is:
Z <--> S --> A --> B
With every state also having a self-loop.
In this case, the involution {S:B, Z:A}, would suggest that S --> Z would have more options than S --> A, but optimal policies will take S --> A more often than S --> Z.
(The theorem effectively says “no cycles” by conditioning on the policy being S --> Z or S --> A, in which case the S --> Z --> S --> S --> … possibility is not actually possible, and the involution doesn’t actually go through.)
EDIT: I’ve changed to say that the actions lead to disjoint parts of the state space.
This particular argument is not talking about farsightedness—when we talk about having more options, each option is talking about the entire journey and exact timesteps, rather than just the destination. Since all the “journeys” starting with the S --> Z action go to Z first, and all the “journeys” starting with the S --> A action go to A first, the isomorphism has to map A to Z and vice versa, so that ϕ(T(S,a1))=T(S,a2).
(What assumption does this correspond to in the theorem? In the theorem, the involution has to map Fa to a subset of Fa′; every possibility in Fa1 starts with A, and every possibility in Fa2 starts with Z, so you need to map A to Z.)
Thanks for the correction.
That one in particular isn’t a counterexample as stated, because you can’t construct a subgraph isomorphism for it. When writing this I thought that actually meant I didn’t need more of a caveat (contrary to what I said to you earlier), but now thinking about it a third time I really do need the “no cycles” caveat. The counterexample is:
Z <--> S --> A --> B
With every state also having a self-loop.
In this case, the involution {S:B, Z:A}, would suggest that S --> Z would have more options than S --> A, but optimal policies will take S --> A more often than S --> Z.
(The theorem effectively says “no cycles” by conditioning on the policy being S --> Z or S --> A, in which case the S --> Z --> S --> S --> … possibility is not actually possible, and the involution doesn’t actually go through.)
EDIT: I’ve changed to say that the actions lead to disjoint parts of the state space.
My take on it has been, the theorem’s bottleneck assumption implies that you can’t reach S again after taking action a1 or a2, which rules out cycles.
Yeah actually that works too
Probably not an important point, but I don’t see why we can’t use the identity isomorphism (over the part of the state space that a1 leads to).
This particular argument is not talking about farsightedness—when we talk about having more options, each option is talking about the entire journey and exact timesteps, rather than just the destination. Since all the “journeys” starting with the S --> Z action go to Z first, and all the “journeys” starting with the S --> A action go to A first, the isomorphism has to map A to Z and vice versa, so that ϕ(T(S,a1))=T(S,a2).
(What assumption does this correspond to in the theorem? In the theorem, the involution has to map Fa to a subset of Fa′; every possibility in Fa1 starts with A, and every possibility in Fa2 starts with Z, so you need to map A to Z.)