Since Briggs [1] shows that EDT+SSA and CDT+SIA both recommend ex-ante-optimal policies in some class of cases, one might wonder whether the result of this post transfers to EDT+SSA. That is: in memoryless POMDPs, is every (ex ante) optimal policy also consistent with EDT+SSA in a similar sense? I think it is, as I will try to show below.
Given some existing policy π, EDT+SSA recommends that upon receiving observation o we should choose an action from
$$\operatorname{argmax}_a \sum_{s_1 \ldots s_n} \sum_{i=1}^{n} \operatorname{SSA}(s_i \text{ in } s_1 \ldots s_n \mid o, \pi_{o \to a}) \, U(s_1 \ldots s_n).$$
(For notational simplicity, I’ll assume that policies are deterministic, but, of course, actions may encode probability distributions.) Here, $\pi_{o \to a}(o') = a$ if $o = o'$ and $\pi_{o \to a}(o') = \pi(o')$ otherwise. $\operatorname{SSA}(s_i \text{ in } s_1 \ldots s_n \mid o, \pi_{o \to a})$ is the SSA probability of being in state $s_i$ of the environment trajectory $s_1 \ldots s_n$, given the observation $o$ and the fact that one uses the policy $\pi_{o \to a}$.
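In code, the $\pi_{o \to a}$ construction is just a pointwise override of the policy. A minimal sketch (representing a policy as a function from observations to actions; the names are mine, not notation from above):

```python
def modify_policy(pi, o, a):
    """Return pi_{o->a}: identical to pi, except that observation o is mapped to action a."""
    return lambda obs: a if obs == o else pi(obs)

pi = lambda obs: 0                   # toy policy that always plays action 0
pi_mod = modify_policy(pi, "o1", 1)  # pi_{o1 -> 1}
assert pi_mod("o1") == 1 and pi_mod("o2") == 0
```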
The SSA probability $\operatorname{SSA}(s_i \text{ in } s_1 \ldots s_n \mid o, \pi_{o \to a})$ is zero if $m(s_i) \neq o$, and
$$\operatorname{SSA}(s_i \text{ in } s_1 \ldots s_n \mid o, \pi_{o \to a}) = P(s_1 \ldots s_n \mid \pi_{o \to a}, o) \, \frac{1}{\#(o, s_1 \ldots s_n)}$$
otherwise. Here, $\#(o, s_1 \ldots s_n) = \sum_{i=1}^{n} [m(s_i) = o]$ is the number of times $o$ occurs in $s_1 \ldots s_n$. Note that this is the minimal-reference-class version of SSA, also known as the double-halfer rule (because it assigns probability 1/2 to tails in the Sleeping Beauty problem and sticks with 1/2 if it’s told that it’s Monday). $P(s_1 \ldots s_n \mid \pi_{o \to a}, o)$ is the (regular, non-anthropic) probability of the sequence of states $s_1 \ldots s_n$, given that $\pi_{o \to a}$ is played and $o$ is observed at least once. If (as in the sum above) $o$ is observed at least once in $s_1 \ldots s_n$, we can rewrite this as
$$P(s_1 \ldots s_n \mid \pi_{o \to a}, o) = \frac{P(s_1 \ldots s_n \mid \pi_{o \to a})}{P(o \mid \pi_{o \to a})}.$$
Importantly, note that $P(o \mid \pi_{o \to a})$ is constant in $a$: the probability that you observe $o$ at least once cannot (in the present setting) depend on what you would do upon observing $o$, because up to the first time $o$ is observed, $\pi_{o \to a}$ prescribes the same actions as $\pi$.
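To make these definitions concrete, here is a minimal Python sketch that computes the minimal-reference-class SSA probabilities for a toy trajectory distribution (the states, observation function, and probabilities are made up for illustration; the distribution plays the role of $P(\cdot \mid \pi_{o \to a})$) and checks that, given $o$, they sum to 1:

```python
# Toy distribution over trajectories (tuples of states) under some fixed policy pi_{o->a}.
trajectories = {
    ("s1", "s2"): 0.3,         # observation o occurs once
    ("s1", "s3", "s1"): 0.5,   # observation o occurs twice
    ("s4",): 0.2,              # observation o does not occur
}

def m(state):
    """Observation function: state s1 emits observation o, everything else emits something else."""
    return "o" if state == "s1" else "other"

def ssa(i, traj, o):
    """Minimal-reference-class SSA probability of being in state traj[i], given observation o."""
    if m(traj[i]) != o:
        return 0.0
    count = sum(m(s) == o for s in traj)            # #(o, s_1...s_n)
    p_o = sum(p for t, p in trajectories.items()    # P(o | pi_{o->a}): probability of
              if any(m(s) == o for s in t))         # observing o at least once
    return trajectories[traj] / p_o / count         # P(traj | pi_{o->a}, o) / #(o, traj)

total = sum(ssa(i, traj, "o") for traj in trajectories for i in range(len(traj)))
print(abs(total - 1.0) < 1e-9)  # True: given o, the SSA probabilities form a distribution
```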
Inserting this into the above, we get
$$\operatorname{argmax}_a \sum_{s_1 \ldots s_n} \sum_{i=1}^{n} \operatorname{SSA}(s_i \text{ in } s_1 \ldots s_n \mid o, \pi_{o \to a}) \, U(s_1 \ldots s_n) = \operatorname{argmax}_a \sum_{s_1 \ldots s_n \text{ with } o} \; \sum_{\substack{i = 1 \ldots n \\ m(s_i) = o}} \frac{P(s_1 \ldots s_n \mid \pi_{o \to a})}{\#(o, s_1 \ldots s_n) \, P(o \mid \pi_{o \to a})} \, U(s_1 \ldots s_n),$$
where the first sum on the right-hand side is over all histories in which observation $o$ is made at some point. Because the policy is set for all agents with observation $o$ at once, the inner sum has exactly $\#(o, s_1 \ldots s_n)$ identical terms, which cancels the division by $\#(o, s_1 \ldots s_n)$, such that this equals
$$\operatorname{argmax}_a \frac{1}{P(o \mid \pi_{o \to a})} \sum_{s_1 \ldots s_n \text{ with } o} P(s_1 \ldots s_n \mid \pi_{o \to a}) \, U(s_1 \ldots s_n) = \operatorname{argmax}_a \sum_{s_1 \ldots s_n \text{ with } o} P(s_1 \ldots s_n \mid \pi_{o \to a}) \, U(s_1 \ldots s_n) = \operatorname{argmax}_a \sum_{s_1 \ldots s_n} P(s_1 \ldots s_n \mid \pi_{o \to a}) \, U(s_1 \ldots s_n).$$
(The first step uses that $P(o \mid \pi_{o \to a})$ is a positive constant in $a$; the second uses that the probabilities of histories in which $o$ is never observed do not depend on $a$ either, so they only add a constant.)
Obviously, any optimal policy chooses in agreement with this. But the same disclaimers apply: if there are multiple observations, then multiple policies might satisfy the right-hand side of this equation, and not all of them are optimal.
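As a quick sanity check on this chain of equalities, here is a small Python sketch for the absent-minded driver, using the standard payoffs (0 for exiting at the first intersection, 4 for exiting at the second, 1 for continuing past both); these payoffs are an assumption of mine, not something specified above. There is a single observation (“intersection”), which is made with probability 1, so the EDT+SSA objective and the ex-ante expected utility should be maximized by the same (mixed) action:

```python
def trajectories(p):
    """For exit probability p, return (probability, utility, #(o, trajectory)) per trajectory."""
    return [
        (p, 0.0, 1),                  # exit at the first intersection
        ((1 - p) * p, 4.0, 2),        # continue, then exit at the second
        ((1 - p) * (1 - p), 1.0, 2),  # continue past both intersections
    ]

def ex_ante_eu(p):
    return sum(prob * u for prob, u, _ in trajectories(p))

def edt_ssa_objective(p):
    # One summand per instance of the observation in each trajectory; since the
    # observation is made with probability 1, P(traj | pi, o) = P(traj | pi).
    return sum((prob / count) * u for prob, u, count in trajectories(p) for _ in range(count))

grid = [i / 1000 for i in range(1001)]
print(max(grid, key=ex_ante_eu))         # 0.333 (~1/3), the ex-ante optimal exit probability
print(max(grid, key=edt_ssa_objective))  # 0.333, EDT+SSA recommends the same action
```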
Caveat: The version of EDT provided above only takes into account dependences between instances of EDT that make the same observation. Other dependences are possible, because different decision situations may be completely “isomorphic”/symmetric even if the observations differ. It turns out that the result is not valid once one takes such dependences into account, as shown by Conitzer [2]. I propose a possible solution in https://casparoesterheld.com/2017/10/22/a-behaviorist-approach-to-building-phenomenological-bridges/. Roughly speaking, my solution is to identify with all objects in the world that are perfectly correlated with you. However, the underlying motivation is unrelated to Conitzer’s example.
[1] Rachael Briggs: Putting a Value on Beauty. In Tamar Szabo Gendler and John Hawthorne, editors, Oxford Studies in Epistemology: Volume 3, pages 3–34. Oxford University Press, 2010. http://joelvelasco.net/teaching/3865/briggs10-puttingavalueonbeauty.pdf
[2] Vincent Conitzer: A Dutch Book against Sleeping Beauties Who Are Evidential Decision Theorists. Synthese, Volume 192, Issue 9, pp. 2887–2899, October 2015. https://arxiv.org/pdf/1705.03560.pdf
I noticed that the sum inside $\operatorname{argmax}_a \sum_{s_1 \ldots s_n} \sum_{i=1}^{n} \operatorname{SSA}(s_i \text{ in } s_1 \ldots s_n \mid o, \pi_{o \to a}) \, U(s_1 \ldots s_n)$ is not actually an expected utility, because the SSA probabilities do not add up to 1 when there is more than one possible observation. The issue is that conditional on making an observation, the probabilities for the trajectories not containing that observation become 0, but the other probabilities are not renormalized. So this seems to be partway between “real” EDT and UDT (which does not set those probabilities to 0 and of course also does not renormalize).
This zeroing of the probabilities of trajectories not containing the current observation (and renormalizing, if one were to do that) seems at best useless busywork, and at worst prevents coordination between agents making different observations. In this formulation of EDT, such coordination is ruled out in another way, namely by specifying that, conditional on $o \to a$, the agent is still sure the rest of π is unchanged (i.e., copies of itself receiving other observations keep following π). If we remove the zeroing/renormalizing and say that the agent ought to have more realistic beliefs conditional on $o \to a$, I think we end up with something close to UDT1.0 (modulo differences in the environment model from the original UDT).
(Oh, I ignored the splitting up of probabilities of trajectories into SSA probabilities and then adding them back up again, which may have some intuitive appeal but ends up being just a null operation. Does anyone see a significance to that part?)
Sorry for taking an eternity to reply (again).
On the first point: Good point! I’ve now finally fixed the SSA probabilities so that they sum to 1, as they really should for this to be a version of EDT.
>prevents coordination between agents making different observations.
Yeah, coordination between different observations is definitely not handled optimally in this case. But I don’t see an EDT way of doing it well. After all, there are cases where, given one observation, you prefer one policy, and given another observation, you prefer a different one. So I think you need the ex ante perspective to get consistent preferences over entire policies.
>(Oh, I ignored the splitting up of probabilities of trajectories into SSA probabilities and then adding them back up again, which may have some intuitive appeal but ends up being just a null operation. Does anyone see a significance to that part?)
The only significance is to get a version of EDT, which we would traditionally assume to have self-locating beliefs. From a purely mathematical point of view, I think it’s nonsense.
I now have a draft for a paper that gives this result and others.
Elsewhere, I illustrate this result for the absent-minded driver.