There is a formal analogy between infra-Bayesian decision theory (IBDT) and modal updateless decision theory (MUDT).
Consider a one-shot decision theory setting. There is a set of unobservable states $S$, a set of actions $A$, and a reward function $r: A \times S \to [0,1]$. An IBDT agent has some belief $\beta \in \square S$[1], and it chooses the action $a^* := \operatorname{argmax}_{a \in A} E_\beta[\lambda s.\, r(a,s)]$.
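To make this concrete, here is a minimal Python sketch of the decision rule in the crisp case, where the belief is represented by a finite set of distributions (the extreme points of its credal set) and the infra-expectation is the worst case over that set. The states, actions, and numbers are made up for illustration.

```python
import numpy as np

# Hypothetical crisp infra-belief over S = {0, 1}: a finite set of
# candidate distributions (extreme points of the credal set).
beta = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]

# Made-up reward table r[a][s] with actions A = {0, 1}.
r = np.array([[1.0, 0.0],   # action 0
              [0.4, 0.5]])  # action 1

def infra_expectation(belief, f):
    """E_beta[f] for a crisp belief: worst case over the credal set."""
    return min(mu @ f for mu in belief)

# IBDT decision rule: a* = argmax_a E_beta[lambda s: r(a, s)].
a_star = max(range(len(r)), key=lambda a: infra_expectation(beta, r[a]))
print("optimal action:", a_star)
```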
We can construct an equivalent scenario by augmenting this one with a perfect predictor of the agent (Omega). To do so, define $S' := A \times S$, where the semantics of $(p,s)$ is "the unobservable state is $s$ and Omega predicts the agent will take action $p$". We then define $r': A \times S' \to [0,1]$ by $r'(a,p,s) := \mathbb{1}_{a=p}\, r(a,s) + \mathbb{1}_{a \neq p}$ and $\beta' \in \square S'$ by $E_{\beta'}[f] := \min_{p \in A} E_\beta[\lambda s.\, f(p,s)]$ ($\beta'$ is what we call the pullback of $\beta$ to $S'$, i.e. we have utter Knightian uncertainty about Omega). This is essentially the usual Nirvana construction: counterfactuals in which Omega's prediction is wrong receive the maximal reward $1$.
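As a sanity check, under the same hypothetical crisp-case setup as above, the following sketch builds $S'$, $r'$, and the pullback $\beta'$, whose extreme points are $\delta_p \otimes \mu$ for $p \in A$ and $\mu$ ranging over the extreme points of $\beta$, and confirms that the augmented problem picks the same action:

```python
import itertools
import numpy as np

# Same toy data as before: crisp belief over S = {0, 1}, reward r[a][s].
beta = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]
r = np.array([[1.0, 0.0],
              [0.4, 0.5]])
A, S = range(r.shape[0]), range(r.shape[1])

def r_prime(a, p, s):
    # On-prediction: the real reward. Off-prediction: "Nirvana" (reward 1).
    return r[a][s] if a == p else 1.0

# Pullback of beta to S' = A x S: extreme points delta_p (x) mu, i.e.
# Omega's prediction p is completely Knightian, the state s follows mu.
beta_prime = []
for p, mu in itertools.product(A, beta):
    beta_prime.append({(p, s): mu[s] for s in S})

def infra_expectation(belief, f):
    """Worst-case expectation over the extreme points of the credal set."""
    return min(sum(w * f(p, s) for (p, s), w in nu.items()) for nu in belief)

# The augmented problem has the same optimal action as the original one.
a_star = max(A, key=lambda a: infra_expectation(
    beta_prime, lambda p, s: r_prime(a, p, s)))
print("optimal action in the augmented setup:", a_star)
```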
The new setup produces the same optimal action as before. However, we can now give an alternative description of the decision rule.
For any $p \in A$, define $\Omega_p \in \square S'$ by $E_{\Omega_p}[f] := \min_{s \in S} f(p,s)$. That is, $\Omega_p$ is an infra-Bayesian representation of the belief "Omega will make prediction $p$". For any $u \in [0,1]$, define $R_u \in \square S'$ by $E_{R_u}[f] := \min_{\mu \in \Delta S' :\, E_\mu[r(p,s)] \geq u} E_\mu[f(p,s)]$. $R_u$ can be interpreted as the belief "assuming Omega is accurate, the expected reward will be at least $u$".
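In the crisp picture these two beliefs have simple concrete representations: $\Omega_p$ is (the convex hull of) the point masses $\delta_{(p,s)}$, i.e. all distributions supported on $\{p\} \times S$, and $R_u$ is the polytope cut out by a single linear constraint. A hypothetical sketch, reusing the toy data from above:

```python
import numpy as np

# Toy reward table r[a][s] as before.
r = np.array([[1.0, 0.0],
              [0.4, 0.5]])
A, S = range(r.shape[0]), range(r.shape[1])

def omega_extreme_points(p):
    """Omega_p as a crisp set: the point masses delta_{(p, s)} for s in S.
    Their convex hull is every distribution supported on {p} x S."""
    return [{(p, s): 1.0} for s in S]

def in_R_u(nu, u):
    """Membership in R_u: does nu in Delta(S') satisfy E_nu[r(p, s)] >= u?
    (r(p, s) is the reward earned when the action matches the prediction.)"""
    return sum(w * r[p][s] for (p, s), w in nu.items()) >= u

# E.g. every extreme point of Omega_1 lies in R_0.4 (min_s r[1][s] = 0.4),
# but not in R_0.5, since r[1][0] = 0.4 < 0.5.
print(all(in_R_u(nu, 0.4) for nu in omega_extreme_points(1)))  # True
print(all(in_R_u(nu, 0.5) for nu in omega_extreme_points(1)))  # False
```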
We will also need to use the order $\preceq$ on $\square X$ defined by: $\phi \preceq \psi$ when $\forall f \in [0,1]^X : E_\phi[f] \geq E_\psi[f]$. The reversal is needed to make the analogy to logic intuitive. Indeed, $\phi \preceq \psi$ can be interpreted as "$\phi$ implies $\psi$"[2], the meet operator $\wedge$ can be interpreted as logical conjunction, and the join operator $\vee$ can be interpreted as logical disjunction.
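For crisp infradistributions the reversal is just "smaller credal set, larger minimum": if $\Phi \subseteq \Psi$ then $\min_{\mu \in \Phi} E_\mu[f] \geq \min_{\mu \in \Psi} E_\mu[f]$ for every $f$. A tiny illustration with made-up numbers (checking one test function, though the definition quantifies over all of them):

```python
import numpy as np

# Two crisp beliefs over S = {0, 1}; Phi's credal set is a subset of
# Psi's, so in the order of this post phi <= psi ("phi implies psi").
Phi = [np.array([0.7, 0.3])]
Psi = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]

f = np.array([1.0, 0.0])  # one arbitrary test function f: S -> [0, 1]
# The smaller credal set yields the larger worst-case expectation.
assert min(mu @ f for mu in Phi) >= min(mu @ f for mu in Psi)
print(min(mu @ f for mu in Phi), min(mu @ f for mu in Psi))  # 0.7 0.4
```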
Claim:
$$a^* = \operatorname{argmax}_{a \in A}\, \max\{u \in [0,1] \mid \beta' \wedge \Omega_a \preceq R_u\}$$
(Actually, I only checked it when we restrict to crisp infradistributions, in which case $\wedge$ is intersection of sets and $\preceq$ is set containment, but it's probably true in general.)
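Under that crisp-case reading, the claim can be checked numerically on the toy example: $\beta' \wedge \Omega_a$ is the convex hull of $\{\delta_a \otimes \mu : \mu \in \beta\}$, and since $R_u$ is cut out by one linear constraint, containment only needs to be verified on these extreme points. The sketch below (same hypothetical data as before) compares the resulting rule against the direct IBDT rule:

```python
import numpy as np

# Toy data as before: crisp belief over S = {0, 1}, reward table r[a][s].
beta = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]
r = np.array([[1.0, 0.0],
              [0.4, 0.5]])
A = range(r.shape[0])

def contained_in_R_u(a, u):
    """Is beta' /\ Omega_a contained in R_u?

    In the crisp case beta' /\ Omega_a is the convex hull of
    {delta_a (x) mu : mu in beta}, and R_u is defined by the linear
    constraint E[r(p, s)] >= u, so checking extreme points suffices.
    """
    return all(mu @ r[a] >= u for mu in beta)

def best_u(a):
    """max{u in [0,1] | beta' /\ Omega_a <= R_u}, approximated on a grid."""
    grid = np.linspace(0.0, 1.0, 1001)
    return max(u for u in grid if contained_in_R_u(a, u))

# Decision rule from the claim ...
a_claim = max(A, key=best_u)
# ... versus the direct IBDT rule a* = argmax_a E_beta[r(a, .)].
a_direct = max(A, key=lambda a: min(mu @ r[a] for mu in beta))
assert a_claim == a_direct
print("both rules pick action", a_claim)
```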
Now, $\beta' \wedge \Omega_a \preceq R_u$ can be interpreted as "the conjunction of the belief $\beta'$ and $\Omega_a$ implies $R_u$". Roughly speaking: "according to $\beta'$, if the predicted action is $a$ then the expected reward is at least $u$". So, our decision rule says: choose the action that maximizes the value of $u$ for which this logical implication holds (where "holds" is better thought of as "is provable", since we're talking about the agent's belief). This is exactly the decision rule of MUDT!
[1] Apologies for the potential confusion between $\square$ as "space of infradistributions" and the $\square$ of modal logic (not used in this post).
[2] Technically, it's better to think of it as "$\psi$ is true in the context of $\phi$", since $\phi \preceq \psi$ is not itself an infradistribution, so $\preceq$ is not a genuine implication operator.