Actually I was somewhat confused about what the right update rule for fuzzy beliefs is when I wrote that comment. But I think I got it figured out now.
First, background about fuzzy beliefs:
Let E be the space of environments (defined as the space of instrumental states in Definition 9 here). A fuzzy belief is a concave function ϕ:E→[0,1] s.t. supϕ=1. We can think of it as the membership function of a fuzzy set. For an incomplete model Φ⊆E, the corresponding ϕ is the concave hull of the characteristic function of Φ (i.e. the minimal concave ϕ s.t. ϕ≥χΦ).
Let γ be the geometric discount parameter and $U(\gamma) := (1-\gamma)\sum_{n=0}^{\infty}\gamma^n r_n$ be the utility function. Given a policy π (EDIT: in general, we allow our policies to explicitly depend on γ), the value of π at ϕ is defined by
$$V^\pi(\phi,\gamma) := 1 + \inf_{\mu \in E}\left(\mathbb{E}_\mu^\pi[U(\gamma)] - \phi(\mu)\right)$$
The optimal policy and the optimal value for ϕ are defined by
$$\pi^*_{\phi,\gamma} := \operatorname*{arg\,max}_{\pi} V^\pi(\phi,\gamma)$$
$$V(\phi,\gamma) := \max_{\pi} V^\pi(\phi,\gamma)$$
Given a policy π, the regret of π at ϕ is defined by
$$\mathrm{Rg}^\pi(\phi,\gamma) := V(\phi,\gamma) - V^\pi(\phi,\gamma)$$
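To make the value and regret definitions concrete, here is a minimal sketch on a toy problem. It assumes (this is not in the definitions above) that the infimum over environments can be replaced by a minimum over a finite list of candidate environments, and that the expected utility of each policy in each candidate environment is already known. The names `policy_value`, `regret`, `membership` and `utilities_by_policy`, and all the numbers, are hypothetical.

```python
def policy_value(expected_utilities, membership):
    """Toy V^pi: 1 + min over candidate environments of (E[U] - phi(mu)).

    Assumption: the infimum over all environments is approximated by a
    minimum over a finite candidate list, indexed consistently in both
    arguments.
    """
    return 1.0 + min(eu - phi for eu, phi in zip(expected_utilities, membership))


def regret(utilities_by_policy, membership, policy):
    """Toy Rg^pi: optimal value over the listed policies minus V^pi."""
    values = {
        name: policy_value(eus, membership)
        for name, eus in utilities_by_policy.items()
    }
    return max(values.values()) - values[policy]


# Hypothetical numbers: three candidate environments, two policies.
membership = [1.0, 1.0, 0.5]           # phi(mu) for each candidate mu
utilities_by_policy = {
    "cautious": [0.6, 0.5, 0.25],      # E[U] of this policy in each mu
    "bold": [0.9, 0.25, 0.0],
}

for name in utilities_by_policy:
    print(name, policy_value(utilities_by_policy[name], membership),
          regret(utilities_by_policy, membership, name))
```

In this toy instance the policy with the better worst case over high-membership environments gets the higher value, so the other policy has positive regret.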
π is said to learn ϕ if it is asymptotically optimal for ϕ as γ→1, that is
$$\lim_{\gamma \to 1} \mathrm{Rg}^\pi(\phi,\gamma) = 0$$
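Given a probability measure ζ over the space of fuzzy hypotheses, the Bayesian regret of π at ζ is defined by
$$\mathrm{BRg}^\pi(\zeta,\gamma) := \mathbb{E}_{\phi \sim \zeta}\left[\mathrm{Rg}^\pi(\phi,\gamma)\right]$$
π is said to learn ζ when
$$\lim_{\gamma \to 1} \mathrm{BRg}^\pi(\zeta,\gamma) = 0$$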
If such a π exists, ζ is said to be learnable. Analogously to Bayesian RL, ζ is learnable if and only if it is learned by a specific policy π∗ζ (the Bayes-optimal policy). To define it, we define the fuzzy belief ϕζ by
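$$\phi_\zeta(\mu) := \sup_{\substack{\sigma : \operatorname{supp}\zeta \to E \\ \mathbb{E}_{\phi \sim \zeta}[\sigma(\phi)] = \mu}} \mathbb{E}_{\phi \sim \zeta}\left[\phi(\sigma(\phi))\right]$$
We now define $\pi^*_\zeta := \pi^*_{\phi_\zeta}$ (with the dependence on γ left implicit).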
Now, updating:
(EDIT: the definition was needlessly complicated, simplified)
Consider a history h∈(A×O)∗ or h∈(A×O)∗×A. Here A is the set of actions and O is the set of observations. Define μ∗ϕ by
$$\mu^*_\phi := \operatorname*{arg\,max}_{\mu \in E} \min_{\pi} \left(\phi(\mu) - \mathbb{E}_\mu^\pi[U]\right)$$
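As a rough illustration of this argmax-min, here is a sketch on a finite toy problem. It assumes a finite list of candidate environments and a finite set of policies with known expected utilities, which the definition above does not require. The function name `reference_environment` and all numbers are hypothetical.

```python
def reference_environment(membership, utilities_by_policy):
    """Index of the candidate mu maximizing min over pi of (phi(mu) - E[U]).

    Assumption: environments are a finite candidate list, and
    utilities_by_policy[pi][i] is the expected utility of pi in candidate i.
    """
    def score(i):
        # Inner minimum over policies for a fixed candidate environment.
        return min(membership[i] - eus[i] for eus in utilities_by_policy.values())

    return max(range(len(membership)), key=score)


# Hypothetical numbers: three candidate environments, two policies.
membership = [1.0, 1.0, 0.5]
utilities_by_policy = {
    "cautious": [0.6, 0.5, 0.25],
    "bold": [0.9, 0.25, 0.0],
}

print(reference_environment(membership, utilities_by_policy))
```

Since the inner minimum over π equals ϕ(μ) minus the best achievable expected utility in μ, the selected environment is one that combines high membership with low attainable utility.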
Let E′ be the space of “environments starting from h”. That is, if h∈(A×O)∗ then E′=E and if h∈(A×O)∗×A then E′ is slightly different because the history now begins with an observation instead of with an action.
For any μ∈E, ν∈E′ we define $[\nu]_h^\mu \in E$ by
$$[\nu]_h^\mu(o \mid h') := \begin{cases} \nu(o \mid h'') & \text{if } h' = hh'' \\ \mu(o \mid h') & \text{otherwise} \end{cases}$$
Then, the updated fuzzy belief is
$$(\phi \mid h)(\nu) := \phi\left([\nu]_h^{\mu^*_\phi}\right) + \text{constant}$$
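The splicing construction and the update can be sketched in code as follows. This is only an illustration under assumptions that are not in the text: an environment is modeled as a Python function from a history (a tuple of alternating actions and observations) to a dict of observation probabilities, a fuzzy belief is any function from such environments to [0,1], and the additive constant is left as a parameter because its value is not specified above. The names `splice`, `update`, `uniform_env`, `biased_env` and `toy_belief` are hypothetical.

```python
def splice(nu, mu, h):
    """Environment that behaves like nu after the prefix h and like mu elsewhere."""
    h = tuple(h)

    def spliced(history):
        history = tuple(history)
        if history[:len(h)] == h:
            # history = h followed by h'': defer to nu on the remainder h''.
            return nu(history[len(h):])
        return mu(history)

    return spliced


def update(phi, mu_star, h, constant=0.0):
    """Updated belief: nu -> phi(nu spliced into mu_star after h) + constant."""
    def updated(nu):
        return phi(splice(nu, mu_star, h)) + constant

    return updated


# Hypothetical toy environments over observations "0" and "1".
def uniform_env(history):
    return {"0": 0.5, "1": 0.5}


def biased_env(history):
    return {"0": 0.25, "1": 0.75}


# Hypothetical belief: membership read off one observation probability.
def toy_belief(env):
    return env(("a", "1", "a"))["1"]


h = ("a", "1")
posterior = update(toy_belief, uniform_env, h)
print(posterior(biased_env), posterior(uniform_env))
```

In this toy case the original belief only probes a history that extends h, so the updated belief depends only on how the candidate ν behaves after h.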