From the perspective of full agency (i.e., the negation of partial agency), a system which needs a protected epistemic layer sounds suspiciously like a system that can’t tile. You look at the world, and you say: “how can I maximize utility?” You look at your beliefs, and you say: “how can I maximize accuracy?” That’s not a consequentialist agent; that’s two different consequentialist agents!
For reinforcement learning with incomplete/fuzzy hypotheses, this separation doesn’t exist, because the update rule for fuzzy beliefs depends on the utility function and in some sense even on the actual policy.
How does that work?
Actually, I was somewhat confused about what the right update rule for fuzzy beliefs is when I wrote that comment, but I think I have it figured out now.
First, some background on fuzzy beliefs:
Let $E$ be the space of environments (defined as the space of instrumental states in Definition 9 here). A fuzzy belief is a concave function $\phi : E \to [0,1]$ s.t. $\sup \phi = 1$. We can think of it as the membership function of a fuzzy set. For an incomplete model $\Phi \subseteq E$, the corresponding $\phi$ is the concave hull of the characteristic function of $\Phi$ (i.e. the minimal concave $\phi$ s.t. $\phi \geq \chi_\Phi$).
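To make the concave hull concrete (this unpacking is mine, using the standard formula for the concave envelope): $\phi(\mu)$ is the largest total weight that elements of $\Phi$ can carry in a convex decomposition of $\mu$,
$$\phi(\mu) = \sup\left\{\sum_i \lambda_i \chi_\Phi(\mu_i) \;:\; \lambda_i \ge 0,\ \sum_i \lambda_i = 1,\ \sum_i \lambda_i \mu_i = \mu\right\}$$
In particular $\phi \equiv 1$ on $\operatorname{conv}\Phi$, and e.g. if $\mu = \tfrac{3}{4}\mu_1 + \tfrac{1}{4}\rho$ with $\mu_1 \in \Phi$ and $\rho$ arbitrary, then $\phi(\mu) \ge \tfrac{3}{4}$.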
Let $\gamma$ be the geometric discount parameter and $U(\gamma) := (1-\gamma)\sum_{n=0}^{\infty} \gamma^n r_n$ the utility function. Given a policy $\pi$ (EDIT: in general, we allow our policies to depend explicitly on $\gamma$), the value of $\pi$ at $\phi$ is defined by
$$V^\pi(\phi,\gamma) := 1 + \inf_{\mu \in E}\left(\mathbb{E}_{\mu\pi}[U(\gamma)] - \phi(\mu)\right)$$
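As a sanity check (this unpacking is mine, and assumes rewards are normalized so that $U(\gamma) \in [0,1]$): when $\phi$ is the concave hull of $\chi_\Phi$ for an incomplete model $\Phi$, this reduces to the usual worst-case value. For $\mu \in \Phi$ the bracket equals $\mathbb{E}_{\mu\pi}[U(\gamma)] - 1$, so $V^\pi(\phi,\gamma) \le \inf_{\mu \in \Phi} \mathbb{E}_{\mu\pi}[U(\gamma)]$; conversely, decomposing any $\mu$ as $\lambda \mu' + (1-\lambda)\rho$ with $\mu' \in \operatorname{conv}\Phi$ and $\lambda = \phi(\mu)$ gives $\mathbb{E}_{\mu\pi}[U(\gamma)] - \phi(\mu) \ge \lambda\left(\inf_{\mu'' \in \Phi} \mathbb{E}_{\mu''\pi}[U(\gamma)] - 1\right) \ge \inf_{\mu'' \in \Phi} \mathbb{E}_{\mu''\pi}[U(\gamma)] - 1$. Hence
$$V^\pi(\phi,\gamma) = \inf_{\mu \in \Phi} \mathbb{E}_{\mu\pi}[U(\gamma)]$$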
The optimal policy and the optimal value for $\phi$ are defined by
$$\pi^*_{\phi,\gamma} := \arg\max_\pi V^\pi(\phi,\gamma), \qquad V(\phi,\gamma) := \max_\pi V^\pi(\phi,\gamma)$$
Given a policy $\pi$, the regret of $\pi$ at $\phi$ is defined by
$$\mathrm{Rg}^\pi(\phi,\gamma) := V(\phi,\gamma) - V^\pi(\phi,\gamma)$$
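Here is a minimal computational sketch of these definitions, assuming everything is finite: a handful of candidate environments with membership values $\phi(\mu)$, a couple of policies, and the expected utilities $\mathbb{E}_{\mu\pi}[U(\gamma)]$ supplied as a lookup table (all names and numbers are made up for illustration, and the infimum over $E$ is restricted to the listed candidates).

```python
# Toy sketch of V^pi(phi, gamma), the optimal value and regret, for a finite case.
# Assumptions (mine, for illustration): finitely many candidate environments,
# finitely many policies, and E_{mu pi}[U(gamma)] given as a lookup table.

# membership function phi on the candidate environments (sup phi = 1)
phi = {"mu1": 1.0, "mu2": 1.0, "mu3": 0.5}

# expected utility E_{mu pi}[U(gamma)] for each (policy, environment) pair
expected_U = {
    "pi_a": {"mu1": 0.8, "mu2": 0.6, "mu3": 0.3},
    "pi_b": {"mu1": 0.5, "mu2": 0.9, "mu3": 0.7},
}

def value(policy):
    """V^pi(phi, gamma) = 1 + inf_mu ( E_{mu pi}[U(gamma)] - phi(mu) )."""
    return 1 + min(expected_U[policy][mu] - phi[mu] for mu in phi)

# V(phi, gamma) = max_pi V^pi(phi, gamma), attained by the optimal policy
optimal_policy = max(expected_U, key=value)
optimal_value = value(optimal_policy)

def regret(policy):
    """Rg^pi(phi, gamma) = V(phi, gamma) - V^pi(phi, gamma)."""
    return optimal_value - value(policy)

for pi in expected_U:
    print(pi, value(pi), regret(pi))
```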
$\pi$ is said to learn $\phi$ when it is asymptotically optimal for $\phi$ as $\gamma \to 1$, that is
$$\lim_{\gamma \to 1} \mathrm{Rg}^\pi(\phi,\gamma) = 0$$
Given a probability measure $\zeta$ over the space of fuzzy hypotheses, the Bayesian regret of $\pi$ at $\zeta$ is defined by
$$\mathrm{BRg}^\pi(\zeta,\gamma) := \mathbb{E}_{\phi \sim \zeta}\left[\mathrm{Rg}^\pi(\phi,\gamma)\right]$$
$\pi$ is said to learn $\zeta$ when
$$\lim_{\gamma \to 1} \mathrm{BRg}^\pi(\zeta,\gamma) = 0$$
If such a $\pi$ exists, $\zeta$ is said to be learnable. Analogously to Bayesian RL, $\zeta$ is learnable if and only if it is learned by a specific policy $\pi^*_\zeta$ (the Bayes-optimal policy). To define it, we define the fuzzy belief $\phi_\zeta$ by
$$\phi_\zeta(\mu) := \sup_{\substack{\sigma : \operatorname{supp}\zeta \to E \\ \mathbb{E}_{\phi\sim\zeta}[\sigma(\phi)] = \mu}} \mathbb{E}_{\phi\sim\zeta}\left[\phi(\sigma(\phi))\right]$$
We now define $\pi^*_\zeta := \pi^*_{\phi_\zeta}$.
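A quick sanity check (mine): if $\zeta$ is concentrated on a single fuzzy belief $\phi_0$, then a selection $\sigma$ just picks one environment, the constraint forces $\sigma(\phi_0) = \mu$, and we recover
$$\phi_\zeta = \phi_0, \qquad \pi^*_\zeta = \pi^*_{\phi_0}$$
as expected.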
Now, updating (EDIT: the original definition was needlessly complicated; simplified):
Consider a history $h \in (A \times O)^*$ or $h \in (A \times O)^* \times A$. Here $A$ is the set of actions and $O$ is the set of observations. Define $\mu^*_\phi$ by
$$\mu^*_\phi := \arg\max_{\mu \in E} \min_\pi \left(\phi(\mu) - \mathbb{E}_{\mu\pi}[U]\right)$$
Let $E'$ be the space of “environments starting from $h$”. That is, if $h \in (A \times O)^*$ then $E' = E$, and if $h \in (A \times O)^* \times A$ then $E'$ is slightly different because the history now begins with an observation instead of with an action.
For any $\mu \in E$, $\nu \in E'$ we define $[\nu]^h_\mu \in E$ by
$$[\nu]^h_\mu(o \mid h') := \begin{cases} \nu(o \mid h'') & \text{if } h' = hh'' \\ \mu(o \mid h') & \text{otherwise} \end{cases}$$
Then, the updated fuzzy belief is
$$(\phi \mid h)(\nu) := \phi\!\left([\nu]^h_{\mu^*_\phi}\right) + \text{constant}$$
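Finally, a small code sketch of the splice $[\nu]^h_\mu$ and of the resulting update, under my own toy representation of an environment as a function from histories (tuples of actions and observations) to distributions over observations; $\mu^*_\phi$ and the additive constant are taken as given here.

```python
# Toy sketch of the splice [nu]^h_mu and of the update (phi | h).
# Assumptions (mine, for illustration): an environment is a function mapping a
# history (a tuple of actions/observations) to a distribution over observations;
# mu_star stands for mu^*_phi, computed elsewhere; the constant is left abstract.

def splice(nu, h, mu):
    """Return [nu]^h_mu: behaves like nu on histories extending h, like mu elsewhere."""
    def spliced(history):
        if history[:len(h)] == h:          # history has the form h + h''
            return nu(history[len(h):])    # defer to nu on the continuation h''
        return mu(history)                 # otherwise defer to mu
    return spliced

def update(phi, h, mu_star, constant=0.0):
    """Return the updated fuzzy belief (phi | h)(nu) = phi([nu]^h_{mu^*_phi}) + constant."""
    def phi_given_h(nu):
        return phi(splice(nu, h, mu_star)) + constant
    return phi_given_h
```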