% operators that are separated from the operand by a space
% autosize deliminaters
% operators that require brackets
% operators that require parentheses
% Paper specific
We derive a regret bound for DRL reflecting dependence on:
Number of hypotheses
Mixing time of MDP hypotheses
The probability with which the advisor takes optimal actions
That is, the regret bound we get is fully explicit up to a multiplicative constant (which can also be made explicit). Currently we focus on plain (as opposed to catastrophe) and uniform (finite number of hypotheses, uniform prior) DRL, although this result can and should be extended to the catastrophe and/or non-uniform settings.
Appendix A contains the proofs, Appendix B recalls a few lemmas we need from previous essays.
##Results
First, we briefly recall some properties of Markov chains.
#Definition 1
Consider S a finite set and T:Sk→S. We say that k∈N+ is a period of T when there is s∈S an essential state of T (that is, T∞(s∣s)>0) s.t.k is its period, i.e. k=gcd{n∈N+∣Tn(s∣s)>0}. We denote PT the set of periods of T.
The following is a corollary of the Perron-Frobenius theorem which we give without proof. [I believe this is completely standard and would be grateful to get a source for this which treats the reducible case; of course I can produce the proof but it seems redundant.]
#Proposition 1
Consider S a finite set and T:Sk→S. Then, there are F∈(0,∞), λ∈(0,1), ζ:Sk→PT and ξ:S×PTk→S s.t. for any s∈S
∀k∈PT:Tkξ(s,k)=ξ(s,k)
∀n∈N:dtv(Tn(s),Ek∼ζ(s)[Tnξ(s,k)])≤Fλn
For the purpose of this essay, we will use a definition of local sanity slightly stronger than what previously appeared as “Definition 4.” We think this strengthening is not substantial, but also the current analysis can be generalized to the weaker case by adding a term proportional to the 2nd derivative of V (or the 2nd moment of the mixing time). We leave out the details for the time being.
We will use the notation AωM(s):=⋂k∈NAkM(s).
#Definition 2
Let υ=(μ,r) be a universe and ϵ∈(0,1). A policy σ is said to be locally ϵ-sane for υ when there are M, S and UM⊆SM (the set of uncorrupt states) s.t.υ is an O-realization of M with state function S, S(λ)∈UM and for any h∈hdomμ, if S(h)∈UM then
i. ∀a∈suppσ(h):TM(UM∣S(h),a)=1
ii. suppσ(h)⊆A0M(S(h))
iii. ∃a∈AωM(S(h)):σ(a∣h)>ϵ
Recall that for any MPD M, there is γM∈(0,1) s.t. for any γ∈[γM,1), a∈AωM(s) if and only if QM(s,a,γ)=VM(s,γ).
We are now ready to state the regret bound.
#Theorem 1
There is some C∈(0,∞) s.t. the following holds.
Consider I an interface, α,ϵ∈(0,1), H={υk=(μk,rk)∈ΥI}k∈[N] for some N∈N. For each k∈[N], let σk be a locally ϵ-sane policy for υk. For each k∈[N], let Mk be the corresponding MDP and Uk⊆SMk the corresponding set of uncorrupt states. Assume that
i. For each k∈[N], α≤1−γMk.
ii. α≤ϵN2lnN
iii. For any k,j∈[N] and h∈hdomμk∩hdomμj, if Sk(h)∈Uk and Sj(h)∈Uj, then rk(h)=rj(h).
The factor maxs∈Uksupγ∈[1−α,1]∣∣∣dVMj(s,γ)dγ∣∣∣ might seem difficult to understand. However, it can be bounded as follows.
#Proposition 2
Let M be an MDP, π a Blackwell optimal policy for M and F∈(0,∞), λ∈(0,1), P⊆N+ as in Proposition 1 applied to the Markov chain TMπ. Then
maxs∈SMsupγ∈[γM,1)∣∣∣dVM(s,γ)dγ∣∣∣≤F1+λ1−λ+maxP
Theorem 1 and Proposition 2 immediately give the following:
#Corollary 1
There is some C′∈(0,∞) s.t. the following holds.
Under the conditions of Theorem 1, let πk:SMk→A be s.t. for any s∈Uk
i. πk(s)∈AωMk(s)
ii. TMkπk(Uk∣s)=1
Assuming w.l.o.g. that all uncorrupt states are reachable from Sk(λ), πk is guaranteed to exist thanks to condition iii of Definition 2 (if some uncorrupt state is unreachable, we can consider it to be corrupt.) Let Fk∈(0,∞), λk∈(0,1) and Pk⊆N+ be as in Proposition 1, for the Markov chain TMkπk:Ukk→Uk. Then, there is an ¯I-policy π∗ s.t. for any k∈[N]
The proof of Theorem 1 mostly repeats the proof of the previous “Theorem 1”, except that we keep track of the bounds more carefully.
#Proof of Theorem 1
Fix η∈(0,N−1) and T∈N+. Denote νk:=¯μk[σkSk]. To avoid cumbersome notation, whenever Mk should appear a subscript, we will replace it by k. Let (Ω,P∈ΔΩ) be a probability space\Comment{ and {Fn⊆2Ω}n∈N⊔{−1} a filtration of Ω}. Let K:Ω→[N] be \Comment{measurable w.r.t. F−1}a random variable and the following be stochastic processes\Comment{ adapted to F}
Zn,~Zn:Ω→Δ[N]
Jn:Ω→[N]
Ψn:Ω→A
An:Ω→¯A
Θn:Ω→¯O
We also define AΘ:n:Ω→¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗ by
AΘ:n:=A0Θ0A1Θ1…An−1Θn−1
(The following conditions on A and Θ imply that the range of the above is indeed in ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗.) Let D and D!k be as in Proposition B.1 (the form of the bound we are proving allows assuming w.l.o.g. that ϵ<1|A|). By condition iii of Definition 2, there is πk:hdomμk→A s.t. for any h∈hdomμk, if Sk(h)∈Uk then
σk(πk(h)∣h)>ϵ
πk(h)∈Aωk(Sk(h))
We construct Ω\Comment{, F}, K, Z, ~Z, J, Ψ, A and Θ s.t K is uniformly distributed and for any k∈[N], l∈N, m∈[T] and o∈O, denoting n=lT+m
Note that the last equation has the form of a Bayesian update which is allowed to be arbitrary when update is on “impossible” information.
We now construct the ¯I-policy π∗ s.t. for any n∈N, h∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗ s.t.Pr[AΘ:n=h]>0 and a∈¯A
π∗(a∣h):=Pr[An=a∣AΘ:n=h]
That is, we perform Thompson sampling at time intervals of size T, moderated by the delegation routine D, and discard from our belief state hypotheses whose probability is below η and hypotheses sampling which resulted in recommending “unsafe” actions i.e. actions that D refused to perform.
In order to prove π∗ has the desired property, we will define the stochastic processes Z!, ~Z!, J!, Ψ!, A! and Θ!, each process of the same type as its shriekless counterpart (thus Ω is constructed to accommodate them). These processes are required to satisfy the following:
Here, the factor of 2 comes from the difference between the equations for Zn and Z!n (we can construct and intermediate policy between π∗ and π?k and use the triangle inequality for dtv). We conclude
Without loss of generality, we can assume this to be consistent with the assumption η<1N since the bound contains a term of N2η anyway. Similarly, the second term in the bound implies we can assume that 1−(1−α)T≪1 and therefore 1−(1−α)T=O(Tα). Finally, note that assumption ii implies that the expression we round to get T is ≥1. We get
EU∗υk(1−α)−EUπ∗¯υk[σk](1−α)=O((αN6lnNϵ−1¯τ)1/4)
#Proposition A.1
Fix n∈N, m∈[n]. Then, for x∈(0,1) we have
∣∣∣ddx(xm1−x1−xn)∣∣∣≤1
#Proof of Proposition A.1
ddx(xm1−x1−xn)=mxm−11−x1−xn+xmddx(1−x1−xn)
We will establish the claim by showing that the first term is in [0,1] and the second term is in [−1,0].
For the first term, we have (assuming w.l.o.g. that m≥1)
Applying Proposition A.2 to the first term, we get
∣∣
∣∣dVperM(s,γ)dγ∣∣
∣∣≤F(λ⋅11−λ+11−γλ)≤F1+λ1−λ
Collecting the terms, we have
∣∣∣dVM(s,γ)dγ∣∣∣≤Ek∼ζ(s)[k]+F1+λ1−λ≤maxP+F1+λ1−λ
##Appendix B
The following appeared in a previous essay as “Proposition C.1”.
#Proposition B.1
Fix an interface I=(A,O), N∈N, ϵ∈(0,1|A|), η∈(0,1N). Consider some {σk:(A×O)∗k→A}k∈[N]. Then, there exist D:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×A→¯A and {D!k:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×A→¯A}k∈[N] with the following properties. Given x∈(Aׯ¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O)∗, we denote x–– its projection to ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗. Thus, x––––∈(A×O)∗.
Given μ an I-environment, π:hdomμk→A, D′:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×A→¯A and k∈[N], we can define Ξ[μ,σk,D′,π]∈Δ(Aׯ¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O)ω as follows
We require that for every π, μ and k as above, the following conditions hold
i. 1NN−1∑j=0Ex∼Ξ[μ,σj,D!j,π][∣∣{n∈N∣xn∈A×⊥ׯO}∣∣]≤lnNηln(1+ϵ(1−ϵ)(1−ϵ)/ϵ)=O(lnNηϵ)
ii. dtv(1N∑N−1j=0Ξ[μ,σj,D,π],1N∑N−1j=0Ξ[μ,σj,D!j,π])≤(N−1)η
iii. For all x∈hdom¯μ[σk], if D!k(x,π(x––))≠⊥ then σk(D!k(x,π(x––))∣x––)>0
iv. For all x∈hdom¯μ[σk], if D!k(x,π(x––))∉{π(x––),⊥} then σk(π(x––)∣x––)≤ϵ
The following is a simple modification of what appeared there as “Proposition B.2” (the corresponding modification of the proof is trivial and we leave it out.)
#Proposition B.2
Consider some γ∈(0,1), τ∈(0,∞), T∈N+, a universe υ=(μ,r) that is an O-realization of M with state function S, some π∗:hdomμ→A and some π0:hdomμk→A. Assume that γ≥γM. For any n∈N, let π∗n be an I-policy s.t. for any h∈hdomμ
The following appeared previously as “Proposition A.1”.
#Proposition B.3
Consider a probability space (Ω,P∈ΔΩ), N∈N, R⊆[0,1] a finite set and random variables U:Ω→R, K:Ω→[N] and J:Ω→[N]. Assume that K∗P=J∗P=ζ∈Δ[N] and I[K;J]=0. Then
More precise regret bound for DRL
% operators that are separated from the operand by a space
% autosize deliminaters
% operators that require brackets
% operators that require parentheses
% Paper specific
We derive a regret bound for DRL reflecting dependence on:
Number of hypotheses
Mixing time of MDP hypotheses
The probability with which the advisor takes optimal actions
That is, the regret bound we get is fully explicit up to a multiplicative constant (which can also be made explicit). Currently we focus on plain (as opposed to catastrophe) and uniform (finite number of hypotheses, uniform prior) DRL, although this result can and should be extended to the catastrophe and/or non-uniform settings.
Appendix A contains the proofs, Appendix B recalls a few lemmas we need from previous essays.
##Results
First, we briefly recall some properties of Markov chains.
#Definition 1
Consider S a finite set and T:Sk→S. We say that k∈N+ is a period of T when there is s∈S an essential state of T (that is, T∞(s∣s)>0) s.t.k is its period, i.e. k=gcd{n∈N+∣Tn(s∣s)>0}. We denote PT the set of periods of T.
The following is a corollary of the Perron-Frobenius theorem which we give without proof. [I believe this is completely standard and would be grateful to get a source for this which treats the reducible case; of course I can produce the proof but it seems redundant.]
#Proposition 1
Consider S a finite set and T:Sk→S. Then, there are F∈(0,∞), λ∈(0,1), ζ:Sk→PT and ξ:S×PTk→S s.t. for any s∈S
∀k∈PT:Tkξ(s,k)=ξ(s,k)
∀n∈N:dtv(Tn(s),Ek∼ζ(s)[Tnξ(s,k)])≤Fλn
For the purpose of this essay, we will use a definition of local sanity slightly stronger than what previously appeared as “Definition 4.” We think this strengthening is not substantial, but also the current analysis can be generalized to the weaker case by adding a term proportional to the 2nd derivative of V (or the 2nd moment of the mixing time). We leave out the details for the time being.
We will use the notation AωM(s):=⋂k∈NAkM(s).
#Definition 2
Let υ=(μ,r) be a universe and ϵ∈(0,1). A policy σ is said to be locally ϵ-sane for υ when there are M, S and UM⊆SM (the set of uncorrupt states) s.t.υ is an O-realization of M with state function S, S(λ)∈UM and for any h∈hdomμ, if S(h)∈UM then
i. ∀a∈suppσ(h):TM(UM∣S(h),a)=1
ii. suppσ(h)⊆A0M(S(h))
iii. ∃a∈AωM(S(h)):σ(a∣h)>ϵ
Recall that for any MPD M, there is γM∈(0,1) s.t. for any γ∈[γM,1), a∈AωM(s) if and only if QM(s,a,γ)=VM(s,γ).
We are now ready to state the regret bound.
#Theorem 1
There is some C∈(0,∞) s.t. the following holds.
Consider I an interface, α,ϵ∈(0,1), H={υk=(μk,rk)∈ΥI}k∈[N] for some N∈N. For each k∈[N], let σk be a locally ϵ-sane policy for υk. For each k∈[N], let Mk be the corresponding MDP and Uk⊆SMk the corresponding set of uncorrupt states. Assume that
i. For each k∈[N], α≤1−γMk.
ii. α≤ϵN2lnN
iii. For any k,j∈[N] and h∈hdomμk∩hdomμj, if Sk(h)∈Uk and Sj(h)∈Uj, then rk(h)=rj(h).
Then, there is an ¯I-policy π∗ s.t. for any k∈[N]
EU∗υk(1−α)−EUπ∗¯υk[σk](1−α)≤C(αN5lnN(ϵ−1+|A|)N−1∑j=0(maxs∈Ujsupγ∈[1−α,1)∣∣∣dVMj(s,γ)dγ∣∣∣+1))1/4
The factor maxs∈Uksupγ∈[1−α,1]∣∣∣dVMj(s,γ)dγ∣∣∣ might seem difficult to understand. However, it can be bounded as follows.
#Proposition 2
Let M be an MDP, π a Blackwell optimal policy for M and F∈(0,∞), λ∈(0,1), P⊆N+ as in Proposition 1 applied to the Markov chain TMπ. Then
maxs∈SMsupγ∈[γM,1)∣∣∣dVM(s,γ)dγ∣∣∣≤F1+λ1−λ+maxP
Theorem 1 and Proposition 2 immediately give the following:
#Corollary 1
There is some C′∈(0,∞) s.t. the following holds.
Under the conditions of Theorem 1, let πk:SMk→A be s.t. for any s∈Uk
i. πk(s)∈AωMk(s)
ii. TMkπk(Uk∣s)=1
Assuming w.l.o.g. that all uncorrupt states are reachable from Sk(λ), πk is guaranteed to exist thanks to condition iii of Definition 2 (if some uncorrupt state is unreachable, we can consider it to be corrupt.) Let Fk∈(0,∞), λk∈(0,1) and Pk⊆N+ be as in Proposition 1, for the Markov chain TMkπk:Ukk→Uk. Then, there is an ¯I-policy π∗ s.t. for any k∈[N]
EU∗υk(1−α)−EUπ∗¯υk[σk](1−α)≤C′(αN5lnN(ϵ−1+|A|)N−1∑j=0(Fj1−λj+maxPj))1/4
##Appendix A
The proof of Theorem 1 mostly repeats the proof of the previous “Theorem 1”, except that we keep track of the bounds more carefully.
#Proof of Theorem 1
Fix η∈(0,N−1) and T∈N+. Denote νk:=¯μk[σkSk]. To avoid cumbersome notation, whenever Mk should appear a subscript, we will replace it by k. Let (Ω,P∈ΔΩ) be a probability space\Comment{ and {Fn⊆2Ω}n∈N⊔{−1} a filtration of Ω}. Let K:Ω→[N] be \Comment{measurable w.r.t. F−1}a random variable and the following be stochastic processes\Comment{ adapted to F}
Zn,~Zn:Ω→Δ[N]
Jn:Ω→[N]
Ψn:Ω→A
An:Ω→¯A
Θn:Ω→¯O
We also define AΘ:n:Ω→¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗ by
AΘ:n:=A0Θ0A1Θ1…An−1Θn−1
(The following conditions on A and Θ imply that the range of the above is indeed in ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗.) Let D and D!k be as in Proposition B.1 (the form of the bound we are proving allows assuming w.l.o.g. that ϵ<1|A|). By condition iii of Definition 2, there is πk:hdomμk→A s.t. for any h∈hdomμk, if Sk(h)∈Uk then
σk(πk(h)∣h)>ϵ
πk(h)∈Aωk(Sk(h))
We construct Ω\Comment{, F}, K, Z, ~Z, J, Ψ, A and Θ s.t K is uniformly distributed and for any k∈[N], l∈N, m∈[T] and o∈O, denoting n=lT+m
~Z0(k)≡1N
Zn(k)=~Zn(k)[[~Zn(k)≥η]]∑N−1j=0~Zn(j)[[~Zn(j)≥η]]
Pr[Jl=k∣ZlT]=ZlT(k)
Ψn=πJl(AΘ:n)
Pr[Θn=o∣AΘ:n]=νK(o∣AΘ:n)
An=D(AΘ:n,Ψn)
~Zn+1(k)N−1∑j=0Zn(j)[[An=D!j(AΘ:n,Ψn)]]νj(Θn∣AΘ:nAn)=Zn(k)[[An=D!k(AΘ:n,Ψn)]]νk(Θn∣AΘ:nAn)
Note that the last equation has the form of a Bayesian update which is allowed to be arbitrary when update is on “impossible” information.
We now construct the ¯I-policy π∗ s.t. for any n∈N, h∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗ s.t.Pr[AΘ:n=h]>0 and a∈¯A
π∗(a∣h):=Pr[An=a∣AΘ:n=h]
That is, we perform Thompson sampling at time intervals of size T, moderated by the delegation routine D, and discard from our belief state hypotheses whose probability is below η and hypotheses sampling which resulted in recommending “unsafe” actions i.e. actions that D refused to perform.
In order to prove π∗ has the desired property, we will define the stochastic processes Z!, ~Z!, J!, Ψ!, A! and Θ!, each process of the same type as its shriekless counterpart (thus Ω is constructed to accommodate them). These processes are required to satisfy the following:
~Z!0(k)≡1N
Z!n(k)=~Z!n(k)[[~Z!n(k)≥η]]∑N−1j=0~Z!n(j)[[~Z!n(j)≥η]][[~Z!n(K)≥η]]+[[K=k]]⋅[[~Z!n(K)<η]]
Pr[J!l=k∣Z!lT]=Z!lT(k)
Ψ!n=πJ!l(AΘ!:n)
Pr[Θ!n=o∣AΘ!:n]=νK(o∣AΘ!:n)
A!n=D!K(AΘ!:n,Ψ!n)
~Z!n+1(k)=Z!n(k)[[A!n=D!k(AΘ!:n,Ψ!n)]]νk(Θ!n∣AΘ!:nA!n)∑N−1j=0Z!n(j)[[A!n=D!j(AΘ!:n,Ψ!n)]]νj(Θ!n∣AΘ!:nA!n)
For any k∈[N], we construct the ¯I-policy π?k s.t. for any n∈N, h∈¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗ s.t.Pr[AΘ!:n=h, K=k]>0 and a∈¯A
π?k(a∣h):=Pr[A!n=a∣AΘ!:n=h, K=k]
Given any ¯I-policy π and I-policy σ we define ασπ:(A×O)∗k→¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗ by
ασπ(g∣h):=[[h=g–]]Ch|h|−1∏n=0∑a∈A([[gn∈⊥aO]]π(⊥∣g:n)σ(a∣h:n)+[[gn∈a⊥O]]π(a∣g:n))
Here, Ch∈R is a constant defined s.t. the probabilities sum to 1. We define the I-policy [σ]π–– by
[σ]π––(a∣h):=Prg∼ασπ(h)[π(g)=a∨(π(g)=⊥∧σ(h)=a)]
Define τk∈(0,1) by
τk:=maxs∈Uksupγ∈[1−α,1)∣∣∣dVk(s,γ)dγ∣∣∣+1
Condition iii of Proposition B.1 and condition ii of Definition 2 imply that for any h∈hdomμk
supp[σk]π––?k(h)⊆A0k(Sk(h))
This means we can apply Proposition B.2 and get
EU∗υk(1−α)−EUπ?k¯υk[σk](1−α)≤α∞∑n=0T−1∑m=0(1−α)nT+m(Ex∼μk⋈π∗kn[r(x:nT+m)]−Ex∼νk⋈π?k[r(x––:nT+m)])+2τkα1−(1−α)T
Here, the I-policy π∗kn is defined as π∗n in Proposition B.2, with πk in place of π∗. We also define the ¯I-policies π!kn and π!!kn by
π!kn(a∣h):={π?k(a∣h) if |h|<nTPr[A!|h|=a∣AΘ!:|h|=h, K=k, J!n=k] otherwise
π!!kn(a∣h):=⎧⎪⎨⎪⎩π?k(a∣h) if |h|<nTπ!kn(a∣h)+π!kn(⊥∣h)⋅πkn(a∣h––) if |h|≥nT and a≠⊥0 if |h|≥nT and a=⊥
Denote
ρ∗kn:=μk⋈π∗kn
ρ!!kn:=νk⋈π!!kn
ρ!kn:=νk⋈π!kn
ρ?k:=νk⋈π?k
R?k:=EU∗υk(1−α)−EUπ?k¯υk[σk](1−α)
For each n∈N, denote
EU∗kn:=α1−(1−α)TT−1∑m=0(1−α)mEx∼ρ∗kn[r(x:nT+m)]
EU!!kn:=α1−(1−α)TT−1∑m=0(1−α)mEx∼ρ!!kn[r(x––:nT+m)]
EU!kn:=α1−(1−α)TT−1∑m=0(1−α)mEx∼ρ!kn[r(x––:nT+m)]
EU?kn:=α1−(1−α)TT−1∑m=0(1−α)mEx∼ρ?k[r(x––:nT+m)]
We have
R?k≤(1−(1−α)T)∞∑n=0(1−α)nT(EU∗kn−EU?kn)+2τkα1−(1−α)T
R?k≤(1−(1−α)T)∞∑n=0(1−α)nT(EU∗kn−EU!!kn+EU!!kn−EU!kn+EU!kn−EU?kn)+2τkα1−(1−α)T
Condition iv of Proposition B.1 implies that, given h∈hdomνk s.t. |h|≥nT
suppπ!kn(h)⊆{πk(Sk(h)),⊥}
π!!kn(πk(Sk(h))∣h)=1
Therefore, π!!kn=π∗kn, and we remain with
R?k≤(1−(1−α)T)∞∑n=0(1−α)nT(EU!!kn−EU!kn+EU!kn−EU?kn)+2τkα1−(1−α)T
We have
∣∣EU!!kn−EU!kn∣∣≤Prx∼ρ!kn[∃m∈[T]:xnT+m∈⊥¯O]
Since Z!nT(K)≥η, it follows that
∣∣EU!!kn−EU!kn∣∣≤1ηPrx∼ρ?k[∃m∈[T]:xnT+m∈⊥¯O]≤1ηEx∼ρ?k[∣∣{m∈[T]∣xnT+m∈⊥¯O}∣∣]
∞∑n=0∣∣EU!!kn−EU!kn∣∣≤1ηEx∼ρ?k[∣∣{n∈N∣xn∈⊥¯O}∣∣]
Using condition i of Proposition B.1, we conclude
1NN−1∑k=0R?k≤1−(1−α)TNN−1∑k=0∞∑n=0(1−α)nT(EU!kn−EU?kn)+O((1−(1−α)T)lnNη2ϵ+τkα1−(1−α)T)
Define the random variables {U!n:Ω→[0,1]}n∈N by
U!n:=α1−(1−α)TT−1∑m=0(1−α)mr(AΘ!:nT+m)
Denote
¯τ:=1NN−1∑k=0τk
β:=(1−(1−α)T)lnNη2ϵ+¯τα1−(1−α)T
We get
1NN−1∑k=0R?k≤(1−(1−α)T)∞∑n=0(1−α)nTE[E[U!n∣J!n=K, Z!nT]−E[U!n∣Z!nT]]+O(β)
1NN−1∑k=0R?k= ⎷(1−(1−α)T)∞∑n=0(1−α)nTE[(E[U!n∣J!n=K, Z!nT]−E[U!n∣Z!nT])2]+O(β)
We apply Proposition B.3 to each term in the sum over n.
1NN−1∑k=0R?k= ⎷(1−(1−α)T)∞∑n=0(1−α)nTE[12ηI[K;J!n,U!n∣Z!nT]]+O(β)
1NN−1∑k=0R?k= ⎷1−(1−α)T2η∞∑n=0E[H(Z!nT)−H(Z!(n+1)T)]+O(β)
1NN−1∑k=0R?k=O(√1−(1−α)TηlnN+(1−(1−α)T)lnNη2ϵ+¯τα1−(1−α)T)
Condition ii of Proposition B.1 implies that
dtv(1NN−1∑k=0¯νk[σk]⋈π∗, 1NN−1∑k=0¯νk[σk]⋈π?k)≤2(N−1)η
Here, the factor of 2 comes from the difference between the equations for Zn and Z!n (we can construct and intermediate policy between π∗ and π?k and use the triangle inequality for dtv). We conclude
1NN−1∑k=0(EU∗υk(1−α)−EUπ∗¯υk[σk](1−α))=O(Nη+√1−(1−α)TηlnN+(1−(1−α)T)lnNη2ϵ+¯τα1−(1−α)T)
For each k∈[N], we get
EU∗υk(1−α)−EUπ∗¯υk[σk](1−α)=N⋅O(Nη+√1−(1−α)TηlnN+(1−(1−α)T)lnNη2ϵ+¯τα1−(1−α)T)
Now we set
η:=α1/4N−1/2(lnN)1/4ϵ−1/4¯τ1/4
T:=⌈α−1/4N−1/2(lnN)−1/4ϵ1/4¯τ3/4⌉
Without loss of generality, we can assume this to be consistent with the assumption η<1N since the bound contains a term of N2η anyway. Similarly, the second term in the bound implies we can assume that 1−(1−α)T≪1 and therefore 1−(1−α)T=O(Tα). Finally, note that assumption ii implies that the expression we round to get T is ≥1. We get
EU∗υk(1−α)−EUπ∗¯υk[σk](1−α)=O((αN6lnNϵ−1¯τ)1/4)
#Proposition A.1
Fix n∈N, m∈[n]. Then, for x∈(0,1) we have
∣∣∣ddx(xm1−x1−xn)∣∣∣≤1
#Proof of Proposition A.1
ddx(xm1−x1−xn)=mxm−11−x1−xn+xmddx(1−x1−xn)
We will establish the claim by showing that the first term is in [0,1] and the second term is in [−1,0].
For the first term, we have (assuming w.l.o.g. that m≥1)
0<mxm−11−x1−xn=mxm−1∑n−1k=0xk≤mxm−1∑m−1k=0xk≤mxm−1∑m−1k=0xm−1=1
For the second term, we have
ddx(1−x1−xn)=(−1)⋅(1−xn)−(1−x)⋅(−nxn−1)(1−xn)2=−(n−1)xn+nxn−1−1(1−xn)2
Assume w.l.o.g. that n≥2. First, we show that the numerator (and hence the entire fraction) is negative. Denote p(x):=−(n−1)xn+nxn−1−1. We have
dp(x)dx=n(n−1)(xn−2−xn−1)
Therefore, p(x) is increasing in [0,1]. Observing that p(1)=0, we conclude that p is negative in (0,1).
Now we show that the fraction is ≥−1. Equivalently, we need to show that p(x)+(1−xn)2≥0. We have
p(x)+(1−xn)2=−(n−1)xn+nxn−1−1+1−2xn+x2n=x2n−(n+1)xn+nxn−1=xn−1(xn+1−(n+1)x+n)
Denote q(x):=xn+1−(n+1)x+n. We have
dq(x)dx=(n+1)(xn−1)
Therefore, q(x) is decreasing in [0,1]. Observing that q(1)=0 we conclude that q is positive in (0,1), establishing the desired result.
#Proposition A.2
For any x,y∈(0,1):
1−x(1−xy)2≤11−y
#Proof of Proposition A.2
Consider p(z):=z3−3z+2. We have
dp(z)dz=3z2−3
p(z) has a local minimum at z=1 and a local maximum at z=−1. Therefore, for z≥−1, p(z)≥p(1)=0. For z≥0, we get
z2(z2−3)+2z=z4−3z2+2z=zp(z)≥0
Taking z:=12(x+y), we get
(x+y2)2((x+y2)2−3)+x+y≥0
Now, 12(x+y)≥√xy and t(t−3) is a decreasing function of t for t∈[0,32]. We conclude
xy(xy−3)+x+y≥0
x2y2−3xy+x+y≥0
x2y2−2xy+1≥xy−x−y+1
(1−xy)2≥(1−x)(1−y)
11−y≥1−x(1−xy)2
#Proof of Proposition 2
VM(s,γ)=(1−γ)∞∑n=0γn∑t∈SMTnMπ(t∣s)RM(t)
By Proposition 1, we have
∑t∈SMTnMπ(t∣s)RM(t)=Ek∼ζ(s)⎡⎣∑t∈SMTnMπξ(t∣s,k)RM(t)⎤⎦+δn
where, |δn|≤Fλn. Denote
VperM(s,γ):=(1−γ)∞∑n=0γnδn
VcycMk(s,γ):=(1−γ)∞∑n=0γn∑t∈SMTnMπξ(t∣s,k)RM(t)
We get
VM(s,γ)=Ek∼ζ(s)[VcycMk(s,γ)]+VperM(s,γ)
dVM(s,γ)dγ=Ek∼ζ(s)[dVcycMk(s,γ)dγ]+dVperM(s,γ)dγ
We now analyze each term separately.
Denote
rkn:=∑t∈SMTnMπξ(t∣s,k)RM(t)
Note that, due to the property of ξ, rkn is periodic in n with period k. We have
VcycMk(s,γ):=(1−γ)∞∑n=0γnrkn=(1−γ)k−1∑i=0∞∑n=0γnk+irki=k−1∑i=0(1−γ)γi1−γkrki
∣∣ ∣∣dVcycMk(s,γ)dγ∣∣ ∣∣≤k−1∑i=0∣∣ ∣∣ddγ((1−γ)γi1−γk)∣∣ ∣∣rki
By Proposition A.1, we get
∣∣ ∣∣dVcycMk(s,γ)dγ∣∣ ∣∣≤k−1∑i=0rki≤k
Now we analyze the second term.
VperM(s,γ)=δ0+∞∑n=0γn+1(δn+1−δn)
dVperM(s,γ)dγ=∞∑n=0(n+1)γn(δn+1−δn)=∞∑n=0(nγn−1−(n+1)γn)δn=∞∑n=0((1−γ)nγn−1−γn)δn
∣∣ ∣∣dVperM(s,γ)dγ∣∣ ∣∣≤F∞∑n=0((1−γ)nγn−1+γn)λn=F((1−γ)∞∑n=0nγn−1λn+11−γλ)
∣∣ ∣∣dVperM(s,γ)dγ∣∣ ∣∣≤F((1−γ)ddγ∞∑n=0γnλn+11−γλ)=F((1−γ)ddγ(11−γλ)+11−γλ)
∣∣ ∣∣dVperM(s,γ)dγ∣∣ ∣∣≤F(λ⋅1−γ(1−γλ)2+11−γλ)
Applying Proposition A.2 to the first term, we get
∣∣ ∣∣dVperM(s,γ)dγ∣∣ ∣∣≤F(λ⋅11−λ+11−γλ)≤F1+λ1−λ
Collecting the terms, we have
∣∣∣dVM(s,γ)dγ∣∣∣≤Ek∼ζ(s)[k]+F1+λ1−λ≤maxP+F1+λ1−λ
##Appendix B
The following appeared in a previous essay as “Proposition C.1”.
#Proposition B.1
Fix an interface I=(A,O), N∈N, ϵ∈(0,1|A|), η∈(0,1N). Consider some {σk:(A×O)∗k→A}k∈[N]. Then, there exist D:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×A→¯A and {D!k:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×A→¯A}k∈[N] with the following properties. Given x∈(Aׯ¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O)∗, we denote x–– its projection to ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗. Thus, x––––∈(A×O)∗. Given μ an I-environment, π:hdomμk→A, D′:¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O∗×A→¯A and k∈[N], we can define Ξ[μ,σk,D′,π]∈Δ(Aׯ¯¯¯¯¯¯¯¯¯¯¯¯¯¯A×O)ω as follows
Ξ[μ,σk,D′,π](b,a,o∣x):=π(b∣x––––)D′(a∣x––,b)¯μ[σk](o∣x––a)
We require that for every π, μ and k as above, the following conditions hold
i. 1NN−1∑j=0Ex∼Ξ[μ,σj,D!j,π][∣∣{n∈N∣xn∈A×⊥ׯO}∣∣]≤lnNηln(1+ϵ(1−ϵ)(1−ϵ)/ϵ)=O(lnNηϵ)
ii. dtv(1N∑N−1j=0Ξ[μ,σj,D,π],1N∑N−1j=0Ξ[μ,σj,D!j,π])≤(N−1)η
iii. For all x∈hdom¯μ[σk], if D!k(x,π(x––))≠⊥ then σk(D!k(x,π(x––))∣x––)>0
iv. For all x∈hdom¯μ[σk], if D!k(x,π(x––))∉{π(x––),⊥} then σk(π(x––)∣x––)≤ϵ
The following is a simple modification of what appeared there as “Proposition B.2” (the corresponding modification of the proof is trivial and we leave it out.)
#Proposition B.2
Consider some γ∈(0,1), τ∈(0,∞), T∈N+, a universe υ=(μ,r) that is an O-realization of M with state function S, some π∗:hdomμ→A and some π0:hdomμk→A. Assume that γ≥γM. For any n∈N, let π∗n be an I-policy s.t. for any h∈hdomμ
π∗n(h):={π0(h) if |h|<nTπ∗(h) otherwise
Assume that for any h∈hdomμ
i. π∗(s)∈AωM(S(h))
ii. suppπ0(h)⊆A0M(S(h))
iii. For any θ∈(γ,1) ∣∣∣dVM(S(h),θ)dθ∣∣∣≤τ
Then
EU∗υ(γ)−EUπ0υ(γ)≤(1−γ)∞∑n=0T−1∑m=0γnT+m(Ex∼μ⋈π∗n[r(x:nT+m)]−Ex∼μ⋈π0[r(x:nT+m)])+2τγT(1−γ)1−γT
The following appeared previously as “Proposition A.1”.
#Proposition B.3
Consider a probability space (Ω,P∈ΔΩ), N∈N, R⊆[0,1] a finite set and random variables U:Ω→R, K:Ω→[N] and J:Ω→[N]. Assume that K∗P=J∗P=ζ∈Δ[N] and I[K;J]=0. Then
I[K;J,U]≥2(mini∈[N]ζ(i))(E[U∣J=K]−E[U])2