TLDR: I derive a variant of the RL regret bound by Osband and Van Roy (2014), that applies to learning without resets of environments without traps. The advantage of this regret bound over those known in the literature, is that it scales with certain learning-theoretic dimensions rather than number of states and actions. My goal is building on this result to derive this type of regret bound for DRL, and, later on, other settings interesting from an AI alignment perspective.

Previously I derived a regret bound for deterministic environments that scales with prior entropy and “prediction dimension”. That bound behaves as $O (\sqrt{1 - γ})$ in the episodic setting but only as $O (\sqrt[3]{1 - γ})$ in the setting without resets. Moreover, my attempts to generalize the result to stochastic environments led to bounds that are even weaker (have a lower exponent). Therefore, I decided to put that line of attack on hold, and use Osband and Van Roy’s technique instead, leading to an $O (\sqrt{1 - γ})$ bound in the stochastic setting without resets. This bound doesn’t scale down with the entropy of the prior, but this does not seem as important as dependence on $γ$ .

Results

Russo and Van Roy (2013) introduced the concept of “eluder dimension” for the benefit of the multi-armed bandit setting, and Osband and Van Roy extended it to a form suitable for studying reinforcement learning. We will consider the following, slightly modified version of their definition.

Given a real vector space $Y$ , we will use $P S D (Y)$ to denote the set of positive semidefinite bilinear forms on $Y$ and $P D (Y)$ to denote the set of positive definite bilinear forms on $Y$ . Given a bilinear form $B : Y \times Y \to R$ , we will slightly abuse notation by also regarding it as a linear functional $B : Y \otimes Y \to R$ . Thereby, we have $B (v \otimes w) = B (v, w)$ and $B v^{\otimes 2} = B (v, v)$ . Also, if $Y$ is finite-dimensional and $B$ is non-degenerate, we will denote $B^{- 1} : Y^{*} \times Y^{*} \to R$ the unique bilinear form which satisfies

$\forall v \in Y, α \in Y^{*} : (\forall β \in Y^{*} : B^{- 1} (α, β) = β v) ⟺ (\forall w \in Y : B (v, w) = α w)$

Definition 1

Consider a set $X$ , a real vector space $Y$ , some $F \subseteq {X \to Y}$ and a family ${B_{x} \in P S D (Y)}_{x \in X}$ . Consider also $n \in N$ , a sequence ${x_{k} \in X}_{k \in [n]}$ and $x^{*} \in X$ . $x^{*}$ is said to be $(F, B)$ -dependant on ${x_{k}}$ when, for any $f, ~ f \in F$

$n - 1 \sum k = 0 B_{x_{k}} {(f (x_{k}) - ~ f (x_{k}))}^{\otimes 2} \leq 1 ⟹ B_{x^{*}} {(f (x^{*}) - ~ f (x^{*}))}^{\otimes 2} \leq 1$

Otherwise, $x^{*}$ is said to be $(F, B)$ -independent of ${x_{k}}$ .

Definition 2

Consider a set $X$ , a real vector space $Y$ and some $F \subseteq {X \to Y}$ . The Russo-Van Roy-Osband dimension (RVO dimension) ${dim}_{R V O} F$ is the supremum of the set of $n \in N$ for which there is ${x_{k} \in X}_{k \in [n]}$ and $B$ s.t. for all $m \in [n]$ , $x_{m}$ is $(F, B)$ -independent of ${x_{k}}_{k \in [m]}$ .

We have the following basic bounds on RVO dimension.

Proposition 1

Consider a set $X$ , a real vector space $Y$ and some $F \subseteq {X \to Y}$ . Then

${dim}_{R V O} F \leq | X |$

Proposition 2

Consider a set $X$ , a real vector space $Y$ and some $F \subseteq {X \to Y}$ . Then

${dim}_{R V O} F \leq \frac{| F | (| F | - 1)}{2}$

Another concept we need to formulate the regret bound is the Minkowski–Bouligand dimension.

Definition 3

Consider a set $X$ , a real vector space $Y$ , some $F \subseteq {X \to Y}$ and a family ${B_{x} \in P S D (Y)}_{x \in X}$ . A set $A \subseteq F$ is said to be a $B$ -covering of $F$ when

$\forall f \in F \exists ~ f \in A : sup x \in X B_{x} {(f (x) - ~ f (x))}^{\otimes 2} < 1$

Definition 4

Consider a set $X$ , a real vector space $Y$ , some $F \subseteq {X \to Y}$ and a family ${B_{x} \in P S D (Y)}_{x \in X}$ . The covering number $N (F, B)$ is the infimum of the set of $n \in N$ for which there is a $B$ -covering of $F$ of size $n$ .

Definition 5

Consider a finite set $X$ , a finite-dimensional real vector product space $Y$ and some $F \subseteq {X \to Y}$ . Fix any ${B_{x} \in P D (Y)}_{x \in X}$ . The Minkowski–Bouligand dimension (MB dimension) of $F$ is defined by

${dim}_{M B} F := limsup ϵ \to 0 \frac{ln N (F, ϵ^{- 2} B)}{ln \frac{1}{ϵ}}$

It is easy to see the above is indeed well-defined, i.e. doesn’t depend on the choice of $B$ . This is because given any $B, ~ B$ , there are constants $c_{1}, c_{2} \in R^{+}$ s.t. for all $F$ and $ϵ$

$c_{1} N (F, ϵ^{- 2} B) \leq N (F, ϵ^{- 2} ~ B) \leq c_{2} N (F, ϵ^{- 2} B)$

Similarly, for any ${B_{x} \in P S D (Y)}_{x \in X}$ , we have

${dim}_{M B} F \geq limsup ϵ \to 0 \frac{ln N (F, ϵ^{- 2} B)}{ln \frac{1}{ϵ}}$

For finite $F$ and $ϵ ≪ 1$ , it’s obvious that $N (F, ϵ^{- 2} B) \leq | F |$ , and in particular ${dim}_{M B} F = 0$ . It is also possible to show that, for any bounded $F$ , ${dim}_{M B} F \leq | X | dim Y$ .

Note that, in general, MB dimension is fractional.

We will need yet another (but rather simple) notion of “dimension”.

Definition 6

Consider a set $X$ , a vector space $Y$ and some $F \subseteq {X \to Y}$ . The local dimension of $F$ is defined by

${dim}_{l o c} F := max x \in X dim span {f (x) | f \in F}$

Obviously ${dim}_{l o c} F \leq | F |$ and ${dim}_{l o c} F \leq dim Y$ .

Consider finite non-empty sets $S$ (states) and $A$ (actions). Observe that $Δ S$ can be regarded as a subset of the vector space $R^{S}$ . This allows us to speak of the RVO, MB and local dimensions of a hypothesis class of transition kernels $H \subseteq {S \times A k \to S}$ (in this case $X = S \times A$ and $Y = R^{S}$ ).

We can now formulate the regret bound.

Theorem 1

There is some $C \in R^{+}$ s.t. the following holds.

Consider any finite non-empty sets $S$ and $A$ , $R : S \to [0, 1]$ , closed set $H \subseteq {S \times A k \to S}$ and Borel probability measure $ζ$ on $H$ (prior). We define the maximal bias span $τ_{ζ R}$ by

$τ_{ζ R} := limsup γ \to 1 max T \in H \frac{{max}_{s \in S} V_{T R} (s, γ) - {min}_{s \in S} V_{T R} (s, γ)}{1 - γ}$

Denote $D_{l o c} := {dim}_{l o c} H$ , $D_{M B} := {dim}_{M B} H$ and $D_{R V O} := {dim}_{R V O} H$ . Then, there is a family of policies ${π_{γ}^{†} : S^{*} \times S k \to A}_{γ \in (0, 1)}^{*}$ s.t.

$limsup γ \to 1 \frac{E T \sim ζ [{E U}_{T R}^{*} (γ) - {E U}_{T R}^{π_{γ}^{†}} (γ)]}{τ_{ζ R} D_{l o c} \sqrt{(D_{M B} + 1) D_{R V O} (1 - γ) ln \frac{1}{1 - γ}}} \leq C$

Here (like in previous essays), $V_{T R} (s, γ)$ is the value function for transition kernel $T$ , reward function $R$ , state $s$ and geometric time discount parameter $γ$ ; ${E U}_{T R}^{π} (γ)$ is the expected utility for policy $π$ ; ${E U}_{T R}^{*} (γ)$ is the maximal expected utility over policies. The expression in the numerator is, thereby, the Bayesian regret.

The implicit condition $τ_{ζ R} < \infty$ implies that for any $T \in H$ and $s, s^{'} \in S$ , $V_{T R}^{0} (s) = V_{T R}^{0} (s^{'})$ . This is a no-traps condition stronger than the condition $A_{T R}^{0} (s) = A$ we used before: not only that long-term value cannot be lost in expectation, it cannot be lost at all. For any finite $H$ satisfying this no-traps condition, we have

$τ_{ζ R} = max T \in H (max s \in S V_{T R}^{1} (s) - min s \in S V_{T R}^{1} (s)) < \infty$

A few directions for improving on this result:

It is not hard to see from the proof that it is also possible to write down a concrete bound for fixed $γ$ (rather than considering the $γ \to 1$ limit), but its form is somewhat convoluted.
It is probably possible to get an anytime policy with this form of regret, using PSRL with dynamic episode duration.
It is interesting to try to make do with the weaker no-traps condition $A_{T R}^{0} (s) = A$ , especially since a stronger no-traps conditions would translate to a stronger condition on the advisor in DRL.
It is interesting to study RVO dimension in more detail. For example, I’m not even sure whether Proposition 2 is the best possible bound in terms of $| F |$ or it’s e.g. possible to get a linear bound.
It seems tempting to generalize local dimension by allowing the values of $f (x)$ to lie on some nonlinear manifold of given dimension for any given $x$ . This approach, if workable, might require a substantially more difficult proof.
The “cellular decision processes” discussed previously in “Proposition 3″ have exponentially high local dimension, meaning that this regret bound is ineffective for them. We can consider a variant in which, on every time step, only one cell or a very small number of cells (possibly chosen randomly) change. This would have low local dimension. One way to interpret it is, as a continuous time process in which each cell has a certain rate of changing its state. EDIT (2019-07-06): It is actually possible to use Theorem 1 to get an effective regret bound for CDPs, by replacing the CDP with a different but equivalent MDP where each time step is replaced by a series of time steps in which the cells are “decided” one by one. It would be interesting to (i) have an explicit expression for the best possible bound among those generated by such “isomorphisms” for general MDPs (ii) derive bounds on RVO dimension for stochastic analogues of CDPs (possibly defined using Markov random fields).

Proofs

Proof of Proposition 1

Consider some ${x_{k} \in X}_{k \in [m]}$ and $x^{*} \in X$ which is $(F, B)$ -independent of ${x_{k}}$ . Then, there are some $f, ~ f \in F$ s.t. the following two inequalities hold

$m - 1 \sum k = 0 B_{x_{k}} {(f (x_{k}) - ~ f (x_{k}))}^{\otimes 2} \leq 1$

$B_{x^{*}} {(f (x^{*}) - ~ f (x^{*}))}^{\otimes 2} > 1$

The first inequality implies that, for any $k \in [m]$ , $B_{x_{k}} {(f (x_{k}) - ~ f (x_{k}))}^{\otimes 2} \leq 1$ . Comparing with the second inequality, we conclude that $x^{*} \neq x_{k}$ .

Now suppose that ${x_{k} \in X}_{k \in [n]}$ is s.t. for all $m \in [n]$ , $x_{m}$ is $(F, B)$ -independent of ${x_{k}}_{k \in [m]}$ . By the previous paragraph, it follows that $x_{m} \neq x_{k}$ for any $m, k \in [n]$ s.t. $k < m$ . Hence, $n \leq | X |$ , Q.E.D.

Proof of Proposition 2

Suppose that ${x_{k} \in X}_{k \in [n]}$ is s.t. for all $m \in [n]$ , $x_{m}$ is $(F, B)$ -independent of ${x_{k}}_{k \in [m]}$ . Then, for each $m \in [n]$ , we can choose $f_{m}, {~ f}_{m} \in F$ s.t. the following two inequalities hold

$m - 1 \sum k = 0 B_{x_{k}} {(f_{m} (x_{k}) - {~ f}_{m} (x_{k}))}_{m}^{\otimes 2} \leq 1$

$B_{x_{m}} {(f_{m} (x_{m}) - {~ f}_{m} (x_{m}))}_{m}^{\otimes 2} > 1$

In particular, the second inequality implies that $f_{m} \neq {~ f}_{m}$ .

Now, consider some $k \in [m]$ . By the first inequality

$B_{x_{k}} {(f_{m} (x_{k}) - {~ f}_{m} (x_{k}))}_{m}^{\otimes 2} \leq 1$

On the other hand, the second inequality with $k$ instead of $m$ gives

$B_{x_{k}} {(f_{k} (x_{k}) - {~ f}_{k} (x_{k}))}_{k}^{\otimes 2} > 1$

It follows that ${f_{m}, {~ f}_{m}} \neq {f_{k}, {~ f}_{k}}$ . Therefore, $n$ cannot exceed the number of unordered pairs of distinct elements in $F$ , Q.E.D.

Given a measurable space $X$ , some $μ, ν \in Δ X$ and a measurable function $f : X \to R$ , we will use the notation

$E x \sim μ - ν [f (x)] := E x \sim μ [f (x)] - E x \sim ν [f (x)]$

Given $X, Y$ measurable spaces, some $K, L : X k \to Y$ and a measurable function $f : Y \to R$ , we will use the notation

$E y \sim K [f (y) | x] := E y \sim K (x) [f (y)]$

$E y \sim K - L [f (y) | x] := E y \sim K (x) - L (x) [f (y)]$

Given finite sets $S, A$ , some $T : S \times A k \to S$ and $π : S \to A$ , we will use the notation $T π : S k \to S$ defined by

$T π (s) := T (s, π (s))$

Proposition A.1

In the setting of Theorem 1, fix $T \in N^{+}$ and $γ \in (0, 1)$ . Let $Π : H \times S \to A$ be a Borel measurable mapping s.t. $Π (T)$ is an optimal policy, i.e.

${E U}_{T R}^{Π (T)} (γ) = {E U}_{T R}^{*} (γ)$

Let $π_{ζ Π T}^{PS} : S^{*} \times S k \to A$ be the policy implemented by a PSRL algorithm with prior $ζ$ and episode length $T$ , s.t. whenever hypothesis $T$ is sampled, policy $Π (T)$ is followed. Denote the Bayesian regret by

$R (γ) := E T \sim ζ [{E U}_{T R}^{*} (γ) - {E U}_{T R}^{π_{ζ Π T}^{PS}} (γ)]$

Let $(Ω, P)$ be a probability space governing both the uncertainty about the true hypothesis, the stochastic behavior of the environment and the random sampling inside the algorithm. [See the proof of “Lemma 1” in this previous essay or the proof of “Theorem 1″ in another previous essay.] Furthermore, let $H_{*} : Ω \to H$ be a random variable representing the true hypothesis, ${H_{n} : Ω \to H}_{n \in N}$ be the random variables s.t. $H_{n}$ represents the hypothesis sampled at time $n$ (i.e. during episode number $⌊ n / T ⌋$ ), ${Θ_{n} : Ω \to S}_{n \in N}$ be random variables s.t. $Θ_{n}$ represents the state at time $n$ and ${A_{n} : Ω \to A}_{n \in N}$ be random variables s.t. $A_{n}$ represents the action taken at time $n$ . Denote $E_{l} := H_{*} Π (H_{l T})$ and ${¯ E}_{l} := H_{l T} Π (H_{*})$ . Then

$R (γ) = \infty \sum n = 0 γ^{n + 1} E [E H_{n} - H_{*} [V_{H_{n} R} (γ) | Θ_{n}, A_{n}]] + \infty \sum l = 0 γ^{(l + 1) T} E [E {¯ E}_{l}^{T} - E_{l}^{T} [V_{H_{*} R} (γ) | Θ_{l T}]]$

Here, $E_{l}^{T}$ means raising $E_{l}$ to the $T$ -th power w.r.t. Markov kernel composition, and the same for ${¯ E}_{l}^{T}$ .

We will use the notation $V_{T π R} (s, γ)$ to stand for the expected utility achieved by policy $π$ when starting from state $s$ with transition kernel $T$ , reward function $R$ and time discount parameter $γ$ .

Proof of Proposition A.1

For any $n \in N$ , define $Π_{n} : H \times H \times S^{*} \times S k \to A$ as follows.

$Π_{n} (T_{1}, T_{2}, h, s) := {\begin{matrix} Π (T_{1}, s) if | h | < n Π (T_{2}, s) if | h | \geq n \end{matrix}$

That is, $Π_{n} (T_{1}, T_{2})$ is a policy that follows $Π (T_{1})$ for time $n$ and $Π (T_{2})$ afterwards.

In the following, we use the shorthand notation

$V_{*} (s) := V_{H_{*} R} (s, γ)$

$V_{l} (s) := V_{H_{l T} R} (s, γ)$

$V_{l k} (s) := V_{H_{*} Π_{T - k} (H_{l T}, H_{*}) R} (s, γ)$

It is easy to see that

$R (γ) = \infty \sum l = 0 γ^{l T} E [V_{*} (Θ_{l T}) - V_{l 0} (Θ_{l T})]$

The above is essentially what appeared as “Proposition B.1” before, for the special case of PSRL, and where we regard every episode as a single time step in a new MDP in which every action is a policy for the duration of an episode in the original MDP.

By definition, $H_{l T}$ and $H_{*}$ have the same distribution even when conditioned by the history up to $l T$ . Therefore

$E [V_{*} (Θ_{l T})] = E [V_{l} (Θ_{l T})]$

It follows that

$R (γ) = \infty \sum l = 0 γ^{l T} E [V_{l} (Θ_{l T}) - V_{l 0} (Θ_{l T})]$

We now prove by induction on $k \in [T + 1]$ that

$E [V_{l} (Θ_{l T}) - V_{l 0} (Θ_{l T})] = l T + k - 1 \sum n = l T γ^{n - l T + 1} E [E H_{n} - H_{*} [V_{l} | Θ_{n}, A_{n}]] + γ^{k} E [V_{l} (Θ_{l T + k}) - V_{l k} (Θ_{l T + k})]$

For $k = 0$ this is a tautology. For any $k \in [T]$ , the Bellman equation says that

$V_{l} (s) = (1 - γ) R (s) + γ E H_{l T} Π (H_{l T}) [V_{l} | s]$

$V_{l k} (s) = (1 - γ) R (s) + γ E H_{*} Π (H_{l T}) [V_{l, k + 1} | s]$

It follows that

$E [V_{l} (Θ_{l T + k}) - V_{l k} (Θ_{l T + k})] = γ E [E H_{l T} Π (H_{l T}) [V_{l} | Θ_{l T + k}] - E H_{*} Π (H_{l T}) [V_{l, k + 1} | Θ_{l T + k}]]$

Since $Π (H_{l T})$ is exactly the policy followed by PSRL at time $l T + k$ , we get

$E [V_{l} (Θ_{l T + k}) - V_{l k} (Θ_{l T + k})] = γ E [E H_{l T} [V_{l} | Θ_{l T + k}, A_{l T + k}] - E H_{*} [V_{l, k + 1} | Θ_{l T + k}, A_{l T + k}]]$

We now subtract and add $E H_{*} [V_{l} | Θ_{l T + k}, A_{l T + k}]$ , and use the fact that $H_{*} (Θ_{l T + k}, A_{l T + k})$ is the conditional distribution of $Θ_{l T + k + 1}$ .

$E [V_{l} (Θ_{l T + k}) - V_{l k} (Θ_{l T + k})] = γ E [E H_{l T} - H_{*} [V_{l} | Θ_{l T + k}, A_{l T + k}] + V_{l} (Θ_{l T + k + 1}) - V_{l, k + 1} (Θ_{l T + k + 1})]$

Applying this identity to the second second term on the right hand side of the induction hypothesis, we prove the induction step. For $k = T$ , we get, denoting $n_{0} := l T$ and $n_{1} := (l + 1) T$

$E [V_{l} (Θ_{n_{0}}) - V_{l 0} (Θ_{n_{0}})] = n_{1} - 1 \sum n = n_{0} γ^{n - n_{0} + 1} E [E H_{n} - H_{*} [V_{l} | Θ_{n}, A_{n}]] + γ^{T} E [V_{l} (Θ_{n_{1}}) - V_{*} (Θ_{n_{1}})]$

Clearly

$E [V_{l} (Θ_{n_{1}})] = E [E E_{l}^{T} [V_{l} ∣ ∣ Θ_{n_{0}}]]$

$E [V_{*} (Θ_{n_{1}})] = E [E E_{l}^{T} [V_{*} ∣ ∣ Θ_{n_{0}}]]$

Using the definition of PSRL, we can exchange and true and sampled hypothesis and get

$E [V_{l} (Θ_{n_{1}})] = E [E {¯ E}_{l}^{T} [V_{*} ∣ ∣ Θ_{n_{0}}]]$

It follows that

$E [V_{l} (Θ_{n_{0}}) - V_{l 0} (Θ_{n_{0}})] = n_{1} - 1 \sum n = n_{0} γ^{n - n_{0} + 1} E [E H_{n} - H_{*} [V_{l} | Θ_{n}, A_{n}]] + γ^{T} E [E {¯ E}_{l}^{T} - E_{l}^{T} [V_{*} ∣ ∣ Θ_{n_{0}}]]$

Applying this to each term in the earlier expression for $R (γ)$ , we get the desired result. Q.E.D.

Proposition A.2

Consider a real finite-dimensional normed vector space $Y$ and a linear subspace $Z \subseteq Y$ . Then, there exists $B \in P S D (Y)$ s.t.

For any $v \in Y$ , $B v^{\otimes 2} \leq {∥ v ∥}^{2}$
For any $v \in Z$ , ${(dim Z)}^{2} B v^{\otimes 2} \geq {∥ v ∥}^{2}$

Proof of Proposition A.2

We assume $dim Z > 0$ since otherwise the claim is trivial (take $B = 0$ ).

By Theorem B.1 (see Appendix), there is $~ B \in P S D (Z)$ s.t. for any $v \in Z$

$~ B v^{\otimes 2} \leq {∥ v ∥}^{2} \leq dim Z \cdot ~ B v^{\otimes 2}$

By Corollary B.1 (see Appendix), there is a projection operator $P : Y \to Y$ s.t. $im Y = Z$ and $∥ P ∥ \leq \sqrt{dim Z}$ . We define $B \in P S D (Y)$ by

$B (v, w) := \frac{~ B (P v, P w)}{dim Z}$

For any $v \in Y$ , we have

$B v^{\otimes 2} = \frac{~ B (P v, P v)}{dim Z} \leq \frac{{∥ P v ∥}^{2}}{dim Z} \leq \frac{{∥ P ∥}^{2} {∥ v ∥}^{2}}{dim Z} \leq {∥ v ∥}^{2}$

For any $v \in Z$ , we have $P v = v$ and therefore

${∥ v ∥}^{2} \leq dim Z \cdot ~ B v^{\otimes 2} = {(dim Z)}^{2} \frac{~ B (P v, P v)}{dim Z} = {(dim Z)}^{2} B v^{\otimes 2}$

Q.E.D.

Proposition A.3

Consider a finite-dimensional real vector space $Y$ , some $B \in P D (Y)$ and a Borel probability measure $μ \in Δ Y$ s.t. $Pr y \sim μ [B y^{\otimes 2} \leq 1] = 1$ . Let $y_{0} := E y \sim μ [y]$ and $σ := 2 \sqrt{2}$ . Then, $μ$ is $σ$ -sub-Gaussian w.r.t. $B$ , i.e., for any $α \in Y^{*}$

$E y \sim μ [exp (α (y - y_{0}))] \leq exp (\frac{σ^{2} B^{- 1} α^{\otimes 2}}{2})$

Proof of Proposition A.3

By isomorphism, it is sufficient to consider the case $Y = R^{n}$ , $B (v, w) = v \cdot w$ , $α (v) = t v_{0}$ for some $n \in N$ and $t \in [0, \infty)$ . For this form of $α$ it is sufficient to consider the case $n = 1$ . It remains to show that

$E y \sim μ [e^{t (y - y_{0})}] \leq e^{\frac{1}{2} σ^{2} t^{2}} = e^{4 t^{2}}$

Here, we assume that $Pr y \sim μ [| y | \leq 1] = 1$ .

We consider separately the cases $t \geq \frac{1}{2}$ and $t < \frac{1}{2}$ . In the first case

$E y \sim μ [e^{t (y - y_{0})}] \leq max y \in [- 1, + 1] e^{t (y - y_{0})} \leq e^{2 t} \leq e^{(2 t)^{2}} = e^{4 t^{2}}$

In the second case, we use that for any $x \in (- \infty, + 1]$

$e^{x} < 1 + x + \frac{e}{2} x^{2}$

The above holds because, at $x = 0$ the left hand side and the right hand have the same value and first derivative, and for any $x \in (- \infty, + 1)$ , the second derivative of the left hand side is less than the second derivative of the right hand side. We get

$E y \sim μ [e^{t (y - y_{0})}] \leq E y \sim μ [1 + t (y - y_{0}) + \frac{e}{2} t^{2} {(y - y_{0})}_{0}^{2}] = 1 + \frac{e}{2} t^{2} Var y \sim μ [y] \leq 1 + \frac{e}{2} t^{2} \leq e^{\frac{e}{2} t^{2}} \leq e^{4 t^{2}}$

Q.E.D.

Definition A.1

Consider a set $X$ , a finite-dimensional real vector space $Y$ , some $F \subseteq {X \to Y}$ and a family ${B_{x} \in P S D (Y)}_{x \in X}$ . Assume $F$ is compact w.r.t. the product topology on $X \to Y ≅ \prod_{x \in X} Y$ . Consider also some $n \in N$ , $x \in X^{n}$ and $y \in Y^{n}$ . We then use the notation

${L S}^{F} [x y, B] := a r g m i n f \in F n - 1 \sum m = 0 B_{x_{m}} {(f (x_{m}) - y_{m})}_{m}^{\otimes 2}$

${C S}^{F} [x y, B] := {f \in F ∣ ∣ ∣ ∣ n - 1 \sum m = 0 B_{x_{m}} {(f (x_{m}) - {L S}^{F} [x y, B] (x_{m}))}^{\otimes 2} \leq 1}$

I chose the notation $L S$ as an abbreviation of “least squares” and $C S$ as an abbreviation of “confidence set”. Note that $L S$ is somewhat ambiguous (and therefore, so is $C S$ ) since there might be multiple minima, but this will not be important in the following (i.e. an arbitrary minimum can be chosen).

Proposition A.4

There is some $C_{A .4} \in R^{+}$ s.t. the following holds.

Consider finite sets $X, S$ , some $F \subseteq {X k \to S}$ and a family ${B_{x} \in P S D (R^{S})}_{x \in X}$ . Assume that for any $x \in X$ and $φ \in Δ S$ , $B_{x} φ^{\otimes 2} \leq 1$ . Let ${H_{n} \subseteq P (X^{ω} \times S^{ω})}_{n \in N}$ be the canonical filtration, i.e.

$H_{n} := {A^{'} \subseteq X^{ω} \times S^{ω} ∣ ∣ A^{'} = {x s | {x s}_{: n} \in A}, A \subseteq X^{n} \times S^{n}}$

Consider also $f^{*} \in F$ and $μ \in Δ (X^{ω} \times S^{ω})$ s.t. for any $n \in N$ , $x \in X$ , and $s \in S$

$Pr x s \sim μ [s_{n} = s | x_{n} = x, H_{n}] = f^{*} (s ∣ x)$

Fix $ϵ \in R^{+}$ , $δ \in (0, 1)$ . Denote

$β (t) := C_{A .4} (ln \frac{N (F, ϵ^{- 2} B)}{δ} + ϵ t ln \frac{e t}{δ})$

Then,

$Pr x s \sim μ [f^{*} \notin \infty ⋂ n = 0 {C S}^{F} [{x s}_{: n}, β (n + 1)^{- 1} B]] \leq δ$

Proof of Proposition A.4

For each $x \in X$ , choose a finite set $E_{x}$ and a linear mapping $P_{x} : R^{S} \to R^{E_{x}}$ s.t. $B_{x} (v, w) = P_{x} (v) \cdot P_{x} (w)$ . Let $~ Y := ⨁_{x \in X} R^{E_{x}}$ . Define $~ F \subseteq {X \to ~ Y}$ by

$~ F := {~ f : X \to ~ Y ∣ ∣ ~ f (x) = P_{x} (f (x)), f \in F}$

Define $P^{ω} : X^{ω} \times S^{ω} \to X^{ω} \times {~ Y}^{ω}$ by

$P^{ω} (x s)_{n} := x_{n} P_{x_{n}} (s_{n})$

Let ${~ f}^{*} := P^{ω} (f^{*})$ and $~ μ := P_{*}^{ω} μ$ . By Proposition A.3, $~ μ$ is $2 \sqrt{2}$ -sub-Gaussian. Applying Proposition B.1 (see Appendix) to the tilde objects gives the desired result. Here, we choose the constant $C_{A.4}$ s.t. the term $M = 1$ in $β$ as defined in Proposition B.1 is absorbed by the term $σ ln \frac{e t}{δ} \geq 1$ (note that $t \geq 1$ since we substitute $t = n + 1$ , $δ < 1$ and $σ = 2 \sqrt{2} > 1$ ). Q.E.D.

Definition A.2

Consider a set $X$ , some $x \in X$ , a real vector space $Y$ , some $F \subseteq {X \to Y}$ and a family ${B_{x} \in P S D (Y)}_{x \in X}$ . The $B$ -width of $F$ at $x$ is defined by

$W^{F} (x, B) := sup f, ~ f \in F \sqrt{B_{x} {(f (x) - ~ f (x))}^{\otimes 2}}$

Proposition A.5

There is some $C_{A .5} \in R^{+}$ s.t. the following holds.

Consider a set $X$ , a real vector space $Y$ , some $F \subseteq {X \to Y}$ and a family ${B_{x} \in P S D (Y)}_{x \in X}$ . Consider also some $x \in X^{ω}$ , $y \in Y^{ω}$ , $T \in N^{+}$ , $γ \in (0, 1)$ , $θ \in R^{+}$ , $η_{0}, η_{1} \in R^{+}$ and $δ \in (0, 1)$ . Denote

$β (t) := η_{0} + η_{1} t ln \frac{e t}{δ}$

For any $n \in N$ , define $F_{n}$ by

$F_{n} := {C S}^{F} [{x y}_{: n}, β (n + 1)^{- 1} B]$

Assume that $γ^{T} > \frac{1}{2}$ . Then,

$\infty \sum l = 0 T - 1 \sum m = 0 γ^{l T + m} [[W^{F_{l T}} (x_{l T + m}, B) > θ]] \leq C_{A .5} {dim}_{R V O} F \cdot (θ^{- 2} β (\frac{1}{1 - γ}) + T)$

Proof of Proposition A.5

For each $x \in X$ , choose a finite set $E_{x}$ and a linear mapping $P_{x} : Y \to R^{E_{x}}$ s.t. $B_{x} (v, w) = P_{x} (v) \cdot P_{x} (w)$ . Let $~ Y := ⨁_{x \in X} R^{E_{x}}$ . Define $~ F \subseteq {X \to ~ Y}$ by

$~ F := {~ f : X \to ~ Y ∣ ∣ ~ f (x) = P_{x} (f (x)), f \in F}$

Define also $~ y$ by

${~ y}_{n} = P_{x_{n}} (y_{n})$

Applying Proposition B.2 (see Appendix) to the tilde objects, we conclude that for any $N \in N$

$N - 1 \sum l = 0 T - 1 \sum m = 0 [[W^{F_{l T}} (x_{l T + m}, B) > θ]] \leq {dim}_{R V O} F \cdot (4 θ^{- 2} β (N T) + T)$

Multiplying the inequality by $γ^{N T}$ and summing over $N$ , we get

$\infty \sum l = 0 (\infty \sum N = l + 1 γ^{N T}) T - 1 \sum m = 0 [[W^{F_{l T}} (x_{l T + m}, B) > θ]] \leq {dim}_{R V O} F \infty \sum N = 0 γ^{N T} (4 θ^{- 2} β (N T) + T)$

On the left hand side, we sum the geometric series. On the right hand side, we use the observation that

$γ^{N T} (4 θ^{- 2} β (N T) + T) \leq \frac{1}{T} \int_{0}^{T} γ^{N T + t - T} (4 θ^{- 2} β (N T + t) + T) d t$

$γ^{N T} (4 θ^{- 2} β (N T) + T) \leq \frac{1}{T γ^{T}} \int_{0}^{T} γ^{N T + t} (4 θ^{- 2} β (N T + t) + T) d t$

Here, we used that $β (t)$ is an increasing function for $t \geq 1$ and $β (0) = 0$ . We get

$\infty \sum l = 0 \frac{γ^{(l + 1) T}}{1 - γ^{T}} T - 1 \sum m = 0 [[W^{F_{l T}} (x_{l T + m}, B) > θ]] \leq \frac{{dim}_{R V O} F}{T γ^{T}} \int_{0}^{\infty} γ^{t} (4 θ^{- 2} β (t) + T) d t$

The integral on the right hand side is a Laplace transform (where the dual variable is $s = ln \frac{1}{γ}$ ). The functions $f (t) = 1$ , $f (t) = t$ and $f (t) = t ln t$ all have the property that, for any $s \in R^{+}$

$L [f] (s) \leq \frac{1}{s} f (\frac{1}{s})$

Indeed, we have

$L [1] (s) = \frac{1}{s} \leq \frac{1}{s}$

$L [t] (s) = \frac{1}{s^{2}} \leq \frac{1}{s} \cdot \frac{1}{s}$

$L [t ln t] (s) = - \frac{ln s + γ_{EM} - 1}{s^{2}} = \frac{1}{s} \cdot \frac{1}{s} (ln \frac{1}{s} + 1 - γ_{EM}) \leq \frac{1}{s} \cdot \frac{1}{s} ln \frac{1}{s}$

Here, $γ_{EM}$ is the Euler-Mascheroni constant.

We conclude

$\frac{γ^{T}}{1 - γ^{T}} \infty \sum l = 0 γ^{l T} T - 1 \sum m = 0 [[W^{F_{l T}} (x_{l T + m}, B) > θ]] \leq \frac{{dim}_{R V O} F}{T γ^{T}} \cdot \frac{1}{ln \frac{1}{γ}} ⎛ ⎝ 4 θ^{- 2} β ⎛ ⎝ \frac{1}{ln \frac{1}{γ}} ⎞ ⎠ + T ⎞ ⎠$

Using the condition $γ^{T} > \frac{1}{2}$ and absorbing $O (1)$ factors into the definition of $C_{A.5}$ , we get

$\infty \sum l = 0 γ^{l T} T - 1 \sum m = 0 [[W^{F_{l T}} (x_{l T + m}, B) > θ]] \leq C_{A.5} {dim}_{R V O} F \cdot (θ^{- 2} β (\frac{1}{1 - γ}) + T)$

Since $γ^{m} \leq 1$ , this implies

$\infty \sum l = 0 T - 1 \sum m = 0 γ^{l T + m} [[W^{F_{l T}} (x_{l T + m}, B) > θ]] \leq C_{A .5} {dim}_{R V O} F \cdot (θ^{- 2} β (\frac{1}{1 - γ}) + T)$

Q.E.D.

Proposition A.6

There is some $C_{A .6} \in R^{+}$ s.t. the following holds.

Consider a set $X$ , a real vector space $Y$ , some $F \subseteq {X \to Y}$ and a family ${B_{x} \in P S D (Y)}_{x \in X}$ . Assume that for any $x \in X$ and $f \in F$ , $B_{x} {f (x)}^{\otimes 2} \leq 1$ . Consider also some $x \in X^{ω}$ , $y \in Y^{ω}$ , $T \in N^{+}$ , $γ \in (0, 1)$ , $η_{0}, η_{1} \in R^{+}$ and $δ \in (0, 1)$ . Define $β$ and $F_{n}$ the same way as in Proposition A.5. Denote $D := {dim}_{R V O} F$ . Assume $D > 0$ and $γ^{T} > \frac{1}{2}$ . Then,

$\infty \sum l = 0 T - 1 \sum m = 0 γ^{l T + m} W^{F_{l T}} (x_{l T + m}, B) \leq C_{A .6} (D T + \sqrt{D β (\frac{1}{1 - γ}) \frac{1}{1 - γ}})$

Proof of Proposition A.6

Due to the assumption $B_{x} f (x)^{\otimes 2} \leq 1$ , we have $W^{F_{n}} (x, B) \leq 2$ . For any $t \in R^{+}$ , we have

$W^{F_{l T}} (x_{l T + m}, B) = \int_{0}^{2} [[W^{F_{l T}} (x_{l T + m}, B) > θ]] d θ \leq t + \int_{t}^{2} [[W^{F_{l T}} (x_{l T + m}, B) > θ]] d θ$

$\infty \sum l = 0 T - 1 \sum m = 0 γ^{l T + m} W^{F_{l T}} (x_{l T + m}, B) \leq \frac{t}{1 - γ} + \int_{t}^{2} \infty \sum l = 0 T - 1 \sum m = 0 γ^{l T + m} [[W^{F_{l T}} (x_{l T + m}, B) > θ]] d θ$

Applying Proposition A.5 to the right hand side

$\infty \sum l = 0 T - 1 \sum m = 0 γ^{l T + m} W^{F_{l T}} (x_{l T + m}, B) \leq \frac{t}{1 - γ} + C_{A.5} D \int_{t}^{2} (θ^{- 2} β (\frac{1}{1 - γ}) + T) d θ$

Evaluating the integral and dropping some negative terms on the right hand side, we get

$\infty \sum l = 0 T - 1 \sum m = 0 γ^{l T + m} W^{F_{l T}} (x_{l T + m}, B) \leq \frac{t}{1 - γ} + C_{A.5} D (\frac{1}{t} β (\frac{1}{1 - γ}) + 2 T)$

We now set $t$ to be

$t := \sqrt{D β (\frac{1}{1 - γ}) \cdot (1 - γ)}$

For an appropriate choice of $C_{A.6}$ , it follows that

$\infty \sum l = 0 T - 1 \sum m = 0 γ^{l T + m} W^{F_{l T}} (x_{l T + m}, B) \leq C_{A.6} (D T + \sqrt{D β (\frac{1}{1 - γ}) \frac{1}{1 - γ}})$

Q.E.D.

Proof of Theorem 1

We take $π_{γ}^{†} := π_{ζ Π T}^{PS}$ where $Π$ is as in Proposition A.1 and $T \in N^{+}$ will be specified later. Denote the Bayesian regret by

$R (γ) := E T \sim ζ [{E U}_{T R}^{*} (γ) - {E U}_{T R}^{π_{γ}^{†}} (γ)]$

By Proposition A.1

$R (γ) = \infty \sum n = 0 γ^{n + 1} E [E H_{n} - H_{*} [V_{H_{n} R} (γ) | Θ_{n}, A_{n}]] + \infty \sum l = 0 γ^{(l + 1) T} E [E {¯ E}_{l}^{T} - E_{l}^{T} [V_{H_{*} R} (γ) | Θ_{l T}]]$

$R (γ) \leq \infty \sum n = 0 γ^{n} E [∣ ∣ ∣ E H_{n} - H_{*} [V_{H_{n} R} (γ) | Θ_{n}, A_{n}] ∣ ∣ ∣] + \infty \sum l = 0 γ^{l T} E [∣ ∣ ∣ ∣ E {¯ E}_{l}^{T} - E_{l}^{T} [V_{H_{*} R} (γ) | Θ_{l T}] ∣ ∣ ∣ ∣]$

We will use the notation

$Δ V (γ) := max T \in H (max s \in S V_{T R} (s, γ) - min s \in S V_{T R} (s, γ))$

It follows that

$R (γ) \leq Δ V (γ) (\infty \sum n = 0 γ^{n} E [d_{tv} (H_{n} (Θ_{n}, A_{n}), H_{*} (Θ_{n}, A_{n}))] + \infty \sum l = 0 γ^{l T} E [d_{tv} (E_{l T}^{T} (Θ_{l T}), {¯ E}_{l T}^{T} (Θ_{l T}))])$

$R (γ) \leq Δ V (γ) (\infty \sum n = 0 γ^{n} E [d_{tv} (H_{n} (Θ_{n}, A_{n}), H_{*} (Θ_{n}, A_{n}))] + \frac{1}{1 - γ^{T}})$

Consider $Y = R^{S}$ equipped with the $L^{1}$ norm. Given $s \in S$ and $a \in A$ , consider also the subspace

$W_{s a} := span {T (s, a) | T \in H}$

By Proposition A.2, there is $B_{s a} \in P S D (Y)$ s.t.

For any $φ \in Δ S$

$B_{s a} φ^{\otimes 2} \leq 1$

For any $T, ~ T \in H$

$\frac{1}{4} D_{l o c}^{2} B_{s a} {(T (s, a) - ~ T (s, a))}^{\otimes 2} \geq d_{tv} {(T (s, a), ~ T (s, a))}^{2}$

We now apply Proposition A.4 with $δ := \frac{1}{2} (1 - γ)^{2}$ and $ϵ := (1 - γ)^{2}$ . We get

$Pr [H_{*} \in \infty ⋂ n = 0 {C S}^{H} [Θ A_{: n} Θ_{n}, β (n + 1)^{- 1} B]] \geq 1 - \frac{1}{2} (1 - γ)^{2}$

Here, $β$ was defined in Proposition A.4.

Since the hypothesis $H_{n}$ is sampled from the posterior, for any $l \in N$ we also have

$Pr [H_{l T} \in {C S}^{H} [Θ A_{: l T} Θ_{l T}, β (l T + 1)^{- 1} B]] \geq 1 - \frac{1}{2} (1 - γ)^{2}$

$Pr [H_{*}, H_{l T} \in {C S}^{H} [Θ A_{: l T} Θ_{l T}, β (l T + 1)^{- 1} B]] \geq 1 - (1 - γ)^{2}$

Denote

$G_{n} := {H_{*}, H_{l T} \in {C S}^{H} [Θ A_{: l T} Θ_{l T}, β (l T + 1)^{- 1} B]} \subseteq Ω$

We get

$R (γ) \leq Δ V (γ) \infty \sum n = 0 γ^{n} (E [d_{tv} (H_{n} (Θ_{n}, A_{n}), H_{*} (Θ_{n}, A_{n})); G_{n}] + (1 - γ)^{2})$

$R (γ) \leq Δ V (γ) (\infty \sum n = 0 γ^{n} E [d_{tv} (H_{n} (Θ_{n}, A_{n}), H_{*} (Θ_{n}, A_{n})); G_{n}] + 1 - γ + \frac{1}{1 - γ^{T}})$

Using property 2 of B

$R (γ) \leq Δ V (γ) (\frac{1}{2} D_{l o c} \infty \sum n = 0 γ^{n} E [\sqrt{B_{Θ_{n} A_{n}} {(H_{n} (Θ_{n}, A_{n}) - H_{*} (Θ_{n}, A_{n}))}_{n}^{\otimes 2}}; G_{n}] + 1 - γ + \frac{1}{1 - γ^{T}})$

Denote

$H_{l} := {C S}^{H} [Θ A_{: l T} Θ_{l T}, β (l T + 1)^{- 1} B]$

Clearly

$Pr [\sqrt{B_{Θ_{n} A_{n}} {(H_{n} (Θ_{n}, A_{n}) - H_{*} (Θ_{n}, A_{n}))}_{n}^{\otimes 2}} \leq W^{H_{⌊ n / T ⌋}} (Θ_{n} A_{n}, B) ∣ ∣ ∣ G_{n}] = 1$

Using this inequality, dropping the $; G_{n}$ (since it can only the right hand side smaller) and moving the sum inside the expected value, we get

$R (γ) \leq Δ V (γ) (\frac{1}{2} D_{l o c} E [\infty \sum n = 0 γ^{n} W^{H_{⌊ n / T ⌋}} (Θ_{n} A_{n}, B)] + 1 - γ + \frac{1}{1 - γ^{T}})$

We will now assume $D_{R V O} > 0$ (if $D_{R V O} = 0$ , then $| H | = 1$ and $R (γ) = 0$ ) and $γ^{T} > \frac{1}{2}$ . Applying Proposition A.6, we conclude

$R (γ) \leq Δ V (γ) (\frac{D_{l o c} C_{A .6}}{2} (D_{R V O} T + \sqrt{D_{R V O} β (\frac{1}{1 - γ}) \frac{1}{1 - γ}}) + 1 - γ + \frac{1}{1 - γ^{T}})$

Denote the second factor on the right hand side $F (γ)$ , so that the inequality becomes

$R (γ) \leq Δ V (γ) F (γ)$

We set

$T := ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ \frac{1}{\sqrt{D_{l o c} D_{R V O} (1 - γ) ln \frac{1}{1 - γ}}} ⎥ ⎥ ⎥ ⎥ ⎥ ⎦$

Now, we analyze the $γ \to 1$ limit. In this limit, the expression for $T$ justifies our assumption that $γ^{T} > \frac{1}{2}$ . Indeed, we have

$lim γ \to 1 ln γ^{T} = lim γ \to 1 T ln γ = lim γ \to 1 \frac{ln γ}{\sqrt{D_{l o c} D_{R V O} (1 - γ) ln \frac{1}{1 - γ}}} = 0$

Using the previous inequality for $R (γ)$ , we get

$limsup γ \to 1 \frac{R (γ)}{\sqrt{(1 - γ) ln \frac{1}{1 - γ}}} \leq limsup γ \to 1 \frac{Δ V (γ) F (γ)}{\sqrt{(1 - γ) ln \frac{1}{1 - γ}}}$

We assume $τ_{ζ R} < \infty$ , otherwise the theorem is vacuous. Multiplying the numerator and denominator on the right hand side by $\sqrt{1 - γ}$ , we get

$limsup γ \to 1 \frac{R (γ)}{\sqrt{(1 - γ) ln \frac{1}{1 - γ}}} \leq τ_{ζ R} limsup γ \to 1 \sqrt{\frac{1 - γ}{ln \frac{1}{1 - γ}}} \cdot F (γ)$

We will now analyze the contribution of each term in $F (γ)$ to the limit on the right hand side (using the fact that $limsup (F_{1} + F_{2}) \leq limsup F_{1} + limsup F_{2}$ ). We ignore multiplicative constants that can be absorbed into $C$ .

The first term gives (using our choice of $T$ )

$limsup γ \to 1 \sqrt{\frac{1 - γ}{ln \frac{1}{1 - γ}}} \cdot \sqrt{\frac{D_{l o c} D_{R V O}}{(1 - γ) ln \frac{1}{1 - γ}}} = 0$

The second term gives (using the definitions of $β$ and our choices of $δ$ and $ϵ$ )

$limsup γ \to 1 \sqrt{\frac{1 - γ}{ln \frac{1}{1 - γ}}} \cdot D_{l o c} \sqrt{D_{R V O} (ln \frac{2 N (H, (1 - γ)^{- 4} B)}{(1 - γ)^{2}} + (1 - γ) ln \frac{2 e}{(1 - γ)^{3}}) \frac{1}{1 - γ}} \leq$

$limsup γ \to 1 \sqrt{\frac{1 - γ}{ln \frac{1}{1 - γ}}} \cdot D_{l o c} \sqrt{\frac{D_{R V O}}{1 - γ}} ⎛ ⎝ \sqrt{ln N (H, (1 - γ)^{- 4} B)} + \sqrt{ln \frac{2}{(1 - γ)}} + \sqrt{(1 - γ) ln \frac{2 e}{(1 - γ)^{3}}} ⎞ ⎠$

We analyze each subterm separately. The first subterm gives

$limsup γ \to 1 \sqrt{\frac{1 - γ}{ln \frac{1}{1 - γ}}} \cdot D_{l o c} \sqrt{\frac{D_{R V O}}{1 - γ} ln N (H, (1 - γ)^{- 4} B)} =$

$D_{l o c} \sqrt{D_{R V O}} limsup γ \to 1 \sqrt{\frac{ln N (H, (1 - γ)^{- 4} B)}{\frac{1}{2} ln \frac{1}{(1 - γ)^{2}}}} \leq$

$D_{l o c} \sqrt{2 D_{R V O} D_{M B}}$

The second subterm gives

$limsup γ \to 1 \sqrt{\frac{1 - γ}{ln \frac{1}{1 - γ}}} \cdot D_{l o c} \sqrt{\frac{D_{R V O}}{1 - γ} ln \frac{2}{1 - γ}} = D_{l o c} \sqrt{D_{R V O}}$

The third subterm gives

$limsup γ \to 1 \sqrt{\frac{1 - γ}{ln \frac{1}{1 - γ}}} \cdot D_{l o c} \sqrt{\frac{D_{R V O}}{1 - γ} (1 - γ) ln \frac{2 e}{(1 - γ)^{3}}} = 0$

The third and fourth terms give

$limsup γ \to 1 \sqrt{\frac{1 - γ}{ln \frac{1}{1 - γ}}} \cdot (1 - γ) = 0$

To analyze the fifth term, observe that ${lim}_{γ \to 1} (1 - γ) T (γ) = 0$ . Hence

$limsup γ \to 1 \sqrt{\frac{1 - γ}{ln \frac{1}{1 - γ}}} \cdot \frac{1}{1 - γ^{T (γ)}} =$

$limsup γ \to 1 \sqrt{\frac{1 - γ}{ln \frac{1}{1 - γ}}} \cdot \frac{1}{(1 - γ) T (γ)} =$

$limsup γ \to 1 \sqrt{\frac{1 - γ}{ln \frac{1}{1 - γ}}} \cdot \frac{\sqrt{D_{l o c} D_{R V O} (1 - γ) ln \frac{1}{1 - γ}}}{(1 - γ)} =$

$\sqrt{D_{l o c} D_{R V O}}$

Putting everything together, we observe that the expression $D_{l o c} \sqrt{D_{R V O} (D_{M B} + 1)}$ dominates all the terms (up to a multiplicative constant). Here, we use that $D_{l o c}$ is an integer and hence $D_{l o c} \geq \sqrt{D_{l o c}}$ . We conclude

$limsup γ \to 1 \frac{R (γ)}{\sqrt{(1 - γ) ln \frac{1}{1 - γ}}} \leq C τ_{ζ R} D_{l o c} \sqrt{D_{R V O} (D_{M B} + 1)}$

Q.E.D.

Appendix

The following was proved in John (1948), available in the collection Giorgi and Kjeldesen (2014) (the theorem in question is on page 214).

Theorem B.1 (John)

Consider a real finite-dimensional normed vector space $Y$ . Then, there exists $B \in P D (Y)$ s.t. for any $v \in Y$

$B v^{\otimes 2} \leq {∥ v ∥}^{2} \leq dim Y \cdot B v^{\otimes 2}$

The following is from Kadets and Snobar (1971).

Theorem B.2 (Kadets-Snobar)

Consider a real Banach space $Y$ and a finite-dimensional linear subspace $Z \subseteq Y$ . Then, for any $ϵ \in R^{+}$ , there is a projection operator $P : Y \to Y$ s.t. $im P = Z$ and $∥ P ∥ < \sqrt{dim Z} + ϵ$ .

Corollary B.1

Consider a real finite-dimensional normed vector space $Y$ and a linear subspace $Z \subseteq Y$ . Then, there is a projection operator $P^{⋆} : Y \to Y$ s.t. $im P^{⋆} = Z$ and $∥ P^{⋆} ∥ \leq \sqrt{dim Z}$ .

Proof of Corollary B.1

Let $E n d (Y)$ be the vector space of linear operators $Y \to Y$ . Define

$P r (Y, Z) := {P \in E n d (Y) ∣ ∣ P^{2} = P, im P = Z}$

$X := {P \in P r (Y, Z) ∣ ∣ ∥ P ∥ \leq \sqrt{dim Z} + 1}$

$P r (Y, Z)$ is an affine subspace of $E n d (Y)$ and $X$ is a compact set. Define $f : X \to R$ by $f (P) := ∥ P ∥$ . $f$ is a continuous function. By Theorem B.2, $inf f \leq \sqrt{dim Z}$ . Since $X$ is compact, $f$ attains its infimum at some $P^{⋆} \in X$ , and we have $f (P^{⋆}) \leq \sqrt{dim Z}$ , Q.E.D.

The next proposition appeared (in slightly greater generality) in Osband and Van Roy as “Proposition 5”. We will use $I_{d} \in P D (R^{d})$ to denote the form corresponding to the identity matrix.

Proposition B.1 (Osband-Van Roy)

There is some $C_{B .1} \in R^{+}$ s.t. the following holds.

Consider a finite set $X$ , the vector space $Y := R^{d}$ for some $d \in N$ and some $F \subseteq {X \to Y}$ . Let ${H_{n} \subseteq P (X^{ω} \times Y^{ω})}_{n \in N}$ be the canonical filtration, i.e.

$H_{n} := {A^{'} \subseteq X^{ω} \times Y^{ω} ∣ ∣ A^{'} = {x y | {x y}_{: n} \in A}, A \subseteq X^{n} \times Y^{n} Borel}$

Consider also $f^{*} \in F$ and $μ \in Δ (X^{ω} \times Y^{ω})$ s.t. for any $n \in N$ and $x \in X$

$E x y \sim μ [y_{n} | x_{n} = x, H_{n}] = f^{*} (x)$

Assume that $M \in R^{+}$ is s.t. $y_{n} \cdot y_{n} \leq M^{2}$ with $μ$ -probability $1$ for all $n \in N$ . Assume also that $σ \in R^{+}$ is s.t. $μ$ is $σ$ -sub-Gaussian, i.e., for any $α \in Y$ , $n \in N$ and $x \in X$

$E x y \sim μ [exp (α \cdot (y_{n} - f^{*} (x))) | x_{n} = x, H_{n}] \leq exp (\frac{σ^{2} (α \cdot α)}{2})$

Fix $ϵ \in R^{+}$ , $δ \in (0, 1)$ . Define $β : R^{+} \to R$ by

$β (t) := C_{B .1} (σ^{2} ln \frac{N (F, ϵ^{- 2} I_{d})}{δ} + ϵ t (M + σ ln \frac{e t}{δ}))$

Then,

$Pr x y \sim μ [f^{*} \notin \infty ⋂ n = 0 {C S}^{F} [{x y}_{: n}, β (n + 1)^{- 1} I_{d}]] \leq δ$

Note that we removed a square root in the definition of $β$ compared to equation (7) in Osband and Van Roy. This only makes $β$ larger (up to a constant factor) and therefore, only makes the claim weaker.

The proposition below appeared in Osband and Van Roy as “Lemma 1”.

Proposition B.2 (Osband-Van Roy)

Consider a set $X$ , the vector space $Y = R^{d}$ for some $d \in N$ and some $F \subseteq {X \to Y}$ . Consider also some $x \in X^{ω}$ , $y \in Y^{ω}$ , $T \in N^{+}$ , $N \in N$ , $θ \in R^{+}$ and a nondecreasing sequence ${β_{n} \in R^{+}}_{n \in N}^{+}$ . For any $n \in N$ , define $F_{n}$ by

$F_{n} := {C S}^{F} [{x y}_{: n}, β_{n}^{- 1} I_{d}]$

Then,

$N - 1 \sum l = 0 T - 1 \sum m = 0 [[W^{F_{l T}} (x_{l T + m}, I_{d}) > θ]] \leq {dim}_{R V O} F \cdot (4 θ^{- 2} β_{N T - 1} + T)$

Dimensional regret without resets

Results

Proofs

Appendix