Two deterministic toy models for regret bounds of infra-Bayesian bandits. The lesson seems to be that equalities are much easier to learn than inequalities.
Model 1: Let $A$ be the space of arms, $O$ the space of outcomes, $r: A \times O \to \mathbb{R}$ the reward function, $X$ and $Y$ vector spaces, $H \subseteq X$ the hypothesis space and $F: A \times O \times H \to Y$ a function s.t. for any fixed $a \in A$ and $o \in O$, $F(a,o): H \to Y$ extends to some linear operator $T_{a,o}: X \to Y$. The semantics of hypothesis $h \in H$ is defined by the equation $F(a,o,h) = 0$ (i.e. an outcome $o$ of action $a$ is consistent with hypothesis $h$ iff this equation holds).
For any $h \in H$, denote by $V(h)$ the reward promised by $h$:

$$V(h) := \max_{a \in A} \, \min_{o \in O \,:\, F(a,o,h) = 0} r(a,o)$$
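To make the definition concrete, here is one assumed instantiation (not part of the original argument): take $Y = \mathbb{R}$, $X = \mathbb{R}^{d+1}$, hypotheses of the form $h = (w, 1)$ with $w \in \mathbb{R}^d$, a feature map $\varphi: A \to \mathbb{R}^d$, outcomes $o \in \mathbb{R}$, reward $r(a,o) := o$, and $F(a,o,h) := o\, h_{d+1} - \varphi(a) \cdot (h_1, \dots, h_d)$, which is linear in $h$ and vanishes iff $o = \varphi(a) \cdot w$. Each hypothesis then predicts the outcome of every arm exactly, the inner minimum ranges over the single consistent outcome, and $V(h) = \max_{a \in A} \varphi(a) \cdot w$ is simply the best payoff the hypothesis predicts.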
Then, there is an algorithm with mistake bound $\dim X$, as follows. On round $n \in \mathbb{N}$, let $G_n \subseteq H$ be the set of unfalsified hypotheses. Choose $h_n \in G_n$ optimistically, i.e.

$$h_n := \arg\max_{h \in G_n} V(h)$$
Choose the arm $a_n$ recommended by hypothesis $h_n$ (i.e. the maximizer in the definition of $V(h_n)$). Let $o_n \in O$ be the outcome we observed, $r_n := r(a_n, o_n)$ the reward we received and $h^* \in H$ the (unknown) true hypothesis.
If $r_n \geq V(h_n)$ then also $r_n \geq V(h^*)$ (since $h^* \in G_n$ and hence $V(h^*) \leq V(h_n)$) and therefore $a_n$ wasn't a mistake.
If $r_n < V(h_n)$ then $F(a_n, o_n, h_n) \neq 0$ (if we had $F(a_n, o_n, h_n) = 0$ then the minimization in the definition of $V(h_n)$ would include $r(a_n, o_n)$, forcing $V(h_n) \leq r_n$). Hence, $h_n \notin G_{n+1} = G_n \cap \ker T_{a_n, o_n}$. This implies $\dim \operatorname{span}(G_{n+1}) < \dim \operatorname{span}(G_n)$ (since $h_n \in \operatorname{span}(G_n)$ but $h_n \notin \ker T_{a_n, o_n}$, the subspace $\operatorname{span}(G_n) \cap \ker T_{a_n, o_n}$ is a proper subspace of $\operatorname{span}(G_n)$). Obviously this can happen at most $\dim X$ times.
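Here is a minimal sketch of this optimistic procedure, under assumptions not in the text above: a finite hypothesis class given as vectors in $X = \mathbb{R}^D$, finite arm and outcome sets, and the operators $T_{a,o}$ supplied explicitly as matrices. All names (`promised_value`, `run`, `environment`) are illustrative.

```python
import numpy as np

def promised_value(h, arms, outcomes, T, reward, tol=1e-9):
    """V(h) = max_a min_{o : F(a,o,h)=0} r(a,o); also returns the argmax arm."""
    best_val, best_arm = -np.inf, None
    for a in arms:
        consistent = [reward(a, o) for o in outcomes
                      if np.linalg.norm(T[a, o] @ h) <= tol]
        if not consistent:
            continue  # no outcome of arm a is consistent with h
        val = min(consistent)
        if val > best_val:
            best_val, best_arm = val, a
    return best_val, best_arm

def run(hypotheses, arms, outcomes, T, reward, environment, n_rounds, tol=1e-9):
    G = list(hypotheses)  # G_n: the unfalsified hypotheses
    mistakes = 0
    for _ in range(n_rounds):
        # optimistic choice: h_n maximizes the promised reward V(h) over G_n
        vals = [promised_value(h, arms, outcomes, T, reward, tol) for h in G]
        i = int(np.argmax([v for v, _ in vals]))
        V_n, a_n = vals[i]            # h_n := G[i]; act on its recommended arm a_n
        o_n = environment(a_n)        # the (possibly adversarial) outcome of pulling a_n
        r_n = reward(a_n, o_n)
        if r_n < V_n:
            mistakes += 1             # by the argument above, at most dim X of these
        # keep only hypotheses with F(a_n, o_n, h) = 0, i.e. h in ker T_{a_n, o_n}
        G = [h for h in G if np.linalg.norm(T[a_n, o_n] @ h) <= tol]
    return mistakes
```

The only state that matters for the mistake bound is the span of `G`: each mistaken round intersects it with $\ker T_{a_n, o_n}$, a subspace it was not already contained in, so its dimension strictly drops, as argued above.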
Model 2: Let the spaces of arms and hypotheses be

$$A := H := S^d := \{x \in \mathbb{R}^{d+1} \mid \|x\| = 1\}$$
Let the reward $r \in \mathbb{R}$ be the only observable outcome, and let the semantics of hypothesis $h \in S^d$ be $r \geq h \cdot a$. Then, the sample complexity cannot be bounded by a polynomial of degree that doesn't depend on $d$. This is because Murphy can choose the strategy of producing reward $1 - \epsilon$ whenever $h \cdot a \leq 1 - \epsilon$. In this case, whatever arm you sample, in each round you can only exclude a ball of radius $\approx \sqrt{2\epsilon}$ around the sampled arm. The number of such balls that fit into the unit sphere is $\Omega(\epsilon^{-d/2})$. So, normalized regret below $\epsilon$ cannot be guaranteed in fewer than that many rounds.
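To fill in the radius (a short check, not in the original text): observing reward $1 - \epsilon$ after pulling arm $a$ falsifies exactly the hypotheses $h$ with $h \cdot a > 1 - \epsilon$, and for unit vectors

$$\|h - a\|^2 = \|h\|^2 + \|a\|^2 - 2\, h \cdot a = 2 - 2\, h \cdot a, \qquad \text{so} \qquad h \cdot a > 1 - \epsilon \iff \|h - a\| < \sqrt{2\epsilon}.$$

A standard packing argument gives $\Omega(\epsilon^{-d/2})$ disjoint caps of this radius on $S^d$ (for fixed $d$, as $\epsilon \to 0$). Since Murphy may place the true hypothesis in any of these caps, and the learner only achieves reward above $1 - \epsilon$ by sampling inside the cap containing it, no algorithm can guarantee normalized regret below $\epsilon$ in fewer rounds.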