Two deterministic toy models for regret bounds of infra-Bayesian bandits. The lesson seems to be that equalities are much easier to learn than inequalities.
Model 1: Let $A$ be the space of arms, $O$ the space of outcomes, $r: A \times O \to \mathbb{R}$ the reward function, $X$ and $Y$ vector spaces, $H \subseteq X$ the hypothesis space and $F: A \times O \times H \to Y$ a function s.t. for any fixed $a \in A$ and $o \in O$, $F(a,o): H \to Y$ extends to some linear operator $T_{a,o}: X \to Y$. The semantics of hypothesis $h \in H$ is defined by the equation $F(a,o,h) = 0$ (i.e. an outcome $o$ of action $a$ is consistent with hypothesis $h$ iff this equation holds).
For any $h \in H$, denote by $V(h)$ the reward promised by $h$:

$$V(h) := \max_{a \in A} \, \min_{o \in O \,:\, F(a,o,h) = 0} r(a,o)$$
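To make the definition concrete, here is one assumed instantiation (not part of the original argument): take $Y = \mathbb{R}$, $X = \mathbb{R}^{d+1}$, hypotheses of the form $h = (w, 1)$ with $w \in \mathbb{R}^d$, a feature map $\varphi: A \to \mathbb{R}^d$, outcomes $o \in \mathbb{R}$, reward $r(a,o) := o$, and $F(a,o,h) := o\, h_{d+1} - \varphi(a) \cdot (h_1, \dots, h_d)$, which is linear in $h$ and vanishes iff $o = \varphi(a) \cdot w$. Each hypothesis then predicts the outcome of every arm exactly, the inner minimum ranges over the single consistent outcome, and $V(h) = \max_{a \in A} \varphi(a) \cdot w$ is simply the best payoff the hypothesis predicts.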
Then, there is an algorithm with mistake bound $\dim X$, as follows. On round $n \in \mathbb{N}$, let $G_n \subseteq H$ be the set of unfalsified hypotheses. Choose $h_n \in G_n$ optimistically, i.e.

$$h_n := \arg\max_{h \in G_n} V(h)$$
Choose the arm $a_n$ recommended by hypothesis $h_n$ (i.e. the maximizer in the definition of $V(h_n)$). Let $o_n \in O$ be the outcome we observed, $r_n := r(a_n, o_n)$ the reward we received and $h^* \in H$ the (unknown) true hypothesis.
If $r_n \geq V(h_n)$ then also $r_n \geq V(h^*)$ (since $h^* \in G_n$ and hence $V(h^*) \leq V(h_n)$) and therefore $a_n$ wasn't a mistake.
If $r_n < V(h_n)$ then $F(a_n, o_n, h_n) \neq 0$ (if we had $F(a_n, o_n, h_n) = 0$ then the minimization in the definition of $V(h_n)$ would include $r(a_n, o_n)$, forcing $V(h_n) \leq r_n$). Hence, $h_n \notin G_{n+1} = G_n \cap \ker T_{a_n, o_n}$. This implies $\dim \operatorname{span}(G_{n+1}) < \dim \operatorname{span}(G_n)$ (since $h_n \in \operatorname{span}(G_n)$ but $h_n \notin \ker T_{a_n, o_n}$, the subspace $\operatorname{span}(G_n) \cap \ker T_{a_n, o_n}$ is a proper subspace of $\operatorname{span}(G_n)$). Obviously this can happen at most $\dim X$ times.
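Here is a minimal sketch of this optimistic procedure, under assumptions not in the text above: a finite hypothesis class given as vectors in $X = \mathbb{R}^D$, finite arm and outcome sets, and the operators $T_{a,o}$ supplied explicitly as matrices. All names (`promised_value`, `run`, `environment`) are illustrative.

```python
import numpy as np

def promised_value(h, arms, outcomes, T, reward, tol=1e-9):
    """V(h) = max_a min_{o : F(a,o,h)=0} r(a,o); also returns the argmax arm."""
    best_val, best_arm = -np.inf, None
    for a in arms:
        consistent = [reward(a, o) for o in outcomes
                      if np.linalg.norm(T[a, o] @ h) <= tol]
        if not consistent:
            continue  # no outcome of arm a is consistent with h
        val = min(consistent)
        if val > best_val:
            best_val, best_arm = val, a
    return best_val, best_arm

def run(hypotheses, arms, outcomes, T, reward, environment, n_rounds, tol=1e-9):
    G = list(hypotheses)  # G_n: the unfalsified hypotheses
    mistakes = 0
    for _ in range(n_rounds):
        # optimistic choice: h_n maximizes the promised reward V(h) over G_n
        vals = [promised_value(h, arms, outcomes, T, reward, tol) for h in G]
        i = int(np.argmax([v for v, _ in vals]))
        V_n, a_n = vals[i]            # h_n := G[i]; act on its recommended arm a_n
        o_n = environment(a_n)        # the (possibly adversarial) outcome of pulling a_n
        r_n = reward(a_n, o_n)
        if r_n < V_n:
            mistakes += 1             # by the argument above, at most dim X of these
        # keep only hypotheses with F(a_n, o_n, h) = 0, i.e. h in ker T_{a_n, o_n}
        G = [h for h in G if np.linalg.norm(T[a_n, o_n] @ h) <= tol]
    return mistakes
```

The only state that matters for the mistake bound is the span of `G`: each mistaken round intersects it with $\ker T_{a_n, o_n}$, a subspace it was not already contained in, so its dimension strictly drops, as argued above.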
Model 2: Let the spaces of arms and hypotheses be

$$A := H := S^d := \{x \in \mathbb{R}^{d+1} \mid \|x\| = 1\}$$
Let the reward $r \in \mathbb{R}$ be the only observable outcome, and let the semantics of hypothesis $h \in S^d$ be $r \geq h \cdot a$. Then, the sample complexity cannot be bounded by a polynomial of degree that doesn't depend on $d$. This is because Murphy can choose the strategy of producing reward $1 - \epsilon$ whenever $h \cdot a \leq 1 - \epsilon$. In this case, whatever arm you sample, in each round you can only exclude a ball of radius $\approx \sqrt{2\epsilon}$ around the sampled arm. The number of such balls that fit into the unit sphere is $\Omega(\epsilon^{-d/2})$. So, normalized regret below $\epsilon$ cannot be guaranteed in fewer than that many rounds.
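To fill in the radius (a short check, not in the original text): observing reward $1 - \epsilon$ after pulling arm $a$ falsifies exactly the hypotheses $h$ with $h \cdot a > 1 - \epsilon$, and for unit vectors

$$\|h - a\|^2 = \|h\|^2 + \|a\|^2 - 2\, h \cdot a = 2 - 2\, h \cdot a, \qquad \text{so} \qquad h \cdot a > 1 - \epsilon \iff \|h - a\| < \sqrt{2\epsilon}.$$

A standard packing argument gives $\Omega(\epsilon^{-d/2})$ disjoint caps of this radius on $S^d$ (for fixed $d$, as $\epsilon \to 0$). Since Murphy may place the true hypothesis in any of these caps, and the learner only achieves reward above $1 - \epsilon$ by sampling inside the cap containing it, no algorithm can guarantee normalized regret below $\epsilon$ in fewer rounds.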