Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)
In offline learning and certain varieties of online learning, the agnostic setting is well understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know of is Shalizi 2009; however, it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.
Quasi-Bayesianism was invented to circumvent this problem. Instead of considering the agnostic setting, we consider a “quasi-realizable” setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. However, so far I haven’t studied quasi-Bayesian learning algorithms much, so how do we know the quasi-realizable setting is actually easier than the agnostic one? Here is a simple example to demonstrate that it is.
Consider a multi-armed bandit where the arm space is [0,1]. First, consider the following realizable setting: the reward is a deterministic function r:[0,1]→[0,1] which is known to be a polynomial of degree at most d. In this setting, learning is fairly easy: it is enough to sample d+1 arms in order to recover the reward function and find the optimal arm. This is a special case of the general observation that learning is tractable when the hypothesis space is low-dimensional in the appropriate sense.
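For concreteness, here is a minimal Python sketch of that recovery step (my own illustration, not part of the argument): it assumes the reward is available as a callable `r`, samples d+1 evenly spaced arms, interpolates the polynomial with numpy, and maximizes it over a grid. The function name, the choice of sample points, and the grid-search maximization are all illustrative choices.

```python
import numpy as np

def recover_and_optimize(r, d, num_grid=10_001):
    """Recover a degree-<=d polynomial reward from d+1 samples and
    return an (approximately) optimal arm in [0, 1]."""
    # Any d+1 distinct arms determine the polynomial uniquely.
    arms = np.linspace(0.0, 1.0, d + 1)
    rewards = np.array([r(a) for a in arms])

    # With exactly d+1 points, the degree-d fit is exact interpolation.
    poly = np.poly1d(np.polyfit(arms, rewards, deg=d))

    # Maximize the recovered polynomial over [0, 1]; a grid search suffices
    # for illustration (one could also check the roots of the derivative).
    grid = np.linspace(0.0, 1.0, num_grid)
    return grid[np.argmax(poly(grid))]

# Example: the true reward is a degree-2 polynomial peaking at 0.3.
print(recover_and_optimize(lambda x: 1.0 - (x - 0.3) ** 2, d=2))  # ~0.3
```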
Now, consider a closely related agnostic setting. We can still assume the reward function is deterministic, but nothing is known about its shape, and we are still expected to find the optimal arm. The arms form a low-dimensional space (one-dimensional, actually), but this helps little. It is impossible to predict anything about any arm except those we have already tested, so guaranteeing convergence to the optimal arm is also impossible.
Finally, consider the following quasi-realizable setting: each incomplete hypothesis in our class states that the reward function is lower-bounded by a particular polynomial f:[0,1]→[0,1] of degree at most d. Our algorithm needs to converge to a reward which is at least the largest of the maxima of the correct lower bounds. So, the desideratum is weaker than in the agnostic case, but we still impose no hard constraint on the reward function. In this setting, we can use the following algorithm. On each step, fit the most optimistic lower bound to the arms that were already sampled, find its maximum, and sample that arm next. I haven’t derived the convergence rate, but it seems probable that the algorithm converges rapidly (for low d). This is likely a special case of some general result on quasi-Bayesian learning with low-dimensional priors.
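Here is a rough Python sketch of one way this algorithm could be implemented (an illustrative construction under my own assumptions, not a definitive version). “Most optimistic” is made concrete by solving, for each candidate arm on a grid, a small linear program for the largest value that any degree-≤d polynomial lower bound consistent with the observations can attain there, and then sampling the arm with the highest such value; since max over bounds of max over arms equals max over arms of max over bounds, this matches the description above. The random seed arms, the grid, and the example (non-polynomial) reward function are all assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def design_row(x, d):
    """Monomial features [1, x, ..., x^d] of a single arm x."""
    return np.array([x ** j for j in range(d + 1)])

def most_optimistic_arm(sampled_arms, sampled_rewards, d, grid):
    """One step of the sketched algorithm: among degree-<=d polynomials f with
    0 <= f <= 1 on the grid and f(a) <= observed reward at each sampled arm a,
    compute the largest achievable value f(x) for every candidate arm x, and
    return the arm where that optimistic value is highest."""
    Phi_sampled = np.array([design_row(a, d) for a in sampled_arms])
    Phi_grid = np.array([design_row(x, d) for x in grid])

    # Shared constraints: f(a_i) <= r_i at sampled arms, 0 <= f <= 1 on the grid.
    A_ub = np.vstack([Phi_sampled, Phi_grid, -Phi_grid])
    b_ub = np.concatenate([sampled_rewards, np.ones(len(grid)), np.zeros(len(grid))])

    best_arm, best_value = grid[0], -np.inf
    for x, phi_x in zip(grid, Phi_grid):
        # Maximize f(x) = phi_x . c, i.e. minimize -phi_x . c over coefficients c.
        res = linprog(-phi_x, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (d + 1))
        if res.success and -res.fun > best_value:
            best_arm, best_value = x, -res.fun
    return best_arm

# Illustrative run: the true reward need not be a polynomial at all.
def true_reward(x):
    return 0.5 + 0.4 * np.sin(6.0 * x)  # example reward, not from the text

rng = np.random.default_rng(0)
d, grid = 2, np.linspace(0.0, 1.0, 101)
arms = list(rng.uniform(0.0, 1.0, size=d + 1))  # seed arms (illustrative choice)
rewards = [true_reward(a) for a in arms]
for _ in range(20):
    a = most_optimistic_arm(arms, np.array(rewards), d, grid)
    arms.append(a)
    rewards.append(true_reward(a))
print(arms[-1], rewards[-1])
```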