First, if we take PSRL as our model algorithm, then at any given time we follow a policy optimal for some hypothesis sampled out of the belief state. Since our prior favors simple hypotheses, the hypothesis we sampled is likely to be simple. But given a hypothesis M of description complexity K(M), the corresponding optimal policy π*_M has description complexity K(π*_M) ≤ K(M) + O(1), since the operation “find the optimal policy” has O(1) description complexity.
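To make this concrete, here is a minimal tabular PSRL sketch (my own illustration, not from the text): at the start of each episode we sample a single transition-kernel hypothesis from a Dirichlet posterior (rewards assumed known for simplicity), compute that hypothesis's optimal policy by value iteration, and act on that one sample. The environment interface (`reset`/`step`) and all parameter choices are hypothetical; rewards are scaled by (1 − γ) to match the convention used in the Bellman inequality below.

```python
import numpy as np

def sample_mdp(counts, rng=None):
    """Sample a transition kernel from a Dirichlet posterior over visit counts."""
    rng = np.random.default_rng() if rng is None else rng
    S, A, _ = counts.shape
    T = np.zeros(counts.shape, dtype=float)
    for s in range(S):
        for a in range(A):
            T[s, a] = rng.dirichlet(counts[s, a] + 1.0)  # +1.0 = uniform Dirichlet prior
    return T

def optimal_policy(T, R, gamma=0.95, iters=500):
    """Value iteration on the sampled hypothesis; returns its greedy (optimal) policy."""
    S, A, _ = T.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = (1 - gamma) * R + gamma * T @ V  # rewards scaled by (1 - gamma), as in the text
    return Q.argmax(axis=1)

def psrl_episode(env, counts, R, horizon=50):
    """One PSRL episode: sample M from the belief state, act by pi*_M, update the belief."""
    T_sample = sample_mdp(counts)
    pi = optimal_policy(T_sample, R)
    s = env.reset()
    for _ in range(horizon):
        a = pi[s]
        s_next, reward, done = env.step(a)  # hypothetical tabular environment interface
        counts[s, a, s_next] += 1           # Bayesian update of the Dirichlet belief state
        s = s_next
        if done:
            break
    return counts
```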
Taking computational resource bounds into account makes things more complicated. For some M, computing π*_M might be intractable, even though M itself is “efficiently computable” in some sense. For example, we can imagine an M that has exponentially many states plus some succinct description of the transition kernel.
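As a toy illustration of this point (my example, not from the text): take an M whose state is a vector of n bits and whose transition kernel is the short function below. The explicit MDP has 2^n states, so tabular computation of π*_M is hopeless for large n, yet the function is a complete description of the kernel and K(M) is only about O(n).

```python
import numpy as np

def step_factored(state_bits, action, flip_prob=0.9, rng=None):
    """Succinct transition kernel over 2**n states: action a targets bit (a mod n)."""
    rng = np.random.default_rng() if rng is None else rng
    next_bits = list(state_bits)
    i = action % len(next_bits)
    if rng.random() < flip_prob:
        next_bits[i] ^= 1  # flip the targeted bit with probability flip_prob
    return tuple(next_bits)

# e.g. step_factored((0, 1, 0, 0), action=1) usually returns (0, 0, 0, 0)
```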
One way to deal with this is to use some heuristic h for optimization. But then the description complexity is still bounded, since a description of M together with a description of h yields a description of h(M): K(h(M)) ≤ K(M) + K(h) + O(1).
Another way to deal with it is restricting ourselves to the kind of hypotheses for which π*_M is tractable, but allowing incomplete/fuzzy hypotheses, so that we can still deal with environments whose complete description falls outside this class. For example, this can take the form of looking for some small subset of features whose behavior is predictable and can be exploited. In this approach, the description complexity is probably still something like K(π*_M) ≤ K(M) + O(1), where this time M is incomplete/fuzzy (but I don’t yet know how PSRL for incomplete/fuzzy hypotheses should work).
Moreover, using incomplete models we can in some sense go in the other direction, from policy to model. This might be a good way to think of model-based RL. In actor-critic algorithms, our network learns a pair consisting of a value function V and a policy π. We can think of such a pair as an incomplete model that is defined by the Bellman inequality interpreted as a constraint on the transition kernel T (or T and the reward function R):
$$V(s) \leq (1-\gamma)\,R(s) + \gamma\,\mathbb{E}_{s' \sim T_\pi(s)}[V(s')]$$
Assuming that our incomplete prior assigns this incomplete hypothesis a weight on the order of 2^{−K(V,π)}, we get a sort of Occam’s razor for policies.
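A small sketch of this reading (my own, under the same (1 − γ) reward-scaling convention as above): a pair (V, π) does not determine the environment, it only carves out the set of transition kernels T and reward functions R that satisfy the Bellman inequality at every state, and checking whether a candidate (T, R) belongs to that set is straightforward.

```python
import numpy as np

def consistent_with_incomplete_model(V, pi, T, R, gamma=0.95, tol=1e-8):
    """Return True iff (T, R) satisfies the Bellman inequality for (V, pi) at every state."""
    for s in range(len(V)):
        a = pi[s]
        # (1 - gamma) * R(s) + gamma * E_{s' ~ T_pi(s)}[V(s')]
        rhs = (1 - gamma) * R[s] + gamma * T[s, a] @ V
        if V[s] > rhs + tol:
            return False  # this (T, R) is ruled out by the incomplete hypothesis
    return True
```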