In The Real Rules have No Exceptions, Said says (apparently speaking of instrumental rules, not epistemic ones):
Prefer simplicity in your rules. Be vigilant that your rules do not grow too complex; make sure you are not relaxing the legitimacy criteria of your exceptions. Periodically audit your rules, inspecting them for complexity; try to formulate simpler versions of complex rules.
This is a very plausible principle. I have had similar thoughts myself. The idea seems to have merit in practice. But is there any theoretical analogue?
Minimum description length (Occam’s Razor) is an epistemic principle. In Bayesian terms, probabilities must sum to one, which translates information-theoretically (via optimal encodings) to the idea that there are only so many short codes, so we have to assign them carefully to the most probable things.
However, expected utility is not a limited resource in this way. Updating to think an option is better than previously thought doesn’t necessarily make other options worse. Selecting policies is not so different from selecting raw actions. And for Bayesian decision theory, it doesn’t really matter whether a policy is specified as a big table of observation/action pairs, vs an elegantly specified rule.
Optimality cares not for elegance!
Yet, even in the relatively formal world of machine learning, the practice seems contrary to this. When you are optimizing a neural network, you don’t actually care that much whether it’s something like a hypothesis (making predictions) or something like a policy (carrying out actions). You apply the same kind of regularization either way, as far as I understand (regularization being the machine-learner’s generalization of Occam). (Correction: this seems not to be the case.)
We might say that this is because (in some sense) the instrumental uncertainty and the epistemic uncertainty are actually being wrapped up together. But (1) this reply seems overly quick to me at present, and I’d want to understand in detail whether this can be justified; (2) I’m curious if there is a purely instrumental version of Occam to be articulated; it seems intuitively really plausible to me, though technically quite mysterious.
So: is it possible to formulate an instrumental version of Occam? Can we justify a simplicity bias in our policies?
First, if we take PSRL as our model algorithm, then at any given time we follow a policy optimal for some hypothesis sampled out of the belief state. Since our prior favors simple hypotheses, the hypothesis we sampled is likely to be simple. But, given a hypothesis M of description complexity K(M), the corresponding optimal policy π∗M has description complexity K(π∗M)≤K(M)+O(1), since the operation “find the optimal policy” has O(1) description complexity.
Taking computational resource bounds into account makes things more complicated. For some M computing π∗M might be intractable, even though M itself is “efficiently computable” in some sense. For example we can imagine an M that has exponentially many states plus some succinct description of the transition kernel.
One way to deal with it is using some heuristic h for optimization. But, then the description complexity is still K(h(M))≤K(M)+K(h).
Another way to deal with it is restricting ourselves to the kind of hypotheses for which π∗M is tractable, but allowing incomplete/fuzzy hypotheses, so that we can still deal with environments whose complete description falls outside this class. For example, this can take the form of looking for some small subset of features that has predictable behavior that can be exploited. In this approach, the description complexity is probably still something like K(π∗M)≤K(M)+O(1), where this time M is incomplete/fuzzy (but I don’t yet know how PSRL for incomplete/fuzzy hypothesis should work).
Moreover, using incomplete models we can in some sense go in the other direction, from policy to model. This might be a good way to think of model-based RL. In actor-critic algorithms, our network learns a pair consisting of a value function V and a policy π. We can think of such a pair as an incomplete model that is defined by the Bellman inequality interpreted as a constraint on the transition kernel T (or T and the reward function R):
V(s)≤(1−γ)R(s)+γEs′∼Tπ(s)[s′]
Assuming that our incomplete prior assigns weight K(V,π) to this incomplete hypothesis, we get a sort of Occam’s razor for policies.
In model-free RL, policy-based methods choose policies by optimizing a noisy estimate of the policy’s value. This is analogous to optimizing a noisy estimate of prediction accuracy (i.e., accuracy on the training data) to choose a predictive model. So we often need to trade variance for bias in the policy-learning case (i.e., shrink towards simpler policies) just as in the predictive modeling case.
Maybe problems that don’t have simple solutions (i.e. all their solutions have a large description length) are usually intractable for us. If so, given a problem that we’re trying to solve, the assumption that it has simple solutions is probably either useful (if it’s true) or costless (if it isn’t). In other words: “look for your missing key under the lamppost, not because it’s probably there, but because you’ll only ever find it if it’s there”.
One way of solving the problem of action selection is to treat our own actions as random variables, then build a probabilistic model, and condition that probability distribution on things going well for us in the future. See, for example, Planning By Probabilistic Inference, human brains (I would argue), upside-down RL (kinda, I think?), etc.
In this formulation, instrumental Occam’s razor is just a special case of epistemic Occam’s razor, it seems to me.
Yes, I agree with that. But (as I’ve said in the past) this formalism doesn’t do it for me. I have yet to see something which strikes me as a compelling argument in its favor.
So in the context of planning by probabilistic inference, instrumental occam seems almost like a bug rather than a feature—the unjustified bias toward simpler policies doesn’t seem to serve a clear purpose. It’s just an assumption.
Granted, the fact that I intuitively feel there should be some kind of instrumental occam is a point in favor of such methods in some sense.
At some level, any system that does foresight-based planning has to treat its own actions as part of its world-model. In that context, the idea of “Treat my own actions as random variables—just like everything else in the world—and condition on good things happening in the future” is either exactly equivalent to Monte Carlo Tree Search, or awfully close, I think.
But whereas AlphaGo does MCTS in a simplistic, one-timestep-at-a-time manner, the vision here is to bring to bear the full power of the world-modeling framework—hierarchy, analogizing, compositionality, abstraction, etc. etc.—and apply all those tools to the task of MCTS action planning. Thus things like imitating conspecifics, and taking ideas from the world (“I see that wall is not fixed in place; so maybe I can move it!”), and building hierarchical probabilistic plans over arbitrary time-scales (“Maybe I can squeeze through that hole and see what’s on the other side”), and interleaving multiple plans, and on and on—all these things and more arise organically in this kind of framework. (...if the predictive-world-modeling tools are good enough, of course.)
Maybe there are other ways to formulate action selection that has all those features, but I don’t know them.
(Or I guess the simpler “compelling argument in its favor” is “It appears to be very effective in practice”.)
Hmm. On further reflection, I don’t think the brain really uses Occam’s razor at all, either for action-selection or world-modeling, unless you define “complexity” in some way that makes it tautological. (As Charlie Steiner writes, you have to search through the infinite space of possibilities in some order, with some prior, and it’s both tempting and vacuous to say that the earlier choices are “simpler in the context of this algorithm” or whatever).
I think a more typical dynamic of brain algorithms is what Dileep George calls “memorize—generalize”, i.e. memorize a little snippet of pattern / idea (at some level of abstraction), and then see if the same pattern works on other contexts, or when glued together with other patterns, or when analogized to a different area.
I don’t currently think brain world-modeling algorithms have anything analogous to regularization.