It is an interesting problem to write explicit regret bounds for reinforcement learning with a prior that is the Solomonoff prior or something similar. Of course, any regret bound requires dealing with traps. The simplest approach is, leaving only environments without traps in the prior (there are technical details involved that I won’t go into right now). However, after that we are still left with a different major problem. The regret bound we get is very weak. This happens because the prior contains sets of hypotheses of the form “program template P augmented by a hard-coded bit string b of length n”. Such a set contains 2n hypotheses, and its total probability mass is approximately 2−|P|, which is significant for short P (even when n is large). However, the definition of regret requires out agent to compete against a hypothetical agent that knows the true environment, which in this case means knowing both P and b. Such a contest is very hard since learning n bits can take much time for large n.
Note that the definition of regret depends on how we decompose the prior into a convex combination of individual hypotheses. To solve this problem, I propose redefining regret in this setting by grouping the hypotheses in a particular way. Specifically, in algorithmic statistics there is the concept of sophistication. The sophistication of a bit string x is defined as follows. First, we consider the Kolmogorov complexity K(x) of x. Then we consider pairs (Q,y) where Q is a program, y is a bit string, Q(y)=x and |Q|+|y|≤K(x)+O(1). Finally, we minimize over |Q|. The minimal |Q| is called the sophistication of x. For our problem, we are interested in the minimal Q itself: I call it the “sophisticated core” of x. We now group the hypotheses in our prior by sophisticated cores, and define (Bayesian) regret w.r.t. this grouping.
Coming back to our problematic set of hypotheses, most of it is now grouped into a single hypothesis, corresponding to the sophisticated core of P. Therefore, the reference agent in the definition of regret now knows P but doesn’t know b, making it feasible to compete with it.
It is an interesting problem to write explicit regret bounds for reinforcement learning with a prior that is the Solomonoff prior or something similar. Of course, any regret bound requires dealing with traps. The simplest approach is, leaving only environments without traps in the prior (there are technical details involved that I won’t go into right now). However, after that we are still left with a different major problem. The regret bound we get is very weak. This happens because the prior contains sets of hypotheses of the form “program template P augmented by a hard-coded bit string b of length n”. Such a set contains 2n hypotheses, and its total probability mass is approximately 2−|P|, which is significant for short P (even when n is large). However, the definition of regret requires out agent to compete against a hypothetical agent that knows the true environment, which in this case means knowing both P and b. Such a contest is very hard since learning n bits can take much time for large n.
Note that the definition of regret depends on how we decompose the prior into a convex combination of individual hypotheses. To solve this problem, I propose redefining regret in this setting by grouping the hypotheses in a particular way. Specifically, in algorithmic statistics there is the concept of sophistication. The sophistication of a bit string x is defined as follows. First, we consider the Kolmogorov complexity K(x) of x. Then we consider pairs (Q,y) where Q is a program, y is a bit string, Q(y)=x and |Q|+|y|≤K(x)+O(1). Finally, we minimize over |Q|. The minimal |Q| is called the sophistication of x. For our problem, we are interested in the minimal Q itself: I call it the “sophisticated core” of x. We now group the hypotheses in our prior by sophisticated cores, and define (Bayesian) regret w.r.t. this grouping.
Coming back to our problematic set of hypotheses, most of it is now grouped into a single hypothesis, corresponding to the sophisticated core of P. Therefore, the reference agent in the definition of regret now knows P but doesn’t know b, making it feasible to compete with it.