Actually, I do know a stronger property of LI / dominant forecasting. In the notation of my paper, we have the following:
Let $\{G_k\}$ be a family of gamblers and $\xi, \zeta$ probability distributions supported on $\mathbb{N}$. Then, there exists a forecaster $F$ s.t. for any $k, m, b \in \mathbb{N}$ ($b > 0$) and $x \in O^\omega$, if the first condition holds then the second condition holds:

$$\inf_{n \leq m} \Sigma V^{\min}_{G_k F\,n}(x_{:n-1}) > -b$$

$$\sup_{n \leq m} \Sigma V^{\max}_{G_k F\,n}(x_{:n-1}) \leq \frac{\sum_{j,c} \xi(j)\,\zeta(c)\,c}{\xi(k) \sum_{c > b} \zeta(c)}$$
The above is not hard to see from the proof of Theorem 2, if you keep track of the finite terms. Now, this is not really a bound on the time of convergence, it is something like a bound on the number of mistakes made, which is completely analogous to standard guarantees in online learning.
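To make the shape of the bound concrete, here is a worked instance under a purely illustrative choice of priors (geometric $\xi$ and $\zeta$; this particular choice is mine, not from the paper):

$$\xi(k) = \zeta(k) = 2^{-k} \;(k \geq 1) \quad\Longrightarrow\quad \frac{\sum_{j,c} \xi(j)\,\zeta(c)\,c}{\xi(k) \sum_{c > b} \zeta(c)} = \frac{\sum_{c \geq 1} c\,2^{-c}}{2^{-k} \cdot 2^{-b}} = 2^{k+b+1}.$$

So under this choice, a gambler sitting far down the prior can win an amount exponential in its index $k$ (and in the budget $b$) before the forecaster catches up, which is exactly the "mistake bound rather than convergence time" flavor described above.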
If you’re running LIA and stop enumerating traders at N, you will get the same dominance property, only for traders up to N.
The fact that VC dimension is non-decreasing for a family of nested classes is a tautology. There is nothing special about the order of the hypotheses in SRM: the order (and weighting) of hypotheses is external information that reflects your prior about the correct hypothesis. Similarly, in LIA we can order the traders by description complexity, same as in Solomonoff induction, because we expect simpler patterns to be more important than complex patterns. This is nothing but the usual Occam’s razor. Or, we can consider more specialized dominant forecasting, with gamblers and ordering selected according to domain-specific considerations.
I don’t think that stopping at N does some kind of “disproportionate” damage to the LI. For example, Theorem 4.2.1 in the LI paper requires one trader for each ϵ>0 and each sequence of theorems. If this trader is in the selection, then the probabilities of those theorems will converge to 1 within ϵ. Similarly, in my paper you need the gambler ΓSMk to ensure your forecaster converges to the incomplete model M within 1/k.
You can do SRM for any sequence of nested classes of finite VC dimension. For example, if you have a countable set of hypotheses $\{h_n\}$, you can take the classes to be $H_n := \{h_m\}_{m<n}$. This is just as arbitrary as in LI. The thing is, the error bound that SRM satisfies depends on the class in which the reference hypothesis actually lies. So, relative to a reference hypothesis that only appears in a very high class, SRM can converge very slowly (require a very large sample size). This is completely analogous to the inequality I gave before, where $\xi(k)$ appears in the denominator of the bound, so gamblers that are “far away” in the “prior” can win a lot of bets before the forecaster catches up. SRM is useful iff you a priori expect “low” hypotheses to be good approximations. For example, suppose you want to fit a polynomial to some data but you don’t know what degree to use. SRM gives you a rule that determines the degree automatically, from the data itself. However, the reason this rule performs well is that we expect most functions in the real world to be relatively “smooth” and therefore well-approximable by a low-degree polynomial.
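To make the polynomial example concrete, here is a minimal sketch of an SRM-style degree-selection rule. The function name, defaults, and the particular capacity penalty are my own illustrative choices (the penalty is only in the spirit of the VC bound, not Vapnik's exact expression):

```python
import numpy as np

def srm_select_degree(x, y, max_degree=10, eta=0.05):
    """SRM-style selection over nested classes S_0 ⊂ S_1 ⊂ ..., where S_d is the
    polynomials of degree ≤ d. Picks the degree minimizing empirical risk plus a
    capacity penalty; the penalty is illustrative, not Vapnik's exact bound."""
    l = len(x)
    best_score, best_degree, best_coeffs = np.inf, None, None
    for d in range(max_degree + 1):
        coeffs = np.polyfit(x, y, d)                          # empirical risk minimizer within S_d
        emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)  # term (1): empirical risk
        h = d + 1                                             # capacity of S_d (free parameters), standing in for VC dimension
        penalty = np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)  # term (2): grows with the class index
        if emp_risk + penalty < best_score:
            best_score, best_degree, best_coeffs = emp_risk + penalty, d, coeffs
    return best_degree, best_coeffs

# Usage: data from a quadratic plus noise; the rule should settle on a low degree
# even though higher degrees achieve smaller empirical risk.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + 0.1 * rng.standard_normal(len(x))
print(srm_select_degree(x, y)[0])
```

The point of the sketch is just that the ordering of the classes enters only through the penalty term, which is why the rule works well precisely when the data-generating function really is well-approximated by a low-index class.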
Ordering by description complexity is perfectly computable in itself: it just means that we fix a UTM, represent traders by programs (on which we impose a time bound, otherwise it really is uncomputable), and weight each trader by 2^{-program length}. It would be interesting to find some good property this thing has. If we don’t impose a time bound (so we are uncomputable), then the Bayesian analogue is Solomonoff induction, which has the nice property that it only “weakly” depends on the choice of UTM. Will different UTMs give “similar” LIs? Off the top of my head, I have no idea! When we add the time bound it gets messier, since running time is affected by translation between UTMs, so I’m not sure how to formulate an “invariance” property even in the Bayesian case. Looking through Schmidhuber’s paper on the speed prior, I see ey do have some kind of invariance (section 4) but I’m too lazy to understand the details right now.
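For what it's worth, the weighting step itself is trivial to write down. A minimal sketch, where the byte length of a program's text stands in for description length on the fixed UTM, and the example trader names are made up:

```python
def complexity_weights(programs):
    """Weight each trader program by 2^(-description length) and normalize.
    Byte length stands in for program length on a fixed UTM; actually running
    the programs (under a per-step time bound) is a separate matter."""
    raw = {p: 2.0 ** (-len(p)) for p in programs}
    total = sum(raw.values())               # normalize so the weights sum to 1
    return {p: w / total for p, w in raw.items()}

# Usage: three made-up trader programs; the shorter the program, the larger its weight.
traders = [b"hold", b"buy_if_provable", b"some_long_complicated_arbitrage_strategy"]
print(complexity_weights(traders))
```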
Thanks for pointing out that bound. I will think about it. (BTW, if at any point you don’t want to continue this back-and-forth exchange, just let me know, otherwise I will probably keep responding because I always find I have things to say)
My point regarding LIA was that the theorems in the LI paper follow from dominance over all e.c. traders, and there are countably many e.c. traders. If you stop at N, all those theorems break. Of course, you will still get something out of dominating the N traders, but you’d have to go back to the blackboard to figure out what it is because you can no longer get the answer from the paper. (And the theorems give you infinitely many properties, of which you’ll retain only finitely many, so in a sense you will lose “almost all” of the desired results)
Surely there is something special about the order of hypotheses in SRM? Vapnik’s 1998 book introduces SRM (start of Ch. 6) with a decomposition of actual risk into (1) empirical risk and (2) a term depending on VC dimension, analogous to the bias-variance decomposition of generalization error. Vapnik says SRM is for the case where sample size is small enough that term (2) is non-negligible. So already, from the start, we are trying to solve a problem where “simpler = better” appears mathematically in an expression for the error we want to minimize. Then he introduces a rule n(l) for selecting a class Sn based on sample size l, and proves an asymptotic convergence rate (Thm. 6.2) which depends on n(l) having a certain property.
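For reference, the decomposition in question has roughly this form (I'm quoting the classification-case bound from memory, so treat the constants as indicative only): with probability at least $1-\eta$,

$$R(h) \;\leq\; \underbrace{R_{\mathrm{emp}}(h)}_{(1)} \;+\; \underbrace{\sqrt{\frac{d\left(\ln\frac{2l}{d}+1\right)-\ln\frac{\eta}{4}}{l}}}_{(2)},$$

where $d$ is the VC dimension of the class $S_n$. SRM minimizes the sum of (1) and (2) over $n$, so the ordering of the classes enters precisely through term (2).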
It is certainly true that you could order LIA’s traders by complexity (well, maybe not computably...), and I would be interested in the results. Results from some particular “good” ordering seem like the real determinants of whether LIA-like methods would be good in practice. (Since if we do not fix an ordering we can only get results that hold even for bad/”adversarial” orderings that fill early slots with nonsensical strategies)