I think the asymptotic nature of LI is much more worrying than the things you compare it to. For an asymptotic result to be encouraging for practical applications, it needs to be the kind of result that, in most cases of interest where it holds, comes with good finite-time behavior. If you give an O(n^12) algorithm for some problem, this is an encouraging sign that maybe you can develop a practical algorithm. But if a problem is known not to be in P, and you then give an exponential-time algorithm for it and argue that it can probably be solved efficiently in practice because you’ve established an asymptotic bound, no one will take the claim very seriously. It seems to me that this is roughly the position logical induction is in, but lifted from efficiency all the way up to computability.
The probabilities given by logical induction converge to coherent probabilities, but there is no computable function that will tell you how long you have to run logical induction to know the eventual probability of a certain sentence within a certain precision. We know logical induction doesn’t do this because it cannot be done; if it could be, then there would be a computable set containing all provable sentences and no disprovable sentences (since every sentence would eventually be shown either not to have probability 0 or not to have probability 1), but there is no such computable set. So we know for sure that in the naive sense, there is no way, even in principle, to run logical induction long enough that you have probabilities you can trust to be reasonably accurate. Maybe a better complexity theory analogy would be if you give a PP algorithm for a problem known not to be in BPP (although there would need to be exciting advancements in complexity theory before that could happen); this gives good reasons to believe that improvements to the algorithm would make it practical to run, and if you run it enough times the majority vote will probably be correct, but it will never be practical to run it enough times for that to happen. Likewise, efficient approximations to logical induction should be able to complete each step in a reasonable amount of time, but will not be able to complete enough steps to give you accurate probabilities.
In order to take the probabilities given by logical induction at very large finite steps seriously, you would need effective asymptotic results, and since this cannot be done for the probabilities converging to coherent probabilities, there would need to be subtler senses in which the probabilities given at finite times can be taken seriously even if they are not close to the limiting probabilities. Now, the logical induction paper does give interesting subtler senses in which the probabilities given by logical induction can be taken seriously before they approach their limiting values, but the fundamental problem is that all of those results say that some pattern will eventually hold, for the same sense of “eventually” in which the probabilities eventually converge to coherent probabilities, so this does not give strong evidence that those desirable patterns can be made to emerge in any reasonable amount of time.
Alex, the difference between your point of view and my own, is that I don’t care that much about the applications of LI to formal logic. I’m much more interested in its applications to forecasting. If Fletcher is worried about the same issues as you, IMO it is not very clear from ’eir essay. On the other hand, the original paper doesn’t talk much about applications outside of formal logic, and my countercriticism might be slightly unfair since it’s grounded in a perspective that Fletcher doesn’t know about.
First, yes, the convergence property is not uniform, but it neither should nor can be uniform. This is completely analogous to usual Bayesian inference, where the speed of convergence depends on the probability of the correct hypothesis within the prior, or to the Structural Risk Minimization principle in PAC learning. This is unavoidable: if two models only start to differ at a very late point in time, there is no way to distinguish between them before that point. Human brains have the same limitation: it takes you more time to notice a complex pattern than a simple one.
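A minimal sketch of the Bayesian analogue of this point (the hypothesis family and prior here are made up for illustration, not from either paper): hypothesis k agrees with all the others until time step k, so nothing observed before step k can tell you anything about it, and the rate of convergence is controlled entirely by the prior.

```python
# Hypothesis k predicts bit 1 at every time step except step k, where it
# predicts 0.  All hypotheses agree on the data until their own index is
# reached, so no finite prefix can distinguish the survivors.

def posterior(prior, data):
    """Posterior over hypotheses after seeing `data` (a list of bits)."""
    post = dict(prior)
    for t, bit in enumerate(data):
        for k in list(post):
            predicted = 0 if t == k else 1
            if predicted != bit:
                post[k] = 0.0  # hypothesis k is refuted by this observation
        total = sum(post.values())
        post = {k: p / total for k, p in post.items()}
    return post

# Prior 2^-(k+1) over hypotheses k = 0..9 (truncated for the sketch).
prior = {k: 2.0 ** -(k + 1) for k in range(10)}
data = [1, 1, 1]  # three steps of all-ones

post = posterior(prior, data)
# Hypotheses 0..2 are refuted; among the survivors the relative weights are
# unchanged -- the data so far says nothing about how they differ later.
assert post[0] == 0.0
assert post[3] > post[4] > post[5]
```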
Second, of course we can and should work on results about convergence rates; it is just that there is only so much ground you can reasonably cover in one paper. For example, the classical paper on merging of opinions by Blackwell and Dubins also doesn’t analyze convergence rates, and nevertheless it has 500+ citations. To make one concrete guess: if we consider a sequence of observations in {0,1} and look at the deviation between the probability the incomplete-model forecaster assigns to the next observation and the probability interval that follows from a given correct incomplete model, then I expect some sort of cumulative bound on these deviations, analogous to standard regret bounds in online learning. Of course the bound will include some parameter related to the “prior probability” ξ(k).
Third, the only reason the sum in equation (31) is finite is because I wanted to state Theorem 2 in greater generality, without requiring the gamblers to be bounded. However, the gamblers that I actually use (see section 5) are bounded, so for this purpose I might as well have taken an infinite sum over all gamblers.
Finally, there are two important differences between dominant forecasting and your root finding example: one is that, in the root finding example, proving the possibility of convergence is trivial whereas here it is non-trivial so the result is valuable even if the desideratum is not strong enough by itself; another is that enumerating all rational numbers is contrived in a way that the dominant forecaster construction is not, although obviously in itself this is a subjective claim and it’s probably not productive to argue about it if we disagree.
Replying to 240 (I can’t reply directly because of some quirk of the forum).
(I’m not sure that I know your name, is it Robert?)
I’m actually not interested in discussing the verbal arguments in the paper. Who reads the verbal arguments anyway? I go straight for the equations ;) The verbal arguments might or might not misrepresent the results, I don’t care either way.
I am interested in discussing the mathematical content of the LI paper and whether LI is a valuable mathematical discovery.
I agree that there is a relevant sense in which the dominance property in itself is too weak. The way I see it, LI (more generally, dominant forecasting, since I don’t care much about the application to formal logic in particular) is a generalization of Bayesian inference, and the dominance property is roughly analogous to merging of opinions (in my paper there is a more precise analogy, but it doesn’t matter). In the Bayesian case, we can also say that merging of opinions is too weak a property in itself. However:
Proving this weak property is already non-trivial, so it is a valuable and necessary step towards proving stronger properties.
Besides the property, we have an actual construction and I expect this construction to have stronger properties (even though I don’t expect it to be a practical algorithm), like Bayesian inference has stronger properties than just merging of opinions.
Now, nobody currently has a proof of stronger properties or even a formulation (I think), but IMO this is not a reason to disregard all work done so far. Rome wasn’t built in a day :)
Actually, I do know a stronger property of LI / dominant forecasting. In the notation of my paper, we have the following:
Let $\{G^k\}$ be a family of gamblers and $\xi, \zeta$ probability distributions supported on $\mathbb{N}$. Then, there exists a forecaster $F$ s.t. for any $k, m, b \in \mathbb{N}$ ($b > 0$) and $x \in O^\omega$, if the first condition holds then the second condition holds:

$$\inf_{n \le m} \Sigma V^{\min}_{G^k F,\,n}(x_{:n-1}) > -b$$

$$\sup_{n \le m} \Sigma V^{\max}_{G^k F,\,n}(x_{:n-1}) \;\le\; \frac{\sum_{j,c} \xi(j)\,\zeta(c)\,c}{\xi(k)\,\sum_{c>b} \zeta(c)}$$
The above is not hard to see from the proof of Theorem 2, if you keep track of the finite terms. Now, this is not really a bound on the time of convergence, it is something like a bound on the number of mistakes made, which is completely analogous to standard guarantees in online learning.
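For readers who don't know the online-learning guarantees being referred to: a classic example is the weighted-majority mistake bound, where the algorithm's total mistakes are bounded in terms of the best expert's mistakes plus a term depending on the prior weight (log of the number of experts). A minimal sketch, with made-up experts and data:

```python
# Weighted majority: predict by weighted vote, halve the weight of every
# expert that errs.  The classical Littlestone-Warmuth bound then limits
# total mistakes in terms of the best expert's mistakes.
import math

def weighted_majority(experts, outcomes, beta=0.5):
    """Run weighted majority; return total mistakes and per-expert mistakes."""
    weights = [1.0] * len(experts)
    mistakes = 0
    expert_mistakes = [0] * len(experts)
    for t, outcome in enumerate(outcomes):
        votes = [expert(t) for expert in experts]
        vote_1 = sum(w for w, v in zip(weights, votes) if v == 1)
        vote_0 = sum(w for w, v in zip(weights, votes) if v == 0)
        prediction = 1 if vote_1 >= vote_0 else 0
        if prediction != outcome:
            mistakes += 1
        for i, v in enumerate(votes):
            if v != outcome:
                weights[i] *= beta  # demote experts that erred
                expert_mistakes[i] += 1
    return mistakes, expert_mistakes

# Experts: constant-0, constant-1, and a period-2 predictor (made up).
experts = [lambda t: 0, lambda t: 1, lambda t: t % 2]
outcomes = [t % 2 for t in range(40)]  # the third expert is perfect

M, per_expert = weighted_majority(experts, outcomes)
best = min(per_expert)
n = len(experts)
# Classical bound: M <= (ln n + best * ln(1/beta)) / ln(2 / (1 + beta)).
bound = (math.log(n) + best * math.log(1 / 0.5)) / math.log(2 / 1.5)
assert M <= bound
```

Note that this is a cumulative bound on mistakes, not a bound on when the last mistake happens, which is exactly the flavor of guarantee described above.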
If you’re running LIA and stop enumerating traders at N, you will get the same dominance property, only for traders up to N.
The fact that VC dimension is non-decreasing for a family of nested classes is a tautology. There is nothing special about the order of the hypotheses in SRM: the order (and weighting) of hypotheses is external information that reflects your prior about the correct hypothesis. Similarly, in LIA we can order the traders by description complexity, same as in Solomonoff induction, because we expect simpler patterns to be more important than complex patterns. This is nothing but the usual Occam’s razor. Or, we can consider more specialized dominant forecasting, with gamblers and ordering selected according to domain-specific considerations.
I don’t think that stopping at N does some kind of “disproportionate” damage to the LI. For example, Theorem 4.2.1 in the LI paper requires one trader for each ϵ>0 and sequence of theorems. If this trader is in the selection, then the probabilities of those theorems will converge to 1 within ϵ. Similarly, in my paper you need the gambler ΓSMk to ensure your forecaster converges to incomplete model M within 1/k.
You can do SRM for any sequence of nested classes of finite VC dimension. For example, if you have a countable set of hypotheses $\{h_n\}$, you can take the classes to be $H_n := \{h_m\}_{m<n}$. This is just as arbitrary as in LI. The thing is, the error bound that SRM satisfies depends on the actual class in which the reference hypothesis lies. So, compared to a hypothesis in a very high class, SRM can converge very slowly (require a very large sample size). This is completely analogous to the inequality I gave before, where ξ(k) appears in the denominator of the bound, so gamblers that are “far away” in the “prior” can win a lot of bets before the forecaster catches up. SRM is useful iff you a priori expect “low” hypotheses to be good approximations. For example, suppose you want to fit a polynomial to some data but you don’t know what degree to use. SRM gives you a rule that determines the degree automatically, from the data itself. However, the reason this rule has good performance is that we expect most functions in the real world to be relatively “smooth” and therefore well approximable by a low-degree polynomial.
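The polynomial-degree example can be sketched in a few lines. The additive complexity penalty below is a stand-in for Vapnik's actual confidence term, and the data and constants are made up, but it shows the shape of the rule: minimize empirical risk plus a term that grows with the class index.

```python
# SRM-flavored degree selection: pick the degree minimizing empirical error
# plus a complexity penalty, rather than empirical error alone.  The penalty
# form is a placeholder, not Vapnik's exact VC bound.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 0.5 * x**2 - x + rng.normal(0, 0.1, size=x.shape)  # "smooth" ground truth

def srm_degree(x, y, max_degree=10, penalty_weight=0.01):
    best_deg, best_score = None, float("inf")
    for d in range(max_degree + 1):
        coeffs = np.polyfit(x, y, d)
        residual = np.mean((np.polyval(coeffs, x) - y) ** 2)  # empirical risk
        score = residual + penalty_weight * (d + 1)  # risk + complexity term
        if score < best_score:
            best_deg, best_score = d, score
    return best_deg

deg = srm_degree(x, y)
assert 1 <= deg <= 4  # a low degree wins, as the smoothness prior predicts
```

If the true function needed a degree-9 polynomial, the same rule would still converge, but only after much more data, which is the nonuniformity discussed above.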
Ordering by description complexity is perfectly computable in itself: it just means that we fix a UTM, represent traders by programs (on which we impose a time bound, otherwise it really is uncomputable), and weight each trader by 2^{-program length}. It would be interesting to find some good property this thing has. If we don’t impose a time bound (so we are uncomputable), then the Bayesian analogue is Solomonoff induction, which has the nice property that it only “weakly” depends on the choice of UTM. Will different UTMs give “similar” LIs? Off the top of my head, I have no idea! When we add the time bound it gets messier, since running time is affected by translation between UTMs, so I’m not sure how to formulate an “invariance” property even in the Bayesian case. Looking through Schmidhuber’s paper on the speed prior, I see ey do have some kind of invariance (section 4), but I’m too lazy to understand the details right now.
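The weighting scheme itself is trivial to write down. In this sketch the trader names and their "programs" are placeholder bit strings, not real traders; the only real content is the 2^{-length} rule and the fact that, for a prefix-free encoding, Kraft's inequality keeps the total weight at most 1.

```python
# Weight each trader by 2^(-program length) for a fixed (hypothetical) UTM.
# The programs below are made-up prefix-free bit strings.
programs = {
    "always_bet_up": "0",            # hypothetical 1-bit program
    "period_two": "10",              # hypothetical 2-bit program
    "parity_of_primes": "1101011",   # hypothetical 7-bit program
}

weights = {name: 2.0 ** -len(code) for name, code in programs.items()}

# Simpler traders dominate the prior: this is the Occam's-razor ordering.
assert weights["always_bet_up"] > weights["period_two"] > weights["parity_of_primes"]
assert sum(weights.values()) <= 1.0  # Kraft's inequality for prefix-free codes
```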
Thanks for pointing out that bound. I will think about it. (BTW, if at any point you don’t want to continue this back-and-forth exchange, just let me know, otherwise I will probably keep responding because I always find I have things to say)
My point regarding LIA was that the theorems in the LI paper follow from dominance over all e.c. traders, and there are countably many e.c. traders. If you stop at N, all those theorems break. Of course, you will still get something out of dominating the N traders, but you’d have to go back to the blackboard to figure out what it is because you can no longer get the answer from the paper. (And the theorems give you infinitely many properties, of which you’ll retain only finitely many, so in a sense you will lose “almost all” of the desired results)
Surely there is something special about the order of hypotheses in SRM? Vapnik’s 1998 book introduces SRM (start of Ch. 6) with a decomposition of actual risk into (1) empirical risk and (2) a term depending on VC dimension, analogous to the bias-variance decomposition of generalization error. Vapnik says SRM is for the case where the sample size is small enough that term (2) is non-negligible. So already, from the start, we are trying to solve a problem where “simpler = better” appears mathematically in an expression for the error we want to minimize. Then he introduces a rule $n(l)$ for selecting a class $S_n$ based on sample size $l$, and proves an asymptotic convergence rate (Thm. 6.2) which depends on $n(l)$ having a certain property.
It is certainly true that you could order LIA’s traders by complexity (well, maybe not computably...), and I would be interested in the results. Results from some particular “good” ordering seem like the real determinants of whether LIA-like methods would be good in practice. (Since if we do not fix an ordering we can only get results that hold even for bad/”adversarial” orderings that fill early slots with nonsensical strategies)
Okay, it looks like we are roughly on the same page :)
(Yes, my name is Robert, I go by Rob)
I think it is definitely possible that the ideas involved in the construction (e.g. making a “supertrader”) may lead to good practical algorithms. But it seems like this issue is orthogonal to the dominance issue. In other words, if you had shown me the constructions first and then the dominance results, I would not have been any more (or less) optimistic about the constructions after seeing those results than before.
It seems to me like there are two totally distinct ingredients here. First, we have a framework for specifying models and choosing which to use (like VC theory), including an idea about model averaging/fusion. Second, we have a trick involving enumerating a countably infinite set. The second part doesn’t seem relevant to the finite-time case: if I’m running LIA and you tell me that you’ve changed it so it only enumerates traders up to T = 3↑↑↑3 and then stops adding more, this will ruin all of the LI criterion guarantees, but it will not change any of the output I’ll see in my lifetime, and the method for combining traders will remain as useful (or not) as it was before.
It’s interesting to compare this to structural risk minimization, where we also have a potential infinity (nested classes of increasing VC dimension), but we are given a good ordering for them (increasing VC dimension). One could do just as well asymptotically by choosing any other order for the classes: they are countable, so you will eventually hit the best one. But in practice the VC dimension order is crucial. The arbitrary enumeration in LIA is unsatisfying to me in the same way an arbitrary ordering of the VC classes would be. (Edit: you could even say SRM succeeds because it lets you avoid counting up to infinity, by giving you an ordering under which the minimum guaranteed risk class for your sample occurs early enough that you can actually find it.)
Vadim, we seem to be talking past each other a bit.
In the linked post, I made some criticisms of some of the (informal/intuitive/motivational) reasoning that was done in the original LI paper. That paper (laudably) makes a number of explicit verbal arguments about why its definitions and results ought to be interesting and promising, and I am disagreeing with those.
In your paper, you are doing something related, but not exactly the same. It’s possible that your paper is different from the original LI paper in a way that answers my criticisms (which would be very interesting to me!). Alternately, it’s possible that my criticisms of the LI paper are not correct to begin with. But these are two different possibilities, and I can’t tell which one you’re addressing here.
The definition of an LI (=something satisfying the LI criterion) in the original paper is based on dominance, and I’ve argued that this is too weak to pick out a category of algorithms with good finite-time properties. Telling me “this thing is an LI,” in the sense used in that paper, tells me nothing about its finite-time properties, for the reasons I explained earlier.
Even granting that, it is of course still possible that some more restrictive criterion, of the form “LI criterion + (something else)”, would pick out a very promising class of algorithms.
So, are you saying
(1) I am wrong and the LI criterion is satisfying (in some way I’m not recognizing)?
(2) I’m right that the LI criterion is unsatisfying, but if we add more specifics, we get something promising? (If so, the “something else” would be very important and I would like to learn more about precisely what it is)
This seems like a total non sequitur, since this problem is if anything much more significant for forecasting than for formal logic.
Also, did you skip over the section about pointwise vs. uniform convergence? That seems to be the most important criticism; this is not establishing the kind of property that is usable to generate useful models, it’s establishing only a much less reliable one.
(N.B. I am the author of the linked post, not Tarn Fletcher.
He offered to link it because I was wary of logging in via Facebook. I’ve done so now, though.)
My central claim is that the desiderata are too weak, not that they are not pragmatically achievable. Specifically, I think that there is something fundamentally unsatisfactory about the definition of “inexploitability” (AKA “dominance” in your paper). I hadn’t seen your paper before and I haven’t looked at it in very much detail, but as far as I can tell, my central claim applies to it as well. (I.e. it isn’t specific to formal logic as opposed to prediction)
In the tumblr post, I complained that this definition was “asymptotic,” which I realize was not very clear. What I meant was that the definition captures only qualitative convergence or lack thereof, throwing away any rate information. Dominance just tells us whether the trader can or can’t make an infinite profit, not how quickly the inductor “plugs the hole” which would allow the trader to make an infinite profit if left unplugged forever.
This means that any further convergence results we get from dominance considerations across multiple traders must be pointwise with respect to the individual traders. For a uniform convergence result, we would need access to convergence rate information about each of the individual traders, so we could ensure that they have all gotten within ϵ at the same time. But the dominance statements for the traders lack this information.
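To spell out the distinction being drawn, writing $p^k_n(\epsilon)$ (as below) for "at time $n$, the hole exploited by trader $G^k$ has been plugged to within $\epsilon$":

```latex
% Pointwise over traders: the time N may depend on the trader k.
\forall k \;\, \forall \epsilon > 0 \;\, \exists N \;\, \forall n \ge N : \; p^k_n(\epsilon)

% Uniform over traders: a single N works for every trader simultaneously.
\forall \epsilon > 0 \;\, \exists N \;\, \forall n \ge N \;\, \forall k : \; p^k_n(\epsilon)
```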
Consider, for instance, procedures like TradingFirm in the LI paper or eq. 31 in your paper, where a set of traders is used to construct a single trader that is “strictly better” than every trader in the set with respect to dominance. For instance, your eq. 31 reads
$$G^\xi_n := \sum_{k=0}^{n} \xi(k)\, G^k_n$$
Although dominance carries over from each trader in the set, convergence rates and other finite-time properties do not. Concretely: at any time $n$, all but finitely many of the $G^k$ from which $G^\xi$ is inheriting dominance have not yet appeared in the sum, so they are not yet affecting the behavior of $G^\xi$.

For this reason, telling me only that an inductor dominates a set of traders does not tell me anything about the relation of that inductor to any of those traders at any given time. (After all, for any given trader and time $n$, it might be using a procedure like the one above, in which the trader is not considered until some time $N > n$.)
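The enumeration point can be made concrete in a few lines. The gamblers here are placeholders (their content doesn't matter for the point); only the time-indexed sum is doing any work:

```python
# In the mixture G^xi_n = sum_{k=0}^{n} xi(k) G^k_n, gambler G^k enters the
# sum only at times n >= k, so at any fixed time all but finitely many
# gamblers have no influence whatsoever on the mixture's behavior.
def xi(k):
    return 2.0 ** -(k + 1)  # a "prior" over gamblers

def gambler_bet(k, n):
    return 1.0  # stand-in for G^k_n; its actual content doesn't matter here

def mixture_bet(n):
    return sum(xi(k) * gambler_bet(k, n) for k in range(n + 1))

def influences(k, n):
    """Does G^k affect G^xi at time n at all?"""
    return k <= n

assert influences(3, 10) and not influences(500, 10)
# The mixture at time 10 is identical whether or not gambler 500 even exists.
assert mixture_bet(10) == sum(xi(k) * 1.0 for k in range(11))
```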
Say I’m interested in this particular set of traders because, for each trader, there is some convergence property related to it which I would like the inductor to have. (This seems quite generally true.) So for $\epsilon>0$, I’m associating with the trader $G^k$ some property $p^k_n(\epsilon)$, something like “at time $n$, the hole exploited by $G^k$ has been plugged to within tolerance $\epsilon$.” What I want to conclude here is that, at time $n$, some of these properties hold. But suppose the inductor dominates the set via a procedure like the one above. Then whenever I ask about some $p^k_n(\epsilon)$, I have no way of knowing whether $n$ is high enough for $G^k$ to have been included in the sum yet, according to the enumeration used by the procedure. Thus I can’t conclude any of the facts $p^k_n(\epsilon)$.
In short, this framework just isn’t built for uniform convergence. If we want to get any results about how the inductor is doing “in general” at any given time, we need information that the dominance concept throws away.
Finally, about abstract asymptotic results leading to efficient practical algorithms: yes, this happens, but it’s important to think about what information beyond mere convergence is necessary for it to happen. Consider root-finding for a differentiable function $f : \mathbb{R} \to \mathbb{R}$. Here’s one method that converges (given some conditions): Newton’s method. Here’s another: enumerate the rational numbers in an arbitrary order and evaluate $f$ at one rational number per timestep. (You can approximate the root arbitrarily well with rationals, the function is continuous, blah blah.) Even though these are both convergent, there’s obviously a big difference; the former is actually converging to the result in the intuitive sense of that phrase, while the latter is just trolling you by satisfying your technical criteria but not the intuitions behind them. (Cf. the enumeration-based trader constructions.)
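The contrast is easy to see numerically. A sketch for $f(x) = x^2 - 2$ (the particular enumeration of the rationals below is one arbitrary choice among many, which is exactly the point):

```python
# Newton's method vs. "enumerate the rationals" for the root of x^2 - 2.
# Both converge in principle; only one converges in the intuitive sense.
from fractions import Fraction

def f(x):
    return x * x - 2

def newton(x0, steps):
    x = x0
    for _ in range(steps):
        x = x - f(x) / (2 * x)  # f'(x) = 2x
    return x

def rationals():
    """Enumerate positive rationals p/q in a fixed (arbitrary) order."""
    n = 1
    while True:
        for p in range(1, n + 1):
            yield Fraction(p, n + 1 - p)
        n += 1

def enumerate_search(evaluations):
    """Best rational seen after a budget of function evaluations."""
    best, best_err = None, float("inf")
    gen = rationals()
    for _ in range(evaluations):
        q = next(gen)
        err = abs(f(float(q)))
        if err < best_err:
            best, best_err = q, err
    return best, best_err

x = newton(2.0, 6)
assert abs(f(x)) < 1e-10  # six Newton steps: essentially exact

_, err = enumerate_search(10_000)
assert err > 1e-6  # ten thousand evaluations later, still far from the root
```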
Clarification: I’m not the author of the linked post, I’m just linking it here on behalf of the author (who objects to signing up for things with Facebook). I found the most interesting part to be the observation about pointwise convergence: because the LI framework proves its results pointwise, it cannot guarantee that a result proved about $P_\infty$ will obtain by any given finite time $n$; see also his example about preemptive learning, where for an infinite set of e.c. sequences, there may be no finite $n$ that works for all of the set.
I think the asymptotic nature of LI is much more worrying than the things you compare it to. In order for an asymptotic result to be encouraging for practical applications, we need it to be an asymptotic result such that in most cases of interest when it holds, you tend to also have good finite results. If you give a O(n12) algorithm for some problem, this is an encouraging sign that maybe you can develop a practical algorithm. But if a problem is known not to be in P, and then you give an exponential-time algorithm for it, and argue that it can probably be solved efficiently in practice because you’ve established an asymptotic bound, then no one will take this claim very seriously. It seems to me that this is roughly the position logical induction is in, but lifted from efficiency all the way up to computability.
The probabilities given by logical induction converge to coherent probabilities, but there is no computable function that will tell you how long you have to run logical induction to know the eventual probability of a certain sentence within a certain precision. We know logical induction doesn’t do this because it cannot be done; if it could be, then there would be a computable set containing all provable sentences and no disprovable sentences (since every sentence would eventually be shown either not to have probability 0 or not to have probability 1), but there is no such computable set. So we know for sure that in the naive sense, there is no way, even in principle, to run logical induction long enough that you have probabilities you can trust to be reasonably accurate. Maybe a better complexity theory analogy would be if you give a PP algorithm for a problem known not to be in BPP (although there would need to be exciting advancements in complexity theory before that could happen); this gives good reasons to believe that improvements to the algorithm would make it practical to run, and if you run it enough times the majority vote will probably be correct, but it will never be practical to run it enough times for that to happen. Likewise, efficient approximations to logical induction should be able to complete each step in a reasonable amount of time, but will not be able to complete enough steps to give you accurate probabilities.
In order to take the probabilities given by logical induction at very large finite steps seriously, you would need effective asymptotic results, and since this cannot be done for the probabilities converging to coherent probabilities, there would need to be subtler senses in which the probabilities given at finite times can be taken seriously even if they are not close to the limiting probabilities. Now, the logical induction paper does give interesting subtler senses in which the probabilities given by logical induction can be taken seriously before they approach their limiting values, but the fundamental problem is that all of those results say that some pattern will eventually hold, for the same sense of “eventually” in which the probabilities eventually converge to coherent probabilities, so this does not give strong evidence that those desirable patterns can be made to emerge in any reasonable amount of time.
Alex, the difference between your point of view and my own, is that I don’t care that much about the applications of LI to formal logic. I’m much more interested in its applications to forecasting. If Fletcher is worried about the same issues as you, IMO it is not very clear from ’eir essay. On the other hand, the original paper doesn’t talk much about applications outside of formal logic, and my countercriticism might be slightly unfair since it’s grounded in a perspective that Fletcher doesn’t know about.
Replying to “240”
First, yes, the convergence property is not uniform, but it neither should nor can be uniform. This is completely analogous to usual Bayesian inference where the speed of convergence depends on the probability of the correct hypothesis within the prior, or to the Structural Risk Minimization principle in PAC learning. This is unavoidable: if two models only start to differ at a very late point in time, there no way to distinguish between them before this point. In particular, human brains have the same limitation: it would take you more time to notice a complex pattern than a simple pattern.
Second, of course we can and should work on results about convergence rates, it is just that there is only that much ground you can reasonably cover in one paper. For example, the classical paper on merging of opinions by Blackwell and Dubins also doesn’t analyze convergence rates, and nevertheless it has 500+ citations. To make one concrete guess, if we consider a sequence of observations in {0,1} and look at the deviation between the probability the incomplete model forecaster assigns to the next observation and the probability interval that follows from a given correct incomplete model, then I expect to have some sort of cumulative bound on these deviations analogous to standard regret bounds in online learning. Of course the bound will include some parameter related to the “prior probability” ξ(k).
Third, the only reason the sum in equation (31) is finite is because I wanted to state Theorem 2 in greater generality, without requiring the gamblers to be bounded. However, the gamblers that I actually use (see section 5) are bounded, so for this purpose I might as well have taken an infinite sum over all gamblers.
Finally, there are two important differences between dominant forecasting and your root finding example: one is that, in the root finding example, proving the possibility of convergence is trivial whereas here it is non-trivial so the result is valuable even if the desideratum is not strong enough by itself; another is that enumerating all rational numbers is contrived in a way that the dominant forecaster construction is not, although obviously in itself this is a subjective claim and it’s probably not productive to argue about it if we disagree.
Replying to 240 (I can’t reply directly because of some quirk of the forum).
(I’m not sure that I know your name, is it Robert?)
I’m actually not interested in discussing the verbal arguments in the paper. Who reads the verbal arguments anyway? I go straight for the equations ;) The verbal arguments might or might not misrepresent the results, I don’t care either way.
I am interested in discussing the mathematical content of the LI paper and whether LI is a valuable mathematical discovery.
I agree that there is a relevant sense in which the dominance property in itself is too weak. The way I see it, LI (more generally, dominant forecasting, since I don’t care much about the application to formal logic in particular) is a generalization of Bayesian inference, and the dominance property is roughly analogous to merging of opinions (in my paper there is a more precise analogy, but it doesn’t matter). In the Bayesian case, we can also say that merging of opinions is a too weak property in itself. However:
Proving this weak property is already non-trivial, so it is a valuable and necessary step towards proving stronger properties.
Besides the property, we have an actual construction and I expect this construction to have stronger properties (even though I don’t expect it to be a practical algorithm), like Bayesian inference has stronger properties than just merging of opinions.
Now, nobody currently has a proof of stronger properties or even a formulation (I think), but IMO this is not a reason to disregard all work done so far. Rome wasn’t built in a day :)
Replying to Rob.
Actually, I do know a stronger property of LI / dominant forecasting. In the notation of my paper, we have the following:
Let {Gk} be a family of gamblers, ξ,ζ probability distributions supported on N. Then, there exists a forecaster F s.t. for any k,m,b∈N (b>0) and x∈Oω, if the first condition holds then the second condition holds:
infn≤mΣVminGkFn(x:n−1)>−b
supn≤mΣVmaxGkFn(x:n−1)≤∑j,cξ(j)ζ(c)cξ(k)∑c>bζ(c)
The above is not hard to see from the proof of Theorem 2, if you keep track of the finite terms. Now, this is not really a bound on the time of convergence, it is something like a bound on the number of mistakes made, which is completely analogous to standard guarantees in online learning.
If you’re running LIA and stop enumerating traders at N, you will get the same dominance property, only for traders up to N.
The fact that VC dimension is non-decreasing for a family of nested classes is a tautology. There is nothing special about the order of the hypotheses in SRM: the order (and weighting) of hypotheses is external information that reflects your prior about the correct hypothesis. Similarly, in LIA we can order the traders by description complexity, same as in Solomonoff induction, because we expect simpler patterns to be more important than complex patterns. This is nothing but the usual Occam’s razor. Or, we can consider more specialized dominant forecasting, with gamblers and ordering selected according to domain-specific considerations.
Replying to Rob.
I don’t think that stopping at N does some kind of “disproportionate” damage to the LI. For example, Theorem 4.2.1 in the LI paper requires one trader per each ϵ>0 and sequence of theorems. If this trader is in the selection, then the probabilities of the theorems will converge to 1 within ϵ. Similarly, in my paper you need the gambler ΓSMk to ensure your forecaster converges to incomplete model M within 1/k.
You can do SRM for any sequence of nested classes of finite VC dimension. For example, if you have a countable set of hypotheses {hn}, you can take classes to be Hn:={hm}m<n. This is just as arbitrary as in LI. The thing is, the error bound that SRM satisfies depends on the actual class in which the reference hypothesis lies. So, compared to a hypothesis in a very high class, SRM can converge very slowly (require very large sample size). This is completely analogous to the inequality I gave before, where ξ(k) appears in the denominator of the bound, so gamblers that are “far away” in the “prior” can win a lot of bets before the forecaster catches up. SRM is useful iff you a priori expect “low” hypotheses to be good approximations. For example, suppose you want to fit a polynomial to some data but you don’t know what degree to use. SRM gives you a rule that determines the degree automatically, from the data itself. However, the reason this rule has good performance is because we expect most functions in the real world to be relatively “smooth” and therefore well-approximable by a low degree polynomial.
Ordering by description complexity is perfectly computable in itself: it just means that we fix a UTM, represent traders by programs (on which we impose a time bound, otherwise it really is uncomputable), and weight each trader by 2^{-program length}. It would be interesting to find some good property this thing has. If we don’t impose a time bound (so we are uncomputable), then the Bayesian analogue is Solomonoff induction, which has the nice property that it only “weakly” depends on the choice of UTM. Will different UTMs give “similar” LIs? Off the top of my head, I have no idea! When we add the time bound it gets messier, since time is affected by translation between UTMs, so I’m not sure how to formulate an “invariance” property even in the Bayesian case. Looking through Schmidhuber’s paper on the speed prior, I see ey do have some kind of invariance (section 4), but I’m too lazy to understand the details right now.
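To make the weighting concrete, here is a minimal sketch (the bitstring “programs” are placeholders I made up, not traders from either paper; a real version would enumerate time-bounded programs for a fixed UTM):

```python
# Sketch: weight each trader-program by 2^(-length), a description-
# length prior. The programs here are placeholder bitstrings chosen
# to form a prefix-free set.

def complexity_weight(program: str) -> float:
    """Weight 2^(-len) for a program given as a bitstring."""
    return 2.0 ** (-len(program))

programs = ["0", "10", "110", "1110"]   # prefix-free, for illustration
weights = {p: complexity_weight(p) for p in programs}

# For a prefix-free set, the Kraft inequality guarantees the weights
# sum to at most 1, so they can serve directly as a (sub)prior.
total = sum(weights.values())
assert total <= 1.0
```

The same dictionary-of-weights shape works for any enumeration of traders; only the choice of encoding (and hence of lengths) changes.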
Thanks for pointing out that bound. I will think about it. (BTW, if at any point you don’t want to continue this back-and-forth exchange, just let me know, otherwise I will probably keep responding because I always find I have things to say)
My point regarding LIA was that the theorems in the LI paper follow from dominance over all e.c. traders, and there are countably many e.c. traders. If you stop at N, all those theorems break. Of course, you will still get something out of dominating the N traders, but you’d have to go back to the blackboard to figure out what it is because you can no longer get the answer from the paper. (And the theorems give you infinitely many properties, of which you’ll retain only finitely many, so in a sense you will lose “almost all” of the desired results)
Surely there is something special about the order of hypotheses in SRM? Vapnik’s 1998 book introduces SRM (start of Ch. 6) with a decomposition of actual risk into (1) empirical risk and (2) a term depending on VC dimension, analogous to the bias-variance decomposition of generalization error. Vapnik says SRM is for the case where sample size is small enough that term (2) is non-negligible. So already, from the start, we are trying to solve a problem where “simpler = better” appears mathematically in an expression for the error we want to minimize. Then he introduces a rule n(l) for selecting a class Sn based on sample size l, and proves an asymptotic convergence rate (Thm. 6.2) which depends on n(l) having a certain property.
It is certainly true that you could order LIA’s traders by complexity (well, maybe not computably...), and I would be interested in the results. Results from some particular “good” ordering seem like the real determinants of whether LIA-like methods would be good in practice. (Since if we do not fix an ordering we can only get results that hold even for bad/”adversarial” orderings that fill early slots with nonsensical strategies)
Okay, it looks like we are roughly on the same page :)
(Yes, my name is Robert, I go by Rob)
I think it is definitely possible that the ideas involved in the construction (e.g. making a “supertrader”) may lead to good practical algorithms. But it seems like this issue is orthogonal to the dominance issue. In other words, if you had shown me the constructions first and then the dominance results, I would not have been any more (or less) optimistic about the constructions after seeing those results than before.
It seems to me like there are two totally distinct ingredients here. First, we have a framework for specifying models and choosing which to use (like VC theory), including an idea about model averaging/fusion. Second, we have a trick involving enumerating a countably infinite set. The second part doesn’t seem relevant to the finite time case: if I’m running LIA and you tell me that you’ve changed it so it only enumerates traders up to T_{3↑↑↑3} and then stops adding more, this will ruin all of the LI criterion guarantees, but it will not change any of the output I’ll see in my lifetime, and the method for combining traders will remain as useful (or not) as it was before.
It’s interesting to compare this to structural risk minimization, where we also have a potential infinity (nested classes of increasing VC dimension), but we are given a good ordering for them (increasing VC dimension). One could do just as well asymptotically by choosing any other order for the classes: they are countable, so you will eventually hit the best one. But in practice the VC dimension order is crucial. The arbitrary enumeration in LIA is unsatisfying to me in the same way an arbitrary ordering of the VC classes would be. (Edit: you could even say SRM succeeds because it lets you avoid counting up to infinity, by giving you an ordering under which the minimum guaranteed risk class for your sample occurs early enough that you can actually find it.)
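For illustration, here is a schematic version of the SRM selection rule (the penalty follows the general shape of Vapnik’s VC confidence term, but the constants and the example risks are invented):

```python
import math

# Schematic SRM: pick the nested class whose (empirical risk +
# VC confidence term) is smallest. The penalty below mimics the
# shape of Vapnik's bound; constants are illustrative only.

def vc_penalty(h: int, l: int, eta: float = 0.05) -> float:
    """Confidence term for VC dimension h, sample size l."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

def srm_select(empirical_risks, vc_dims, l):
    """Index of the class minimizing the guaranteed-risk bound."""
    scores = [r + vc_penalty(h, l) for r, h in zip(empirical_risks, vc_dims)]
    return min(range(len(scores)), key=scores.__getitem__)

# Nested polynomial classes of degree 0..4: empirical risk drops
# as degree grows, but the penalty grows with VC dimension, so a
# low class wins at this sample size.
risks = [0.30, 0.12, 0.10, 0.09, 0.09]
dims = [1, 2, 3, 4, 5]
best = srm_select(risks, dims, l=100)
```

The crucial point for this discussion: the rule only works because the classes are ordered so that “early” classes are plausibly good, which is exactly the kind of information an arbitrary enumeration of traders throws away.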
Vadim, we seem to be talking past each other a bit.
In the linked post, I made some criticisms of some of the (informal/intuitive/motivational) reasoning that was done in the original LI paper. That paper (laudably) makes a number of explicit verbal arguments about why its definitions and results ought to be interesting and promising, and I am disagreeing with those.
In your paper, you are doing something related, but not exactly the same. It’s possible that your paper is different from the original LI paper in a way that answers my criticisms (which would be very interesting to me!). Alternately, it’s possible that my criticisms of the LI paper are not correct to begin with. But these are two different possibilities, and I can’t tell which one you’re addressing here.
The definition of an LI (=something satisfying the LI criterion) in the original paper is based on dominance, and I’ve argued that this is too weak to pick out a category of algorithms with good finite-time properties. Telling me “this thing is an LI,” in the sense used in that paper, tells me nothing about its finite-time properties, for the reasons I explained earlier.
Even granting that, it is of course still possible that some more restrictive criterion, of the form “LI criterion + (something else)”, would pick out a very promising class of algorithms.
So, are you saying
(1) I am wrong and the LI criterion is satisfying (in some way I’m not recognizing)?
(2) I’m right that the LI criterion is unsatisfying, but if we add more specifics, we get something promising? (If so, the “something else” would be very important and I would like to learn more about precisely what it is)
(3) Neither of the above?
This seems like a total non sequitur, since this problem is if anything much more significant for forecasting than for formal logic.
Also, did you skip over the section about pointwise vs. uniform convergence? That seems to be the most important criticism; this is not establishing the kind of property that is usable to generate useful models, it’s establishing only a much less reliable one.
(N.B. I am the author of the linked post, not Tarn Fletcher. He offered to link it because I was wary of logging in via Facebook. I’ve done so now, though.)
My central claim is that the desiderata are too weak, not that they are not pragmatically achievable. Specifically, I think that there is something fundamentally unsatisfactory about the definition of “inexploitability” (AKA “dominance” in your paper). I hadn’t seen your paper before and I haven’t looked at it in very much detail, but as far as I can tell, my central claim applies to it as well. (I.e. it isn’t specific to formal logic as opposed to prediction)
In the tumblr post, I complained that this definition was “asymptotic,” which I realize was not very clear. What I meant was that the definition captures only qualitative convergence or lack thereof, throwing away any rate information. Dominance just tells us whether the trader can or can’t make an infinite profit, not how quickly the inductor “plugs the hole” which would allow the trader to make an infinite profit if left unplugged forever.
This means that any further convergence results we get from dominance considerations across multiple traders must be pointwise with respect to the individual traders. For a uniform convergence result, we would need access to convergence rate information about each of the individual traders, so we could ensure that they have all gotten within ϵ at the same time. But the dominance statements for the traders lack this information.
Consider, for instance, procedures like TradingFirm in the LI paper or eq. 31 in your paper, where a set of traders is used to construct a single trader that is “strictly better” than every trader in the set with respect to dominance. For instance, your eq. 31 reads
G^ξ_n := ∑_{k=0}^{n} ξ(k) G^k_n
Although dominance carries over from each trader in the set, convergence rates and other finite-time properties do not. Concretely: at any time n, all but finitely many of the G^k from which G^ξ is inheriting dominance have not yet appeared in the sum, so they are not yet affecting the behavior of G^ξ.
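A toy sketch of why (the weights and bets are invented, not taken from either paper): in a mixture of this shape, a trader with index k > n has literally no influence on the combined gambler at time n.

```python
# Toy eq.-31-style mixture: the combined bet at time n sums only
# traders with index k <= n, weighted by xi(k).

def mixture_bet(trader_bets, xi, n):
    """Combined bet at time n: sum of xi(k) * (trader k's bet at n)."""
    return sum(xi(k) * trader_bets[k](n) for k in range(n + 1))

xi = lambda k: 2.0 ** (-(k + 1))          # a summable weighting
traders = [lambda n: 1.0 for _ in range(100)]

# Replace trader 50 with something wildly different...
traders2 = list(traders)
traders2[50] = lambda n: -1000.0

# ...and the mixture at time 10 does not change at all: trader 50
# has not yet entered the sum.
assert mixture_bet(traders2, xi, 10) == mixture_bet(traders, xi, 10)
```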
For this reason, telling me only that an inductor dominates a set of traders does not tell me anything about the relation of that inductor to any of those traders at any given time. (After all, for any given trader and time n, it might be using a procedure like the above in which the trader is not considered until some time N>n.)
Say I’m interested in this particular set of traders because, for each trader, there is some convergence property related to it which I would like the inductor to have. (This seems quite generally true.) So for ϵ>0, I’m associating with the trader G^k some property p^k_n(ϵ), something like “at time n, the hole exploited by G^k has been plugged to within tolerance ϵ.” What I want to conclude here is that, at time n, some of these properties hold. But suppose the inductor dominates the set via a procedure like the one above. Then whenever I ask about some p^k_n(ϵ), I have no way of knowing whether n is high enough for G^k to have been included in the sum yet, according to the enumeration used by the procedure. Thus I can’t conclude any of the facts p^k_n(ϵ).
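A toy numeric illustration of the pointwise/uniform gap (the error model here is invented purely for illustration): each trader’s hole does get plugged eventually, but at every finite time some trader’s hole is still wide open.

```python
# Invented error model: the inductor only starts addressing trader k
# at step k (as in the enumeration constructions), then plugs its
# hole at rate 1/t afterwards.

def err(k: int, n: int) -> float:
    """Residual exploitability of trader k at time n (toy numbers)."""
    if n < k:                      # trader k not yet in the sum
        return 1.0
    return 1.0 / (n - k + 1)       # plugged gradually afterwards

# Pointwise: for each fixed trader k, err(k, n) -> 0 as n grows.
assert err(3, 1000) < 0.01

# But not uniformly: at every time n there is a trader (k = n + 1)
# whose hole is still completely unplugged.
assert max(err(k, 100) for k in range(1, 200)) == 1.0
```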
In short, this framework just isn’t built for uniform convergence. If we want to get any results about how the inductor is doing “in general” at any given time, we need information that the dominance concept throws away.
Finally, about abstract asymptotic results leading to efficient practical algorithms—yes, this happens, but it’s important to think about what information beyond mere convergence is necessary for it to happen. Consider root-finding for a differentiable function f : R → R. Here’s one method that converges (given some conditions): Newton’s method. Here’s another: enumerate the rational numbers in an arbitrary order and evaluate f at one rational number per timestep. (You can approximate the root arbitrarily well with rationals, the function is continuous, blah blah.) Even though these are both convergent, there’s obviously a big difference; the former is actually converging to the result in the intuitive sense of that phrase, while the latter is just trolling you by satisfying your technical criteria but not the intuitions behind them. (Cf. the enumeration-based trader constructions.)
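Here is the contrast in code (the enumeration order and step counts are arbitrary choices for illustration):

```python
# Two "convergent" root-finders for f(x) = x^2 - 2, whose positive
# root is sqrt(2). Both converge in the limit; only one is usable.

def newton(f, df, x0, steps):
    """Newton's method: quadratic convergence near a simple root."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / df(x)
    return x

def enumerate_rationals():
    """All positive rationals p/q, walked along diagonals p + q = s."""
    s = 2
    while True:
        for p in range(1, s):
            yield p / (s - p)
        s += 1

f = lambda x: x * x - 2.0
df = lambda x: 2.0 * x

# Newton: six steps from x0 = 1 already give near machine precision.
x_newton = newton(f, df, x0=1.0, steps=6)

# Enumeration: after 1000 candidates, the best one seen is still
# off by more than 10^-3 -- technically converging, practically not.
gen = enumerate_rationals()
candidates = [next(gen) for _ in range(1000)]
best = min(candidates, key=lambda x: abs(f(x)))
```

Both methods satisfy “converges to the root,” but only Newton’s method comes with rate information, and that rate information is the entire difference between them in practice.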
Clarification: I’m not the author of the linked post, I’m just linking it here on behalf of the author (who objects to signing up for things with Facebook). I found the most interesting part to be the observation about pointwise convergence—because the LI framework proves its results pointwise, it cannot guarantee that each result proved about P_∞ will obtain at any given finite time n; see also his example about preemptive learning, where, for an infinite set of e.c. sequences, there may be no finite n that works for all of the set.