Kolmogorov complexity also depends on the description language used to create strings.
The whole point of the KC is that it doesn’t depend on the specific programming language (or Turing machine) selected except up to an additive constant. If your language contains a recursive feature, it will assign lower complexity to Fibonacci numbers than my non-recursive language does, but the difference is at most the codelength required for me to simulate your machine on my machine.
I wrote this some years back. It is probably not without its flaws (and got some criticism on Less Wrong), but the general idea may be relevant: http://www.paul-almond.com/WhatIsALowLevelLanguage.htm
AFAICT, the ‘proposed solution’ from WiaLLL doesn’t work. The rest of this post will be devoted to proposing a specific attack on it. As such it will be longish—since WiaLLL makes a strong claim about a fundamental aspect of epistemology, it deserves a considered refutation.
My attack works on the assumption that current sets count programs, not languages—i.e. two different programs that implement the same language will result in two entries in the next current set (regardless of whether they are written in the same language). Paul Almond has told me that this is what he intended in WiaLLL.
Consider a family of Turing-complete languages, whose members are called Xn (n ∈ N). I will use ‘X-like’ to refer to any language from this family. Let X := X8.
Programs for any language Xn can be divided into three categories (a small classifier sketch follows the list):
1. Programs that begin with n zero bits. This prefix is solely used to mark the program as belonging to category 1; it encodes no other data. The rest of the program source encodes a program in some suitable Turing-complete language (the details of which don’t matter for this attack).
2. Programs that have a length of 0 (i.e. the empty string). All such programs are defined to implement X.
3. Programs that fit neither category 1 nor category 2. All such programs are defined to implement X(n+1).
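To make the three cases concrete, here is a minimal sketch of how they partition bit strings (the function name, the string representation of programs, and the return labels are illustrative choices of mine, not anything taken from WiaLLL):

```python
def classify(program_bits: str, n: int) -> str:
    """Classify a program of the hypothetical language Xn (given as a string of
    '0'/'1' characters) into the three categories described above."""
    if program_bits == "":
        return "category 2: implements X"
    if program_bits.startswith("0" * n):
        return "category 1: the remaining bits encode an arbitrary program"
    return "category 3: implements X(n+1)"

# For n = 8, only about 1 in 2**8 = 256 sufficiently long programs lands in
# category 1, which is where the 2^-i terms in the bound below come from.
print(classify("0" * 8 + "10110", 8))
```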
It follows that the set of programs of any bounded length (>0) writable in any Xn (n>0) is dominated by interpreters for X-like languages with a greater n than their own. While all such languages are Turing-complete, for any given program length the proportion of programs writable in them that implement non-X-likes is small, and shrinks exponentially in n.
What happens when we run the proposed solution from WiaLLL on this X? Assuming sufficiently high values of N, on each step but the first the current set will consist of a mix of X-likes and other languages. For the purpose of calculating a lower bound on the long-term proportion of X-likes, assume that programs for non-X-likes never implement an X-like. The X-likes will then tend to thin out, since some of their programs fall into category 1 and implement non-X-likes; however, this process can produce a fraction of non-X-likes no larger than \sum_{i=8}^\infty 2^{-i} = 2^{-7} = 0.0078125.
The X-likes will also thin themselves out through programs in category 2, since X itself will produce a fixed fraction of non-X-likes. How much they do so depends on the exact growth of N and I in the limit. Given a function I(N) which maps values for program length to the number of iterations to run with this length, one can calculate a rough upper bound on the proportion of non-X-likes produced in this manner as 1-\prod_{N=N0}^\infty (1-I(N)/2^N).
The detailed results of this process therefore depend on the choice of I(N) and of N0. The general relationship is that unless I(N) grows quite quickly in N, the X-likes will only be notably thinned out for small N, and therefore will not be thinned out significantly given a sufficiently big N0. For example, for I(N)=16 and N0=16 (16 iterations per maximum program length, with the maximum program length starting at 16 bits), the result is about 0.00049.
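For anyone who wants to check the arithmetic, here is a small numeric sanity check of the two bounds quoted above (the helper name and the truncation of the infinite product and sum are my own choices; the omitted tails are negligible in floating point):

```python
def category2_bound(I, N0, terms=200):
    """Evaluate 1 - prod_{N=N0,...} (1 - I(N)/2^N), truncated after `terms` factors."""
    prod = 1.0
    for N in range(N0, N0 + terms):
        prod *= 1.0 - I(N) / 2.0 ** N
    return 1.0 - prod

print(category2_bound(lambda N: 16, 16))      # ~0.000488, the "about 0.00049" figure
print(sum(2.0 ** -i for i in range(8, 80)))   # 0.0078125, the category-1 bound
```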
Therefore, with free choice of the starting language and some reasonable assumptions about the limiting process used for I and N, we can ensure that the language mix in the final current set will be dominated by X-likes.
Why does all of this matter? Because any X-like especially likes to implement X (since category 2 programs have the minimum length of 0 bits), and for sufficiently large n really doesn’t like to implement any non-X-like; as such, any current set containing a sufficient proportion of X-likes will rate X as lower-level than any other language. This argument might in theory fail if other languages would on average penalize X-likes much more than X-likes penalize them, but if so this can be worked around by choosing X = Xn for some n larger than 8: the family can be redefined so that the initial n is any integer we need it to be to overcome such finite factors.
Therefore, unless I’m missing something critical, the output of the procedure proposed in WiaLLL would be highly sensitive to the starting language as well as I(N) and N0. It therefore fails in its attempt to define a simplicity measure on languages in the absence of additional assumptions not stated in the article.
(Part 2) But hang on—I can see an easy way to improve on Paul’s definition:
He looks at all implementations of languages of length less than or equal to N, takes a simple arithmetic mean and then lets N go to infinity. But how about we take a weighted arithmetic mean, where the weight of a given implementation is 1/2^(its length), and then normalise if necessary so that the weights sum to 1? (This does away with the need for N and for taking a limit as N goes to infinity.) Then in your example above, the fact that there is always an implementation of X whose length is 0 means that a fixed proportion of the weight previously assigned to X-like languages ‘bleeds away’ on every iteration.
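A minimal sketch of the weighted mean being proposed, with a made-up data representation (each implementation is reduced to a pair of its length in bits and whether it implements the target language; nothing here comes from Paul’s article):

```python
def weighted_support(implementations):
    """implementations: iterable of (length_in_bits, implements_target) pairs.
    Returns the 2^-length weighted, normalised fraction that implement the target."""
    total = sum(2.0 ** -length for length, _ in implementations)
    hits = sum(2.0 ** -length for length, ok in implementations if ok)
    return hits / total

# The single zero-length implementation carries as much weight as both
# length-1 implementations combined, which is the point made above.
print(weighted_support([(0, True), (1, False), (1, False), (5, True)]))
```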
I think we can use a “pathological” language to make X[0] particularly hard to implement (on average) rather than particularly easy, so it would still be false that “low-level index for Y” measured from X must be the same as measured from Y.
But perhaps we can define the “low-level index” of a language Y as the infimum of all low-level indices as measured from languages X?
(Needless to say, the trick of weighting the average by 1/2^length is very standard.)
Let’s see if I understand this correctly. You’re redefining current sets to be a mapping from the set of all computable and Turing-complete languages to their weights in [0,1]. Except for the first step of the process, you will actually have non-zero weights on each of those languages, so we’re working with true countably infinite sets here. This might require a stronger hypercomputer to implement than WiaLLL, but I suppose that’s fine for the purely theoretical purposes of these constructions.
Weighting each program by 1/2^i probably doesn’t work, since that still results in a total measure of 1 for each set of all programs with a given length n ∈ N (since there are 2^n programs in each of those sets), but that’s a minor detail—weighting by e.g. 1/4^i should get you a workable result.
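Spelling out the measure issue, assuming programs are arbitrary bit strings (so there are 2^n of them at each length n): \sum_p 2^{-|p|} = \sum_{n \ge 0} 2^n \cdot 2^{-n} = \infty, whereas \sum_p 4^{-|p|} = \sum_{n \ge 0} 2^n \cdot 4^{-n} = \sum_{n \ge 0} 2^{-n} = 2, which is finite and can be normalised.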
I think this is still attackable, insofar as it’s possible to produce a setup that will make X win over any non-X-like. It would lose to certain other X-likes in each round, but this does not hold in the (I → ∞) limit.
It’s probably not necessary for me to write up the modified attack in full detail; the basic idea is to redefine the categories of Xn programs as follows:
1. Programs that begin with 2n zero bits. This prefix is solely used to mark the program as belonging to category 1; it encodes no other data. The rest of the program source encodes a program in some suitable Turing-complete language (the details of which don’t matter for this attack).
2. Programs whose initial run of zero bits has length j ∈ [n, 2n). All such programs are defined to implement X.
3. Programs that fit neither category 1 nor category 2. All such programs are defined to implement X(n+1).
So the support of Xn for X approaches 0 in the (n → ∞) limit, which allows the X-like family to keep a large weight through arbitrarily many iterations, and even in the limit of (I → ∞). However, it approaches 0 much more slowly than Xn’s support for any non-X-like, which means that X still wins over every non-X-like at each step of the simplicity computation.
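To put rough numbers on that (taking category 2 to be the programs whose initial zero-run length lies in [n, 2n), as described above): among long Xn programs, the fraction in category 1 is 2^{-2n}, the fraction in category 2 is \sum_{j=n}^{2n-1} 2^{-(j+1)} = 2^{-n} - 2^{-2n}, and the fraction in category 3 is 1 - 2^{-n}. So Xn still gives X roughly 2^n times as much weight as it gives all non-X-likes combined, even though both fractions shrink as n grows.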
Also note that the choice of a prefix length linear in the X-like level is fairly arbitrary. I think this is sufficient to get the desired effect, but if this turns out to be incorrect we can of course use any computable function here, meaning we can adjust this attack to make the fraction of non-X-likes produced fall much faster.
I don’t understand this suggestion. If I understood your previous modifications correctly, all but the first current sets of your process assign non-zero measure to every computable and Turing-complete language, so taking an infimum over the languages in this set wouldn’t get you any useful information. As for X-likes, if you are going to treat those separately from other languages in the simplicity determination process, you’d need to specify some precise way of recognizing them. The specific X families I’ve talked about are of course only examples; there are countably many more families that exhibit similar properties. Could you elaborate on what you meant?
So essentially, the problem with Paul Almond’s suggestion is that one can find languages X that are ‘X[0]-pathological’ in the sense that almost all of their children are both (i) good at implementing some fixed language X[0] and (ii) even more X[0]-pathological than they are (this being a recursive definition of ‘pathological’).
Such languages destroy any hope that Paul’s “low-level index for Y” won’t depend on which language X we start from.
Kolmogorov complexity isn’t an option, and approximations of Kolmogorov complexity don’t have that uniqueness property.
Even if we did have that uniqueness property, it’s still not all that helpful if you’re trying to decide among a finite number of finite-complexity hypotheses. Choice of description language is still sufficient to completely determine your choice of which hypothesis is simplest.
As far as I can tell, a MDL-ish complexity measurement that is “unique up to a constant” acts like a prior—in the limit of infinite evidence, the initial bias will wash out. Kelly is fond of analogizing this “justification” of Occam to a flat tire—it helps you get home because you can eventually fix it.
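One way to make the ‘washes out’ claim concrete (this is just the standard decomposition of posterior odds, not anything specific to Kelly): \log [P(H_1|d_{1:n}) / P(H_2|d_{1:n})] = \log [P(H_1)/P(H_2)] + \sum_{i=1}^n \log [P(d_i|H_1, d_{<i}) / P(d_i|H_2, d_{<i})]. The first term is fixed, and with a description-length prior it is bounded by the language-dependent additive constant; the second term typically grows without bound as evidence accumulates, so the initial bias eventually stops mattering.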
Right, but we can find a series of increasingly tight upper bounds to the KC, and the bounds are language-independent in the same sense that the KC itself is.
Then for any sufficiently large dataset, we can say that the probability of the data is given by the model that produces the tightest upper bound for the KC. This is nicely rigorous—if you think my model is unjustified and doesn’t correspond to the “real” probabilities, all you have to do is produce a new model that achieves a tighter KC-bound.
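A toy sketch of that selection rule, using off-the-shelf compressors as stand-in ‘models’ and placeholder constants for the cost of naming each decoder (this is only meant to show the shape of the procedure, not the real thing):

```python
import bz2, zlib

def kc_upper_bounds(data: bytes) -> dict:
    """Crude upper bounds on K(data), in bits: a placeholder cost for naming
    the decoder plus the length of the encoded data. Smaller = tighter bound."""
    return {
        "literal": 8 * len(data) + 8,
        "zlib": 8 * len(zlib.compress(data, 9)) + 64,
        "bz2": 8 * len(bz2.compress(data)) + 64,
    }

data = b"abab" * 250_000
bounds = kc_upper_bounds(data)
print(bounds, "tightest:", min(bounds, key=bounds.get))
```

Whichever entry is smallest is the current ‘best’ model in the sense described above, and anyone who disagrees is invited to produce one with a smaller total.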
Can you help me with the question: “How do you decide which of a finite number of hypotheses, each of which is of finite complexity, is simplest, using approximations of Kolmogorov complexity?”
Also, my understanding is that people normally approximate Kolmogorov complexity by using non-Turing-complete models of computation, such as the finite-state automata that Hutter and his co-authors recently used to play Pac-Man: http://il.youtube.com/watch?v=RhQTWidQQ8U
If the model of computation is not Turing complete, then complexity measurements derived from it do not have the “uniqueness” property that we’re talking about.
Thanks for that link, I wasn’t aware of Monte Carlo AIXI. It’s amusing how little it took to make Nesov and Roko frightened.
An interposed question: can you or someone point me to an introduction or starting point for the overall topic of Kolmogorov complexity (so I am able to follow, not contribute)? Or is an explanation already contained in the sequences? This idea seems to be discussed quite often on LW. Thank you.
I don’t think it’s been thoroughly explained in the sequences, but why not start with the Wikipedia page? Or this post.
On second thought, this is actually a very simple idea that can be explained in a comment. Let me try:
The Kolmogorov complexity of a finite string of bytes is the size (in bytes) of the shortest Python program that outputs that string and then halts. Or you could use any other language in place of Python and get a different value of complexity—but it cannot differ by more than a constant as the string grows, because you could always implement a constant-size Python interpreter in the other language to “convert” one description into another. So complexity is not unique, but all definitions of complexity sorta “agree” as we go to infinity.
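The standard way to state that invariance (notation mine): for any two languages A and B there is a constant c_{A,B}, the length of an interpreter for B written in A, such that K_A(x) \le K_B(x) + c_{A,B} for every string x. The constant depends on the pair of languages but not on x, which is why the choice of language stops mattering as strings get long.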
For example, if a string consists of “a” repeated a million times, its Kolmogorov complexity is quite small because you can write a very short program that outputs a million a’s and then halts. But this is not the case for all possible strings. Here’s a simple way to see why: there can’t be more Python programs of length N than there are different byte strings of length N, because any program is itself a byte string. Therefore, you cannot encode all strings of length N using programs of length N/2 or something, because there aren’t enough of them. In fact, this shows that most strings of length N must have complexity close to N. (Obviously it can’t be much larger than N, because any string of length N can be output by a program of length N+c that “hardcodes” that specific string.)
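To make the two bounds concrete, here is a throwaway comparison of the two kinds of program mentioned above (program texts held as Python strings; the exact byte counts are incidental):

```python
s = "a" * 10**6

short_program = 'print("a" * 10**6)'          # 18 bytes: witnesses that K(s) is small
hardcoded_program = "print(" + repr(s) + ")"  # ~10**6 + c bytes: the trivial upper bound

print(len(short_program), len(hardcoded_program))
```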
Another important observation. Imagine you have a huge, random-looking string of bytes. How can you calculate its Kolmogorov complexity? It turns out that you can’t, because there’s no easier way to do that than, essentially, simulating all programs up to length N and seeing what they output. But this is impossible because any given program can (say) hang idly for a million years and only then start outputting symbols, and you have no way of learning that reliably from the program text (this is known as the “halting problem”). So K-complexity is largely a theoretical notion; I have no idea why people love it so.
Imagine some Oracle told you that obtaining the ultimate theory of physics—the one that unifies relativity and quantum gravity, and all that—was impossible. Then following your logic you might conclude that it was pointless to study physics, since one can never obtain the perfect theory. But that would be wrong; it would still be highly worthwhile to attempt to obtain better and better physical theories.
Thanks, great! I think this might be another introductory explanation.
Here is more: Information vs. Meaning
Kolmogorov-Chaitin information theory is focused on quantification—that is, on being able to measure the quantity of information in a string of characters. It doesn’t care what the string means; in fact, if you consider meaning at all, then you’re not doing information theory. A truly abstract string could mean many different things, so information theory’s measurement of its content must include everything that could be used, relative to any possible system of meaning, to determine the information content of the string.
Which notion do you prefer?
Haven’t thought too hard about this question. In my mind complexity is filed away as a “mystery”, on the same shelf as frequentism vs Bayesianism, decision theory and other things. I know the state of the art and am convinced that it’s unsatisfactory, but how to fix it is unclear. You could carve out a nice research area for yourself if you took any of these questions and ran with it :-)
I have a vague idea that one shouldn’t look for a single correct notion of complexity, but instead try to find some reasonable properties that any measure should have, and study them all at once. For instance, if I propose a measure that turns out to be equivalent to square-root of K-complexity, who’s to say it’s better or worse? More seriously, one could imagine complexity measures like “the time it would take to explain in English”, “the time it would take to explain in Japanese”, “the time it would take a smart person to explain”, “the time it would take a stupid person to explain”...
But when I try to think about “properties a measure should have” all I can come up with is a kind of monotonicity: a complexity measure is a real-valued function on strings whose value on a given string is larger than the value on an (initial?) substring. That is not even true of K-complexity. (E.g. if N is an integer with high complexity, but less than 10^100, then a string of N repeated zeroes will have higher complexity than a string of 10^100 zeroes.)
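Spelling out that counterexample (c and c′ are the usual additive constants): since N can be recovered from the string 0^N as its length, K(0^N) \ge K(N) - c, which can be close to \log_2 10^{100} \approx 330 bits for a suitably chosen N; whereas K(0^{10^{100}}) \le K(10^{100}) + c′, which is tiny because 10^100 itself has a very short description.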
So many topics, so little time! :)
It’s amazing how much a person can do if some topic manages to interest them more than the Internet, even for a little while.
A timely reminder. I’d better go back to obsessing about nootropics and see if I cannot amaze myself somewhat.
Damn Lesswrong and its “Recent Comments:” and its orange envelope icon.
When you reach a result, be sure to post it and make the Internet a little more alluring for all of us :-)
This suggests a new focus—drugs to treat internet addiction and promote procrastination resistance. The experimental procedure seems obvious: the double-blind, placebo-controlled T4ET (TV Tropes—Time To Exit Test).
This concept sounds like one that could be a useful addition to your ‘could’, ‘probability’ mini-series. Putting things into ‘Python’ language, as you have done, makes the concepts much simpler to apply.
That mini-series is about stuff that I make up myself. K-complexity is a well known notion, maybe you could ask JoshuaZ—he was planning to write some introductory math posts.
Ok. I actually considered the ‘could’ post a well known notion just expressed in an intuitive format.
Not a bad idea. Failing that I could always write one myself. But for now I’ll just hope your comment above comes up in the custom search when I’m looking for a link target. You touched on most of the important details (albeit briefly) so that was about 2⁄3 post ready as is.
Sorry, I thought it was old news in LW circles. Here’s the actual paper:
http://jveness.info/publications/veness_rl_via_aixi_approx.pdf
It’s not the specific result that’s frightening, but the overall research direction, where each result contributes exclusively to destroying the world. We’ve gotten a tiny step closer to disaster; job well done.
I see. Thanks.
Question: How many bits does adding a recursive feature take up?