Kevin T. Kelly’s Ockham Efficiency Theorem
There is a game studied in the philosophy of science and in Probably Approximately Correct (machine) learning. It’s a cousin of the Looney Labs game “Zendo”, but less fun to play with your friends. http://en.wikipedia.org/wiki/Zendo_(game) (By the way, playing this kind of game is excellent practice at avoiding confirmation bias.) The game has two asymmetric players: one plays Nature, and the other plays Science. First Nature makes up a law, a specific Grand Unified Theory, and then Science tries to guess it. Each round, Nature provides some information about the law, and then Science may revise its guess. Science wins if its guesses converge to the law that Nature made up.
Kevin T. Kelly is a philosopher of science who studies (among other things) how the philosophy of science justifies William of Ockham’s razor: “entities should not be multiplied beyond necessity”. He does this by proving theorems about the Nature/Science game with all of the details spelled out.
Why should you care? Firstly, his justification is different from the overfitting justification that sometimes shows up in Bayesian literature. Roughly speaking, the overfitting justification characterizes our use of Occam’s razor as pragmatic—we use Occam’s razor to do science because we get good generalization performance from it. If we found something else (e.g. boosting and bagging, or some future technique) that proposed oddly complex hypotheses, but achieved good generalization performance, we would switch away from Occam’s razor.
An aside regarding boosting and bagging: These are ensemble machine learning techniques. Suppose you had a technique that created decision tree classifiers, such as C4.5 (http://en.wikipedia.org/wiki/C4.5_algorithm), or even decision stumps (http://en.wikipedia.org/wiki/Decision_stump). Adaboost (http://en.wikipedia.org/wiki/Adaboost) would start by weighting all of the examples identically and invoking your technique to find an initial classifier. Then it would reweight the examples, prioritizing the ones that the first iteration got wrong, and invoke your technique on the reweighted input. Eventually, Adaboost outputs an ensemble of decision trees (or decision stumps), and taking the majority opinion of the ensemble might well be more effective (generalize better beyond training) than the original classifier. Bagging is similar (http://en.wikipedia.org/wiki/Bootstrap_aggregating).
Ensemble learning methods are a challenge for “prevents overfitting” justifications of Occam’s razor since they propose weirdly complex hypotheses, but suffer less from overfitting than the weak classifiers that they are built from.
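To make the boosting aside concrete, here is a minimal sketch in Python (my own illustration, assuming scikit-learn is available; the synthetic dataset and parameter choices are arbitrary). It compares one decision stump against an AdaBoost ensemble of stumps on held-out data.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, split into training and held-out test sets.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One weak learner: a depth-1 decision tree (a decision stump).
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)

# AdaBoost's default weak learner is also a depth-1 tree; it reweights
# misclassified examples each round and outputs a weighted ensemble.
ensemble = AdaBoostClassifier(n_estimators=100).fit(X_train, y_train)

print("single stump, held-out accuracy: ", stump.score(X_test, y_test))
print("boosted ensemble, held-out accuracy:", ensemble.score(X_test, y_test))

The ensemble is a far messier hypothesis than any single stump, yet it typically scores better on the held-out data, which is exactly the tension with the overfitting justification described above.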
Secondly, his alternative justification bears on ordering hypotheses “by simplicity”, providing an alternative to (approximations of) Kolmogorov complexity as a foundation of science.
Let’s take a philosopher’s view of physics—from a distance, a philosopher listens to the particle physicists and hears: “There are three fundamental particles that make up all matter!”, “We’ve discovered another, higher-energy particle!”, “Another, even higher-energy particle!”. There is no reason the philosopher can see why this should stop at any particular number of particles. What should the philosopher believe at any given moment?
This is a form of the Nature vs. Science game where Nature’s “Grand Unified Theory” is known to be a nonnegative number (of particles), and at each round Nature can reveal a new fundamental particle (by name) or remain silent. What is the philosopher of science’s strategy? Occam’s razor suggests that we prefer simpler hypotheses, but if there were 254 known particles, how would we decide whether to claim that there are 254, 255, or 256 particles? Note that in many encodings, 255 and 256 are quite round numbers and therefore have short descriptions (low Kolmogorov complexity).
An aside regarding “round numbers”: Here are some “shortest expressions” of some numbers, according to one possible grammar of expressions. (Kolmogorov complexity is unique up to an additive constant, but since we can never actually compute Kolmogorov complexity, that isn’t particularly helpful.)
1 == (1)
2 == (1+1)
3 == (1+1+1)
4 == (1+1+1+1)
5 == (1+1+1+1+1)
6 == (1+1+1+1+1+1)
7 == (1+1+1+1+1+1+1)
8 == ((1+1)*(1+1+1+1))
9 == ((1+1+1)*(1+1+1))
10 == ((1+1)*(1+1+1+1+1))
11 == (1+1+1+1+1+1+1+1+1+1+1)
12 == ((1+1+1)*(1+1+1+1))
13 == ((1)+((1+1+1)*(1+1+1+1)))
14 == ((1+1)*(1+1+1+1+1+1+1))
15 == ((1+1+1)*(1+1+1+1+1))
16 == ((1+1+1+1)*(1+1+1+1))
17 == ((1)+((1+1+1+1)*(1+1+1+1)))
18 == ((1+1+1)*(1+1+1+1+1+1))
19 == ((1)+((1+1+1)*(1+1+1+1+1+1)))
20 == ((1+1+1+1)*(1+1+1+1+1))
21 == ((1+1+1)*(1+1+1+1+1+1+1))
22 == ((1)+((1+1+1)*(1+1+1+1+1+1+1)))
24 == ((1+1+1+1)*(1+1+1+1+1+1))
25 == ((1+1+1+1+1)*(1+1+1+1+1))
26 == ((1)+((1+1+1+1+1)*(1+1+1+1+1)))
27 == ((1+1+1)*((1+1+1)*(1+1+1)))
28 == ((1+1+1+1)*(1+1+1+1+1+1+1))
30 == ((1+1+1+1+1)*(1+1+1+1+1+1))
32 == ((1+1)*((1+1+1+1)*(1+1+1+1)))
35 == ((1+1+1+1+1)*(1+1+1+1+1+1+1))
36 == ((1+1+1)*((1+1+1)*(1+1+1+1)))
40 == ((1+1)*((1+1+1+1)*(1+1+1+1+1)))
42 == ((1+1+1+1+1+1)*(1+1+1+1+1+1+1))
45 == ((1+1+1)*((1+1+1)*(1+1+1+1+1)))
As you can see, there are several inversions, where a larger number (in magnitude) is “simpler” in that it has a shorter shortest expression (in this grammar). The first inversion occurs at 12, which is simpler than 11. In this grammar, squares, powers of two, and smooth numbers generally (http://en.wikipedia.org/wiki/Smooth_number) will be considered simple. Even though “the 256th prime” sounds like a short description of a number, this grammar isn’t flexible enough to capture that concept, and so this grammar does not consider primes to be simple. I believe this illustrates that picking a particular approximate Kolmogorov-complexity-like concept amounts to justifying Occam’s razor by aesthetics. Humans can argue about aesthetics, and be convinced by arguments somewhat like logic, but ultimately it is a matter of taste.
In contrast, Kelly’s idea of “simplicity” is related to Popper’s falsifiability. In this sense of simplicity, a complex theory is one that can (for a while) camouflage itself as another (simple) theory, but a simple theory cannot pretend to be complex. So if Nature really had 256 particles, it could refuse to reveal them for a while (maybe they’re hidden in “very high energy regimes”), and the 256-particle universe would exactly match the givens for a 254-particle universe. However, the 254-particle universe cannot reveal new particles; it’s already shown everything that it has.
Remember, we’re not talking about elaborate data analysis, where there could be “mirages” or “aliasing”, patterns in the data that look initially like a new particle, but later reveal themselves to be explicable using known particles. We’re talking about a particular form of the Nature vs Science game where each round Nature either reveals (by name) a new particle, or remains silent. This illustrates that Kelly’s simplicity is relative to the possible observables. In this scenario, where Nature identifies new particles by name, then the hypothesis that we have seen all of the particles and there will be no new particles is always the simplest, the “most falsifiable”. With more realistic observables, the question of what is the simplest consistent hypothesis becomes trickier.
A more realistic example that Kelly uses (following Kuhn) is the Copernican hypothesis that the earth and other planets circle the sun. In what sense is it simpler than the geocentric hypothesis? From a casual modern perspective, both hypotheses might seem symmetric and of similar complexity. The crucial effect is that Ptolemy’s model parameters (velocities, diameters) have to be carefully adjusted to create a “coincidence”—apparent retrograde motion that always coincides with solar conjunction (for Mercury and Venus) and solar opposition (for the other planets). (Note: Retrograde means moving in the unusual direction relative to the background stars, east to west. Conjunction means two entities appear near each other in the sky. Opposition means the entities are at opposite points of the celestial sphere.) The Copernican model “predicts” that coincidence; not in the sense that the creation of the model precedes knowledge of the effect, but in the sense that any tiny deviation from exact coincidence discovered in the future would be evidence against the Copernican model. In this sense, the Copernican model is more falsifiable; simpler.
The Ockham Efficiency Theorems explain in what sense this version of Occam’s razor is strictly better than other strategies for the Science vs. Nature game. If what we care about is the number of public mind changes (saying “I was mistaken, actually there are X particles.” would count as a mind change for any X), and the timing of the mind changes, then Occam’s razor is the best strategy for the Science vs. Nature game. The Occam strategy for the number-of-particles game will achieve exactly as many mind changes as there are particles. A scientist who deviates from Occam’s razor allows Nature to extract a mind change from the (hasty) scientist “for free”.
The way this works in the particles game is simple. To extract a mind change from a hasty scientist who jumps to predicting 12 particles when they’ve only seen 11, or 256 particles when they’ve only seen 254, Nature can simply continuously refuse to reveal new particles. If the scientist doesn’t ever switch back down to the known number of particles, then they’re a nonconvergent scientist—they lose the game. If the scientist, confronted with a long run of “no new particles found” does switch to the known number of particles, then Nature has extracted a mind change from the scientist without a corresponding particle. The Occam strategy achieves the fewest possible number of mind changes (that is, equal to the number of particles), given such an adversarial Nature.
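Here is a toy simulation of that argument (my own sketch, not Kelly’s formalism; the “hasty” rule and the fixed observation sequence are arbitrary choices). Each round is a 1 (Nature reveals a new particle) or a 0 (silence); a strategy maps the history so far to a guessed particle count, and we count how often the announced guess changes.

def count_mind_changes(strategy, history):
    # Count how many times the strategy's announced guess changes over the run.
    changes = 0
    prev = strategy([])                      # initial guess, before any data
    for t in range(len(history)):
        guess = strategy(history[:t + 1])
        if guess != prev:
            changes += 1
            prev = guess
    return changes

def occam(history):
    # Believe in exactly the particles seen so far.
    return sum(history)

def hasty(history):
    # Jump two particles ahead after a discovery; retreat after 5 silent rounds.
    seen = sum(history)
    silence = 0
    for obs in reversed(history):
        if obs:
            break
        silence += 1
    return seen if silence >= 5 else seen + 2

# An adversarial run: long silences punctuated by three particle discoveries.
history = ([0] * 10 + [1]) * 3 + [0] * 10
print(count_mind_changes(occam, history))    # 3: one mind change per particle
print(count_mind_changes(hasty, history))    # 7: every retreat is an extra, free mind change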
The “Ockham Efficiency Theorems” refer to the worked-out details of more elaborate Science vs. Nature games—where Nature chooses a polynomial GUT, for example.
This entire scenario does generalize to noisy observations as well (for example, learning Pearlian causal graphs), though I don’t understand this aspect fully. If I understand correctly, the scientist guesses a probability distribution over the possible worlds, and you count “mind changes” as changes of that probability distribution, so adjusting a 0.9 probability to 0.8 would be counted as a fraction of a mind change. Anyway, read the actual papers; they’re well-written and convincing.
www.andrew.cmu.edu/user/kk3n/ockham/Ockham.htm
This post benefited greatly from encouragement and critique by cousin_it.
As Unknown pointed out, using Kolmogorov complexity is a better bet if winning means finding the truth quickly, instead of just eventually.
The rules of this game, plus the assumption that Nature is maximally adversarial, seem specifically designed so that this particular version of Ockham would be optimal. It doesn’t really seem to provide much insight into what one should do in more general/typical settings.
The question that Kelly is trying to answer is “Why, or in what sense, does Occam’s razor work?”. Yes, the answer is “It works in that it is worst-case-efficient.” He doesn’t assume a simplicity-biased prior (which some might characterize as a circular justification of Occam’s razor).
I worry that the particular example that I’m presenting here (where the Occam strategy steps in a vaguely linear fashion) is coloring your thinking about Kelly’s work generally, which has little or nothing to do with stepping in a vaguely linear fashion, and everything to do with believing the simplest theory compatible with the evidence. (Which, based on your other writings, I think you would endorse.)
Please investigate further, maybe read one of Kelly’s papers, I would hate for my poor writing skills to mislead you.
Is the particles example one of Kelly’s own examples, or something you made up to explain his idea? Because in that example at least, we seem to be assuming a game with very specific rules, showing that a particular strategy is optimal for that game, and then calling that strategy “Occam”. Compare this with Solomonoff induction, where we only have to assume that the input is computable and can show that Bayesian updating with a prior based on Kolmogorov complexity is optimal in a certain sense.
I do endorse that, but there are various aspects of the idea that I’m still confused about, and I’m not seeing how Kelly’s work helps to dissolve those confusions.
I think I read some of Kelly’s own writings the last time cousin_it pointed them out, but again, didn’t really “get it” (in the sense of seeing why it’s important/interesting/insightful). I’m hoping that I’m just missing something, and you, or someone else, can show me what it is. Or perhaps just point to a specific paper that explains Kelly’s insights most clearly?
The particles example is derived from one of Kelly’s examples (marbles in a box).
Kolmogorov complexity is unavailable—only approximations of it are available, and they’re not unique. There are multiple reasonable grammars that we can use to do MDL, and no clear justification for why we should use one rather than another. For an extreme example, imagine measuring simplicity by “size of the patch to the Catholic Church’s dogma.”
Kelly’s notion that Occam means worst-case-efficient-in-mind-changes allows us to avoid the prior “dropping like manna from heaven”.
I recommend this paper:
http://www.hss.cmu.edu/philosophy/kelly/papers/bonn5.pdf
but if that one doesn’t do it for you, there are others with more cartoons and fewer words:
http://www.fitelson.org/few/few_05/kelly_2.pdf
Or ones with more words and fewer cartoons:
http://www.hss.cmu.edu/philosophy/kelly/papers/Ch4-Glymour%20&%20Kelly-final.pdf
In the first paper you cite, there is a particles example that is essentially the same example as yours, and Kelly does use it as his main example. The only difference is instead of counting the number of fundamental particles in physics, his example uses a device that we know will emit a finite number of particles.
On page 33, Kelly writes about how his idea would handle a modification of the basic example.
So suppose this device has been emitting one particle every 10 seconds for the last million seconds. According to Kelly’s version of Ockham’s Razor (perhaps we should call it Kelly’s Razor instead?), we can’t predict that the next particle will come 10 seconds later. What use is Kelly’s idea, if I want to have a notion of complexity that can help us (or an AI) make decisions in general, instead of just playing some specific games for which it happens to apply?
You read the paper! Thanks for pointing out that we know somehow that only a finite number of particles will ever be found.
To explain the “oneicle” problem: It seems like how a scenario is coded into a game matters. For example, if you viewed the timed particles game as having two possible worlds “The device will always emit a particle every 10 seconds.” and “The device will sometimes emit a particle every 10 seconds.”, then the first world cannot pretend to be the second world, but the second world can camouflage itself as the first world for a time, and so (Kelly’s version of) Occam’s razor says the first is simpler—we get the intuitively correct answer.
The alternative coding is somewhat analogous to the color “grue” (which is green up until some date, and blue thereafter). You recode the problem to talk about “oneicles”, a concept that refers to non-particles up to time 1, and particles thereafter. If you allow this sort of recoding, then you would also allow “twoticles”, and the infinite hierarchy of symmetric re-codings causes a problem. I tend to think this is a technical problem that is unlikely to expand into the philosophy part of the theory, but I’m kind of an idiot, and I may be missing something—certainly we would like to avoid coding-dependence.
That’s a problem (the first problem Kelly mentioned in that paper), but do you really require a theory to have no problems remaining in order for it to be counted as insightful? No one else addresses the question “Where does the prior come from?”.
It would be one thing if Kelly said that the theory currently can’t predict that another particle will come in 10 seconds, but he hopes to eventually extend it so that it can make predictions like that. But instead he says that Ockham is mute on the question, and that’s the right answer.
Neither does Kelly. I don’t see how we can go from his idea of Ockham to a Bayesian prior, or how to use it directly in decision making. Kelly’s position above suggests that he doesn’t consider this to be the problem that he’s trying to solve. (And I don’t see what is so interesting about the problem that he is trying to solve.)
Okay, I think we’ve reached a point of reflective disagreement.
I agree with you that Kelly was wrong to be enamored of his formalization’s output on the timed particles example; it’s either a regrettable flaw that must be lived with, or a regrettable flaw that we should try to fix, and I don’t understand enough of the topological math to tell which.
However, the unjustified Occam prior in the standard Bayesian account of science is also a regrettable flaw—and Kelly has demonstrated that it’s probably fixable. I find that very intriguing, and am willing to put some time into understanding Kelly’s approach—even if it dissolves something that I previously cherished (such as MDL-based Occam priors).
Reasonable people can reasonably disagree regarding which research avenues are likely to be valuable.
I am very late to the discussion. I have not read Kelly’s papers in detail, so pardon me if my question betrays a fundamental misunderstanding of what you wrote: How can “(Kelly’s version of) Occam’s razor says the first [world] is simpler” and give us “the intuitively correct answer” if an infinite number of particles will be emitted in the first world, even though Kelly has already specified that the device will only emit a finite number of particles?
The statements, though contradictory, refer to two different thought experiments.
I see. Thanks for the explanation.
Nice post! Couple quibbles/questions.
It’s not a challenge unless you think that each of the weak classifiers should be powerful enough to do the classification correctly. Occam = “Do not multiply entities beyond need”, not “Do not multiply entities”.
The Occam strategy isn’t to say the number of particles that exist equals the number observed. It’s to use the simplest underlying theory. We have theories with fewer moving parts than number of particles predicted. Maybe you mean “fundamental concepts” when you say “particles”; but in that case, the non-Occam strategy is a straw man, since I don’t think anyone proposes new concepts that provide no additional explanatory power.
Can you elaborate your sentence that begins “It’s not a challenge..”?
My understanding is that if our real justification for “Why do we use Occam’s razor?” was “Because that way we avoid overfitting.” then if a future statistical technique that outperformed Occam by proposing weirdly complex hypotheses came along, we would embrace it wholeheartedly.
Boosting and bagging are merely illustrative of the idea that there might be a statistical technique that achieves good performance in a statistical sense, though nobody believes that their outputs (large ensembles of rules) are “really out there”, the way we might believe Schroedinger’s equation is “really out there”.
We only use boosting if our set of low-complexity hypotheses does not contain the solution we need. And instead of switching to a larger set of still-low-complexity hypotheses, we do something much cheaper, a second best thing: we try to find a good hypothesis in the convex hull of the original hypothesis space.
In short, boosting “outperforms Occam” only in man-hours saved: boosting requires less thinking and less work than properly applying Occam’s razor. That really is a good thing, of course.
I second the OP’s recommendation of Zendo. It could’ve been named “Induction: The Game”, it’s a great medium for practicing (and teaching!) some basic science and rationality skills, and it’s also a heck of a lot of fun. Plus, you can use the neat little Zendo pyramids (also called Icehouse pieces) for various other games.
I agree,
I’m not sure if I should go with Treehouse or IceTowers...
If I buy IceTowers I could probably play Treehouse, but not the other way round...
:)
Hm, I’d never heard of Treehouse before, thanks for the link. I’ll try that out next time I bring my Zendo set somewhere.
You’re right that trying to do Zendo with only a Treehouse set would be very frustrating. Even with a full 60 piece Zendo set, it doesn’t take more than a few rounds before you have to start taking apart some of the older experiments to build new ones.
You don’t really need the pyramids, though they’re definitely neat. You can do it with coins, dice, little glass beads, whatever. With words and phrases, it makes a car game; the game “Green Glass Door” can be seen as an instance.
I call this game “playing Science”.
I’m not sure this is true. Overfitting means that a classifier achieves good empirical performance (on the known data), but bad generalization performance (on new, previously unseen data). The weak classifiers used in Adaboost et al. don’t really suffer from overfitting; they just have bad empirical performance (this is why they are called “weak”). Also, ensemble methods do suffer from overfitting.
It is very hard to analyze complexity using naive methods. For example, let’s say we have an Adaboost classifier built from 4 weak classifiers with relatively large weights. Then we add another classifier, but with a small weight. Now, did the complexity increase by 25%? Why? If the 5th weight is small, the 5-classifier should be nearly identical to the 4-classifier, so it should achieve about the same generalization performance.
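A toy sketch of that point (my own, with made-up weights and simple ±1 votes, needing numpy): adding a fifth vote with a tiny weight leaves the ensemble’s decisions unchanged, even though a naive description-length count jumps by roughly 25%.

import numpy as np

rng = np.random.default_rng(0)
votes = rng.choice([-1, 1], size=(10_000, 5))    # predictions of 5 weak classifiers
w4 = np.array([1.0, 0.9, 0.8, 0.6, 0.0])         # effectively a 4-classifier ensemble
w5 = np.array([1.0, 0.9, 0.8, 0.6, 0.05])        # same ensemble plus a small 5th weight

pred4 = np.sign(votes @ w4)                      # weighted-majority decisions
pred5 = np.sign(votes @ w5)
print((pred4 == pred5).mean())                   # 1.0 for these weights: the decisions never differ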
According to the MDL / approximate-Kolmogorov-complexity notion of complexity, to measure the complexity of a hypothesis, fix a notation and count bits of the description of that hypothesis in that notation.
It’s certainly possible to find a notation where the scenario you describe has the small increment in complexity that you expect. But why pick that notation rather than another? How do you justify your priors?
Because you need the right notion of complexity in order to prevent overfitting. To prevent overfitting, you need to penalize highly expressive model classes. The MDL complexity measure does not perfectly capture the notion of expressivity, especially if you use a naive encoding method. An ensemble with many subclassifiers may require many bits to specify, but still not be very expressive, if the weights are small.
I think we’re not trying to prevent overfitting. We’re talking about the problem of induction.
How do we know that a method that prevented overfitting in the past will continue to prevent overfitting in the future? Appealing to “induction has always worked before” is circular.
I think it was Hume who asked: How do we know that bread will continue to nourish humans? It always has before, but in what sense are past observations logical grounds for anything? We would need some magical law of nature that says “Nature definitely won’t just change all the rules.” But of course, there is no such law, and Nature might change all the rules at any time.
Suppose there are two kinds of worlds, one where bread always nourishes, and one where bread sometimes nourishes. The first variety is strictly simpler, because the second can camouflage itself as the first, matching the givens for an arbitrary amount of time, but the converse is not true. The Occam strategy of believing the simplest hypothesis consistent with the givens has this pleasant property: It is the best strategy as measured in worst-case mind changes against an adversarial Nature. Therefore, believe that bread will continue to nourish.
I worry that I’m just muddying the water here.
I think Occam’s razor is fairly easy to justify in general.
I would say that the problem is resolved if we assume that our reference class is “every conceivable, formally describable universe”—and the description in “formally describable” doesn’t just describe the universe’s state at an instant: It describes the entire universe, and its history, as an object. We should assume that one of those objects corresponds to our world.
Once we have that, we have some data from experiments, and we generate a partial model, which is an algorithm that accepts past data and predicts future data. The model needs to work for past data, and we want one that is as likely as possible to predict future data correctly. We are hoping that our partial model correctly describes the behavior of our universe—which is one of the ones in the reference class. The greater the information content in the partial model, the more specific it is being, and the smaller the proportion of possible universes in the reference class that will agree with it. The smaller the information content in the partial model, the more general it is being, and the greater the proportion of possible universes in the reference class that will agree with it and, therefore, the greater will be the chance that one of these universes happens to be the real one (or the one in which we are living if you subscribe to some form of modal realism—not that it matters whether you do or not).
This should easily deal with the issue of why, when you see ordered behavior in reality, you should expect to see continued ordered behavior. It doesn’t resolve these issues:
Why do we see any ordered behavior in the first place? None of this imposes any ordered behavior on reality. It simply says that if you see some you should expect more to follow. Any simplicity that you observe does not imply that reality is simple: it simply means that your partial model is relying on a simple feature of reality that happened to be there—a very different thing. It does nothing to stop reality being a chaotic mess. However, an anthropic argument might be used here.
It doesn’t resolve the issue of the coding system used for the descriptions—although I think that issue can be resolved without too much trouble—though I am sure others would disagree.
I think the phrase you used “the proportion of possible universes” isn’t, in general, well defined without a measure (effectively a probability distribution) on that space—and there isn’t a unique best probability distribution.
Kolmogorov complexity also depends on the description language used to create strings. If the language has a recursive feature, for example, this would assign lower complexity to members of the Fibonacci sequence.
So, part of the task of Science is to induce what Nature’s description language is.
The whole point of the KC is that it doesn’t depend on the specific programming language (or Turing machine) selected except up to an additive constant. If your language contains a recursive feature, it will assign lower complexity to Fibonacci numbers than my non-recursive language does, but the difference is at most the codelength required for me to simulate your machine on my machine.
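In symbols, this is the usual invariance theorem: for any two universal machines A and B there is a constant c_{A,B} (roughly, the length of a program for A that simulates B), independent of the string x, such that K_A(x) ≤ K_B(x) + c_{A,B}.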
I wrote this some years back. It is probably not without its flaws (and got some criticism on Less Wrong), but the general idea may be relevant: http://www.paul-almond.com/WhatIsALowLevelLanguage.htm
AFAICT, the ‘proposed solution’ from WiaLLL doesn’t work. The rest of this post will be devoted to proposing a specific attack on it. As such it will be longish—since WiaLLL makes a strong claim about a fundamental aspect of epistemology, it deserves a considered refutation.
My attack works on the assumption that current sets count programs, not languages—i.e. two different programs that implement the same language will result in two entries in the next current set (regardless of whether they are written in the same language). Paul Almond has told me that this is what he intended in WiaLLL.
Consider a family of turing-complete languages, whose members are called Xn (n ∈ N). I will use ‘X-like’ to refer to any language from this family. Let X := X8.
Programs for any language Xn can be divided into three categories:
Programs that begin with n 0 bits. This prefix is solely used to encode that these programs are of type 1; it encodes no other data. The rest of the program source encodes a program in some suitable turing-complete language (the details of which don’t matter for this attack).
Programs that have a length of 0 (i.e. empty strings). All such programs are defined to implement X.
Programs that don’t fit either 1 or 2. All such programs are defined to implement X(n+1).
It follows that the set of programs of any bounded length (>0) writable in any Xn (n>0) is dominated by interpreters for X-like languages with a greater n than their own. While all such languages are turing-complete, for any given program length the proportion of non-X-like-interpreting programs writable in them is small, and inversely exponential in n.
What happens when we run the proposed solution from WiaLLL on this X? Assuming sufficiently high values of N, on each step but the first the current set will consist of a mix of X-likes and other languages. For purposes of calculating a lower bound on the long-term proportion of X-likes, assume that programs for non-X-likes never implement an X-like. The X-likes will then tend to thin out, since some of their programs fall into category 1 and implement non-X-likes; however this process is bounded to produce a fraction of non-X-likes no larger than \sum_{i=8}^\infty (1/2^i) = 0.0078125.
The X-likes will also thin themselves out through programs in category 2, since X itself will produce a fixed fraction of non-X-likes. How much they do so depends on the exact growth of N and I in the limit. Given a function I(N) which maps values for program length to the number of iterations to run with this length, one can calculate a rough upper bound on the proportion of non-X-likes produced in this manner as 1-\prod_{N=N0}^\infty (1-I(N)/2^N).
The detailed results of this process therefore depend on the choice of I(N), and of N0. The general relationship is that unless I(N) grows quite quickly in N, the X-likes will only be notably thinned out for small N, and therefore will not be thinned out significantly given a sufficiently big N0. For example, for I(N)=16 and N0=16 (16 iterations per maximum program length, minimum maximum program length of 16 bits) the result is about 0.00049.
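A quick numerical check of those two bounds (my own, using the I(N)=16, N0=16 example just above):

import math

print(round(sum(1 / 2**i for i in range(8, 60)), 7))                    # 0.0078125: the category-1 bound
print(round(1 - math.prod(1 - 16 / 2**N for N in range(16, 200)), 5))   # 0.00049: the category-2 bound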
Therefore, with free choice of the starting language and some reasonable assumptions about the limiting process used for I and N, we can ensure that the language mix in the final current set will be dominated by X-likes.
Why does all of this matter? Because any X-like especially likes to implement X (since category 2 programs have the minimum length of 0 bits), and for sufficiently large n really doesn’t like to implement any non-X-like, and as such any current set containing a sufficient proportion of X-likes will rate X as lower-level than any other language. This argument might in theory fail because of the possibility that other languages would on average penalize X-likes much more than X-likes penalize them, but if so this can be worked around by setting X to a higher Xn than 8 - the family can be redefined to set the initial n to any integer we need it to be to work around such finite factors.
Therefore, unless I’m missing something critical, the output of the procedure proposed in WiaLLL would be highly sensitive to the starting language as well as I(N) and N0. It therefore fails in its attempt to define a simplicity measure on languages in the absence of additional assumptions not stated in the article.
(Part 2) But hang on—I can see an easy way to improve on Paul’s definition:
He looks at all implementations of languages, of length less than or equal to N, takes a simple arithmetic mean and then lets N go to infinity. But how about we take a weighted arithmetic mean where the weight of a given implementation is 1/2^(its length) (and then we normalise if necessary so that the weights sum to 1). (This does away with the need for N and taking a limit as N goes to infinity.) Then in your example above, the fact that there is always an implementation of X whose length is 0 means that a fixed proportion of the weight previously assigned to X-like languages ‘bleeds away’ on every iteration.
I think we can use a “pathological” language to make X[0] particularly hard to implement (on average) rather than particularly easy, so it would still be false that “low-level index for Y” measured from X must be the same as measured from Y.
But perhaps we can define the “low-level index” of a language Y as the infimum of all low-level indices as measured from languages X?
(Needless to say, the trick of weighting the average by 1/2^length is very standard.)
Let’s see if I understand this correctly. You’re redefining current sets to a mapping from the set of all computable and turing-complete languages to their weights in [0,1]. Except for the first step of the process, you will actually have non-zero weights on each of those languages, so we’re working with true countably infinite sets here. This might require a stronger hypercomputer to implement than WiaLLL, but I suppose that’s fine for the purely theoretical purposes of these constructions.
Weighting each program by 1/2^i probably doesn’t work, since that still results in a total measure of 1 for each set of all programs with a given length n ∈ N (since there are 2^n programs in each of those sets), but that’s a minor detail—weighting by e.g. 1/4^i should get you a workable result.
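Spelling out the arithmetic behind that normalization point: there are 2^n programs of each length n, so weights of 1/2^n give a total of \sum_n 2^n * (1/2^n) = \sum_n 1, which diverges, while weights of 1/4^n give \sum_n 2^n * (1/4^n) = \sum_n 1/2^n, which converges.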
I think this is still attackable in so far as that it’s possible to produce a setup that will make X win over any non-X-like. It would lose to certain other X-likes in each round, but this does not hold in the (I → ∞) limit.
It’s probably not necessary for me to write up the modified attack in full detail; the basic idea is to redefine the categories of Xn programs as follows:
Programs that begin with 2n 0 bits. This prefix is solely used to encode that these programs are of type 1; it encodes no other data. The rest of the program source encodes a program in some suitable turing-complete language (the details of which don’t matter for this attack).
Programs that begin with j ∈ [n,2n) 0 bits. All such programs are defined to implement X.
Programs that don’t fit either 1 or 2. All such programs are defined to implement X(n+1).
So the support of Xn for X approaches 0 in the (n → ∞) limit, which allows the X-like family to keep a large weight through arbitrarily many iterations, and even in the limit of (I → ∞). However, it approaches 0 much slower than their support for any non-X-like, which means that X still wins over every non-X-like at each step of the simplicity computation.
Also note that the choice of a prefix length linear in the X-like level is fairly arbitrary. I think this is sufficient to get the desired effect, but if this turns out to be incorrect we can of course use any computable function here, meaning we can adjust this attack to make the fraction of non-X-likes produced fall much faster.
I don’t understand this suggestion. If I understood your previous modifications correctly, all but the first current sets of your process assign non-zero measure to every computable and turing-complete language, so taking minimums over languages from this set wouldn’t get you any useful information. As for X-likes, if you are going to treat those separately from other languages in the simplicity determination process, you’d need to specify some precise way of how to recognize them. The specific X families I’ve talked about are of course only examples; there are countably infinitely many more such families that exhibit similar properties. Please elaborate on what you meant?
So essentially, the problem with Paul Almond’s suggestion is that one can find languages X that are ‘X[0]-pathological’ in the sense that almost all of their children are both (i) good at implementing some fixed language X[0] and (ii) even more X[0]-pathological than they are (this being a recursive definition of ‘pathological’).
Such languages destroy any hope that Paul’s “low-level index for Y” won’t depend on which language X we start from.
Kolmogorov complexity isn’t an option, and approximations of Kolmogorov complexity don’t have that uniqueness property.
Even if we did have that uniqueness property, it’s still not all that helpful if you’re trying to decide among a finite number of finite-complexity hypotheses. Choice of description language is still sufficient to completely determine your choice of which hypothesis is simplest.
As far as I can tell, a MDL-ish complexity measurement that is “unique up to a constant” acts like a prior—in the limit of infinite evidence, the initial bias will wash out. Kelly is fond of analogizing this “justification” of Occam to a flat tire—it helps you get home because you can eventually fix it.
Right, but we can find a series of increasingly tight upper bounds to the KC, and the bounds are language-independent in the same sense that the KC itself is.
Then for any sufficiently large dataset, we can say that the probability of the data is given by the model that produces the tightest upper bound for the KC. This is nicely rigorous—if you think my model is unjustified and doesn’t correspond to the “real” probabilities, all you have to do is produce a new model that achieves a tighter KC-bound.
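A concrete toy version of that claim (my own sketch; it needs Python 3.9+ for random.randbytes, and zlib stands in for a generic model): a generic compressor gives one upper bound on the complexity of a string, and a short generating program gives a much tighter one.

import random, zlib

random.seed(0)
data = random.randbytes(100_000)      # pseudorandom bytes from a tiny seeded generator

print(len(zlib.compress(data)))       # ~100,000: the generic model finds no structure
# ...but the few lines above that generated `data` are themselves a description
# of it, an upper bound of well under 200 bytes. The tighter bound wins.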
Can you help me with the question: “How do you decide which of a finite number of hypotheses, each of which is of finite complexity, is simplest, using approximations of Kolmogorov complexity?”
Also, my understanding is that people normally approximate Kolmogorov complexity by using non-Turing-complete models of computation, such as the finite-state automata that Hutter and his co-authors recently used to play Pac-Man: http://il.youtube.com/watch?v=RhQTWidQQ8U
If the model of computation is not Turing complete, then complexity measurements derived from it do not have the “uniqueness” property that we’re talking about.
Thanks for that link, I wasn’t aware of Monte Carlo AIXI. It’s amusing how little it took to make Nesov and Roko frightened.
An interposed question: Can you or someone point me to an introduction or starting point regarding the overall topic related to Kolmogorov complexity (so I am able to follow, not contribute)? Or is an explanation already contained in the sequences? This idea seems to be discussed quite often on LW. Thank you.
I don’t think it’s been thoroughly explained in the sequences, but why not start with the Wikipedia page? Or this post.
On second thought, this is actually a very simple idea that can be explained in a comment. Let me try:
The Kolmogorov complexity of a finite string of bytes is the size (in bytes) of the shortest Python program that outputs that string and then halts. Or you could use any other language in place of Python and get a different value of complexity—but it cannot differ by more than a constant as the string grows, because you could always implement a constant-size Python interpreter in the other language to “convert” one description into another. So complexity is not unique, but all definitions of complexity sorta “agree” as we go to infinity.
For example, if a string consists of “a” repeated a million times, its Kolmogorov complexity is quite small because you can write a very short program that outputs a million a’s and then halts. But this is not the case for all possible strings. Here’s a simple way to see why: there can’t be more Python programs of length N than there are different byte strings of length N, because any program is itself a byte string. Therefore, you cannot encode all strings of length N using programs of length N/2 or something, because there’s not enough of them. Actually this shows that most strings of length N must have complexity close to N. (Obviously it can’t be much larger than N, because any string of length N can be output by a program of length N+c that “hardcodes” that specific string.)
Another important observation. Imagine you have a huge, random-looking string of bytes. How can you calculate its Kolmogorov complexity? Turns out that you can’t, because there’s no easier way to do that than, essentially, simulating all programs up to length N and seeing what they output. But this is impossible because any given program can (say) hang idly for a million years and only then start outputting symbols, and you have no way of learning that reliably from the program text (this is known as the “halting problem”). So K-complexity is largely a theoretical notion, I have no idea why people love it so.
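A rough illustration of the two points above, with zlib as a crude, computable stand-in for “shortest program” (it only ever gives an upper bound on K-complexity, never the true value):

import os, zlib

repetitive = b"a" * 1_000_000          # the million-a's example
random_ish = os.urandom(1_000_000)     # a typical, incompressible string

print(len(zlib.compress(repetitive)))  # on the order of a kilobyte: highly compressible
print(len(zlib.compress(random_ish)))  # roughly a million: most strings have complexity close to their length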
Imagine some Oracle told you that obtaining the ultimate theory of physics—the one that unifies relativity and quantum gravity, and all that—was impossible. Then following your logic you might conclude that it was pointless to study physics, since one can never obtain the perfect theory. But that would be wrong; it would still be highly worthwhile to attempt to obtain better and better physical theories.
Thanks, great! I think this might be another introductory explanation.
Here is more: Information vs. Meaning
Which notion do you prefer?
Haven’t thought too hard about this question. In my mind complexity is filed away as a “mystery”, on the same shelf as frequentism vs Bayesianism, decision theory and other things. I know the state of the art and am convinced that it’s unsatisfactory, but how to fix it is unclear. You could carve out a nice research area for yourself if you took any of these questions and ran with it :-)
I have a vague idea that one shouldn’t look for a single correct notion of complexity, but instead try to find some reasonable properties that any measure should have, and study them all at once. For instance, if I propose a measure that turns out to be equivalent to square-root of K-complexity, who’s to say it’s better or worse? More seriously, one could imagine complexity measures like “the time it would take to explain in English”, “the time it would take to explain in Japanese”, “the time it would take a smart person to explain”, “the time it would take a stupid person to explain”...
But when I try to think about “properties a measure should have” all I can come up with is a kind of monotonicity: a complexity measure is a real-valued function on strings whose value on a given string is larger than the value on an (initial?) substring. That is not even true of K-complexity. (E.g. if N is an integer with high complexity, but less than 10 to the one-hundred, then a string of N repeated zeroes will have higher complexity than a string of 10 to the one-hundred zeroes.)
So many topics, so little time! :)
It’s amazing how much a person can do if some topic manages to interest them more than the Internet, even for a little while.
A timely reminder. I’d better go back to obsessing about nootropics and see if I cannot amaze myself somewhat.
Damn Lesswrong and its “Recent Comments:” and its orange envelope icon.
When you reach a result, be sure to post it and make the Internet a little more alluring for all of us :-)
This suggests a new focus—drugs to treat internet addiction and promote procrastination resistance. The experimental procedure seems obvious. The double-blind, placebo controlled T4ET (Tv Tropes—Time To Exit Test).
This concept sounds like one that could be a useful addition to your ‘could’, ‘probability’ mini-series. Putting things into ‘python’ language, as you have done, makes the concepts much simpler to apply.
That mini-series is about stuff that I make up myself. K-complexity is a well known notion, maybe you could ask JoshuaZ—he was planning to write some introductory math posts.
Ok. I actually considered the ‘could’ post a well known notion just expressed in an intuitive format.
Not a bad idea. Failing that I could always write one myself. But for now I’ll just hope your comment above comes up in the custom search when I’m looking for a link target. You touched on most of the important details (albeit briefly) so that was about 2⁄3 post ready as is.
Sorry, I thought it was old news in LW circles. Here’s the actual paper:
http://jveness.info/publications/veness_rl_via_aixi_approx.pdf
It’s not the specific result that’s frightening, but the overall research direction, where each result contributes exclusively to destroying the world. We’ve got a small tiny step closer to disaster, job well done.
I see. Thanks.
Question: How many bits does adding a recursive feature take up?
I’m pretty sure Nature doesn’t have one.
Obligatory XKCD
In the game described by the OP, Nature must invent a law, which must be expressed in a language. So, in the game, there must be a description language.
As to whether (real life) Nature can be said to have a description language: I assess the validity of this concept in terms of its success or failure in explaining phenomena. I wouldn’t be ‘pretty sure’ of any hypothesis unless it were tested.
So, in a true language the results of saying something could be equal to actions?
I’m not sure how to phrase this...
I think say and do: Heat + Air + Fuel and I get Fire?
Hmm...
The game of Science vs. Nature is more complicated than that, and it’s the interesting structure that allows scientists to make predictions that are better than “what we’ve seen so far is everything there is.” In particular, the interesting thing in both chemistry and particle physics is that scientists were able to find regularities in the data (the Periodic Table is one example) that led them to predict missing particles. Once they knew what properties to look for, they usually were able to find them. When a theory predicts particles that aren’t revealed when the right apparatus is constructed, we drop the theory.
But in the meantime, you’d have a more interesting game (and closer to Zendo) if Nature gave you a way of classifying objects. In Zendo, there is only one dimension. Something is a match (I forget Zendo’s terminology) or it isn’t. [In the real world, one of the possible moves is inventing a new test. “If I hold the object up to the light, what do I see?”] Some new tests turn out not to reveal anything useful, while others give you a whole new way of categorizing all the things you thought you knew about before.
In this context, Occam’s razor is a rule about inventing rules to explain complex behavior, not rules about how many things there are. Your objective is to explain a pile of evidence, and you get to make up whatever story you like, but in the end, your story has to guide you in predicting something that hasn’t happened yet. If you can make better predictions than other scientists, you can have as complicated a rule as you like. If you and they make the same predictions, then the observers get to break the tie by deciding which rule is simpler.
What I love about this post is that, at heart, it challenges the very ability to be rational and hints at the possibility that what is most effective (and how we ultimately function) is as statistical learning machines whose understanding and predictive ability lie in structures very different from the ones we use to communicate. Our language and formal argument structures define a space of understanding that is more a reflection of our communication technology than the true nature of things. In this way, Occam’s razor and Kolmogorov complexity reflect a motivation to communicate efficiently (both with ourselves and with others). It can have practical utility in the sense that a simpler conscious model is easier to analyse and in that sense we have more chance of finding its flaws (predictive failures). In this sense a simpler theory (in the sense of being more comprehensible) is more likely to be correct if it explains the same evidence, as it literally has more analysis performed on it, because it is easier to think about. Or in an AI sense, has a larger (mentally synthesised) evaluation set.
Successful analysis of uncertainty without using a prior implies some kind of hidden Bayes-structure.
Consider hypothesis 1 and hypothesis 2. Hypothesis 2 can pretend to be hypothesis 1 but not vice versa. What does this mean?
p(D1|H2)>0
p(D2|H1)=0
Observations about D2 are clearly irrelevant to our present inference problem, instead, what matters is this:
p(D1|H1)=1
p(D1|H2)<1
p(D1|H1)>p(D1|H2)
that is, D1 causes a Bayesian agent to update in favor of hypothesis 1, because hypothesis 1 fits the evidence better.
(this is only exactly true if D1 and D2 are the only possible data, but the principle should still work in more complicated cases)
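A small numerical sketch of that update (toy numbers of my own; H1 is the simple hypothesis that predicts D1 with certainty, H2 the complex one that merely allows it):

prior_h1 = 0.5         # no Occam bias in the prior
p_d1_given_h1 = 1.0    # the simple hypothesis predicts D1 for sure
p_d1_given_h2 = 0.9    # the complex hypothesis merely allows D1 (assumed value)

posterior = prior_h1
for _ in range(20):    # observe D1 twenty times in a row
    numerator = posterior * p_d1_given_h1
    posterior = numerator / (numerator + (1 - posterior) * p_d1_given_h2)

print(round(posterior, 3))   # ~0.89: the simple hypothesis pulls ahead without any Occam prior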
So Kelly avoids using a prior by copying the rest of Bayes’ theorem. The hypotheses he is advocating are not “simple” hypotheses, they are ones that fit the data. Occam’s Razor is still missing.
This is extremely analogous to frequentist statistics, which considers worst-case scenarios rather than using a prior, and therefore achieves some of what you can do with Bayes.
If I understand you correctly, you’re saying that reasoning “non-probabilistically” about possibilities and impossibilities can be viewed as a subset of Bayesian reasoning, where instead of numerical probabilities, one uses qualitative probabilities—zero corresponds to impossible, nonzero corresponds to possible and so on.
I agree with you that you can view it that way; but there are probably ways to view any sort of reasoning as a “subset” of some other domain of reasoning.
The best analogy I have is with classical and intuitionistic logic. The implication/negation fragment of intuitionistic propositional logic can be viewed as a subset of classical logic, in that the theorems provable in intuitionistic logic are a strict subset of classical logic theorems (if you map classical implication to intuitionistic implication, and classical negation to intuitionistic negation).
However, there’s a slightly trickier mapping (http://en.wikipedia.org/wiki/G%C3%B6del%E2%80%93Gentzen_negative_translation) which introduces a lot of double negations, and if you use this mapping, classical propositional logic fits entirely inside of intuitionistic logic, and intuitionistic logic can be viewed as introducing a new “constructive” implication while keeping all of the old truths.
What I’m saying is that effective techniques for reasoning non-probablistically derive their effectiveness from their similarity to Bayesian reasoning. In some special cases, they are identical to Bayesian reasoning, but in other cases, Bayes is superior. This is because Bayes is proved, by theorem, to be Right. Not Unbiased or Robust, but Right.
http://lesswrong.com/lw/mt/beautiful_probability/ seems like the appropriate lesswrong post.
So I’m not saying it’s a subset, I’m saying it’s similar to it. One way to be similar to something is to copy some but not all of it, which is taking a subset plus adding other stuff. This is almost the same concept, obviously. Using the two logics I know best: Classical propositional logic is Bayesian logic for probabilities of 1 and 0. All results of Bayesian logic can be derived by propositional logic and certain additional assumptions.
But these are qualitatively different. I don’t know how to describe this difference—which one is fundamental? I think both are, in different ways.
What’s missing is how you use it in real-world problems, I think. Because it is often the case that one formal system is contained in another, and the other is contained in the one. But formal systems are only useful if they model reality, or serve some other function. Classical logic is used more than intuitionistic logic—is this because it’s more useful? (Classical logic is more useful than Bayesian in the world of mathematics, but less in the real world, I think.)
I need to think more about this...
Is your additive example supposed to be the minimum number of 1s needed to represent n as a product or sum of 1s? It looks like that, but if so, 6 should be ((1+1)*(1+1+1)) and 11 should be (1+(1+1)*(1+1+1+1+1)). 12 is still an “inversion.”
ETA: A few other numbers seem to be off as a result of the mistake with 6. 30 and 42 also have more compact representations.
No, it’s one of the expressions with the minimum number of characters needed to represent the number, using this grammar:
EXP ::= ( EXP + EXP ) | ( EXP * EXP ) | ( NUM )
NUM ::= 1 | NUM + 1
As you can see, reasonable people can reasonably disagree on what is an appropriate grammar to use for minimum description length, even on arbitrary examples like this one.
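To make this concrete, here is a small script (my own sketch) that computes the minimum character count under exactly this grammar by dynamic programming, and confirms the first inversion mentioned in the post:

def shortest_lengths(limit):
    # best[n] = fewest characters in an expression for n under the grammar above
    best = {}
    for n in range(1, limit + 1):
        best[n] = 2 * n + 1                   # flat form "(1+1+...+1)": n ones, n-1 pluses, 2 parens
        for a in range(1, n):                 # "(A+B)" adds one operator and two parens
            best[n] = min(best[n], best[a] + best[n - a] + 3)
        for a in range(2, int(n ** 0.5) + 1):
            if n % a == 0:                    # "(A*B)"
                best[n] = min(best[n], best[a] + best[n // a] + 3)
    return best

lengths = shortest_lengths(45)
print(lengths[11], lengths[12])               # 23 vs. 19: 12 really is "simpler" than 11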
An observation about the number sequence:
...
3 == (1+1+1)
4 == (1+1+1+1)
5 == (1+1+1+1+1)
6 == (1+1+1+1+1+1)
7 == (1+1+1+1+1+1+1)
...
I would rewrite this as:
...
3 == (1+1+1)
4 == ((1+1)*(1+1))
5 == (1+1+1+1+1)
6 == ((1+1)*(1+1+1))
7 == (1+1+1+1+1+1+1)
...
Where the first inversion is at 4-5. The reason I do this is because of the relation to primes and to make the pattern seem more consistent. Then of course the next step would be exponent notation:
…
8 == ((1+1)^(1+1+1))
…
So the idea of choosing the correct method of patterning can change dramatically with small changes in rules, such that we can have:
8 == (1+1+1+1+1+1+1+1)
8 == ((1+1)*(1+1+1+1))
8 == ((1+1)^(1+1+1))
And the places of inversion will change according to method.
A question in an evolutionary context: Say the operators are symbols of evolutionary complexity, increasing from addition to multiplication to exponent; wouldn’t their use be situational? Given specific circumstances could one “8” be more valid than another “8”, even though the understanding is that they amount to the same thing? Is the pattern one method will make more valid than the pattern of another?
I can’t help but think that the method is dependent on the amount of space in which we have to create the pattern, that the addition method is ok for a large posting width, and small numbers, but as the numbers increase and/or the posting width decreases, method becomes more important, maybe even requiring the creation (evolution) of more operators for more methods …
A little off-topic, but you can greatly shorten your descriptions above. In the case where you’re just adding 1’s, you could just as well leave off the pluses or the 1’s. So 8 == 11111111. And you don’t need the parens, since the operations you’re actually performing always give higher precedence to addition than to multiplication. So then 8 == 11*11*11 == 11^111. And since you’re never multiplying by 1 or multiplying/adding zero, you can use pluses instead of 1’s. So:
And so on
I think removing the parentheses can affect the overall pattern formation.
12 == (((1+1)^(1+1))*(1+1+1))
12 != +^+*++
Because we are changing the order of operations:
1+1^1+1*1+1+1
2^2*3
2^6
64
Also, I don’t think there is a way to use exponents and make 12 without using parentheses.
So if we simplify the method of notation, then some operations cannot be used for certain numbers, and this changes the pattern, and/or the availability of possible patterns.
This topic is discussed in D. Hofstadter’s book Gödel, Escher, Bach: An Eternal Golden Braid.
Indeed, I hadn’t thought far enough ahead to worry about cases where order of operations would matter between exponentiation and multiplication. So parens would have to be acceptable where it’s ambiguous. What the default order of operations should be, can be left as an exercise for the reader.
As for your specific concern, if we left the order of operations as “+ then ^ then *”, +^+*++ would work fine.
We could keep the abbreviated syntax but avoid the necessity of parentheses by using postfix notation, couldn’t we?
For that, I think we’d need a stack with a line-termination character. In effect, we’d be removing two characters for one, which I suppose would be an improvement.
So:
The order of operations can be variable dependent on the number
The notation method affects the pattern
The pattern can change if primes are always used
Etc.
How many ways can we change the rules? How many rules are there?
It is amazing just how much variation in pattern can be achieved just by changing and/or adding/subtracting rules.
There is the general rule of rules:
The fewer the rules, the less variable the pattern; and inversely, the more rules, the more variable the pattern.
Sorry for the late edit, but I would also add that the fewer the rules, the less adaptable as well...
If there are a very large number of fundamental particles, it is likely that there is some reason for the number, rather than being a merely random number, and if so, then you would likely reach the truth quicker by using some version of Kolmogorov complexity, rather than counting the particles one by one as you discover them.
This is circular. You assume that the universe is likely to have low K-complexity and conclude that a K-complexity razor works well. Kelly’s work requires no such assumption, this is why I think it’s valuable.
Yes, “It is likely that there is some reason for the number” implies a low Kolmogorov complexity. But it seems to me that if you look at past cases where we have already discovered the truth, you may find that there were cases where there were indeed reasons, and therefore low K-complexity. If so that would give you an inductive reason to suspect that this will continue to hold; this argument would not be circular. In other words, as I’ve said previously, in part we learn by induction which type of razor is suitable.
But then you’ve got to justify induction, which is as hard as justifying Occam.
I have a justification for induction too. I may post it at some point.
Is this a “margin is too small to contain” type of thing? Because I would be very surprised if there were a philosophically airtight location where recursive justification hits bottom.
Cool.
Could you expand on this?
The only reason I can think of is that the particles have various qualities, and we’ve got all the possible combinations.
I assume that there is some range of numbers which suggests an underlying pattern—it’s vanishingly unlikely that there’s a significance to the number of stars in the galaxy.
I think there was something in Gregory Bateson about this—that there’s a difference between a number that’s part of a system (he was talking about biology, not physics) as distinct from “many”.
So you claim that a K-complexity prior will work better because our universe is likely to have low K-complexity. This is circular—a justification of Occam’s razor that takes for granted your own intuitive concept of Occam’s razor. Kelly’s work makes no such assumption, that’s why it looks valuable to me.