A Proof of Occam’s Razor
Related to: Occam’s Razor
If the Razor is defined as, “On average, a simpler hypothesis should be assigned a higher prior probability than a more complex hypothesis,” or stated in another way, “As the complexity of your hypotheses goes to infinity, their probability goes to zero,” then it can be proven from a few assumptions.
1) The hypotheses are described by a language that has a finite number of different words, and each hypothesis is expressed by a finite number of these words. Note that this allows for natural languages such as English, but also for computer programming languages and so on. The proof in this post is valid in all of these cases.
2) A complexity measure is assigned to hypotheses in such a way that there are, or may be, some hypotheses which are as simple as possible, and these are assigned the complexity measure of 1, while hypotheses considered to be more complex are assigned higher integer values such as 2, 3, 4, and so on. Note that apart from this, we can define the complexity measure in any way we like, for example as the number of words used by the hypothesis, or alternatively as the length of the shortest program which can output the hypothesis in a given programming language (e.g. the language of the hypotheses might be English while their complexity is measured according to a programming language; Eliezer Yudkowsky takes this approach in the linked article). Many other definitions would be possible. The proof is valid for every definition that satisfies the conditions laid out here.
3) The complexity measure should also be defined in such a way that there are a finite number of hypotheses given the measure of 1, a finite number given the measure of 2, a finite number given the measure of 3, and so on. Note that this condition is not difficult to satisfy; it would be satisfied by either of the definitions mentioned in condition 2, and in fact by any reasonable definition of simplicity and complexity. The proof would not be valid without this condition, precisely because if complexity were understood in such a way as to allow an infinite number of hypotheses at a single complexity level (for instance, infinitely many maximally simple hypotheses), the Razor would not be valid for that understanding of complexity.
The Razor follows of necessity from these three conditions. To explain any data, there will in general be infinitely many mutually exclusive hypotheses which could fit the data. Suppose we assign prior probabilities to all of these hypotheses. Given condition 3, it will be possible to find the average probability for hypotheses of complexity 1 (call it x1), the average probability for hypotheses of complexity 2 (call it x2), the average probability for hypotheses of complexity 3 (call it x3), and so on. Now consider the infinite sum “x1 + x2 + x3…” Since all of these values are positive (and non-zero, since zero is not a probability), either the sum converges to a positive value, or it diverges to positive infinity. In fact, it converges to a value no greater than 1, since if each term of the series were multiplied by the number of hypotheses with the corresponding complexity, the result would converge to exactly 1, because probability theory demands that the probabilities of all our mutually exclusive hypotheses sum to exactly 1.
Now, x1 is a finite real number. So in order for this series to converge, there must be only a finite number of later terms in the series equal to or greater than x1. There will therefore be some complexity value, y1, such that all hypotheses with a complexity value greater than y1 have an average probability of less than x1. Likewise for x2: there will be some complexity value y2 such that all hypotheses with a complexity value greater than y2 have an average probability of less than x2. Leaving the derivation for the reader, it would also follow that there is some complexity value z1 such that all hypotheses with a complexity value greater than z1 have a lower probability than any hypothesis with a complexity value of 1, some other complexity value z2 such that all hypotheses with a complexity value greater than z2 have a lower probability than any hypothesis of complexity value 2, and so on.
From this it is clear that on average, or as the complexity tends to infinity, hypotheses with a greater complexity value have a lower prior probability, which was our definition of the Razor.
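A numerical illustration may help. In the sketch below (plain Python; the prior, the class sizes, and the spike at complexity 50 are all made up purely for illustration), the prior is deliberately non-monotonic, and yet only finitely many complexity classes have an average probability at or above that of complexity 1, exactly as the argument requires.

```python
# Illustrative only: a made-up prior over hypotheses grouped by complexity.
def class_size(n):
    return n                       # any finite count per complexity class works

def weight(n):
    """Unnormalized weight of each hypothesis of complexity n.
    Deliberately non-monotonic: complexity 50 gets a large spike."""
    return 10.0 if n == 50 else 1.0 / (n ** 3 * class_size(n))

N = 500                            # finite truncation; the omitted tail is negligible here
Z = sum(weight(n) * class_size(n) for n in range(1, N + 1))

def avg_prob(n):                   # the x_n of the argument: class total / class size
    return weight(n) / Z

x1 = avg_prob(1)
exceeders = [n for n in range(1, N + 1) if avg_prob(n) >= x1]
print("complexity classes whose average matches or beats complexity 1:", exceeders)
print("beyond complexity", max(exceeders), "every average is smaller than x1")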
N.B. I have edited the beginning and end of the post to clarify the meaning of the theorem, according to some of the comments. However, I didn’t remove anything because it would make the comments difficult to understand for later readers.
Another way of putting this is that all probability distributions over an infinite enumerated set are biased towards elements that occur earlier, regardless of the principle of enumeration.
Certainly not regardless of the probability distribution though. Do you have a particular probability distribution in mind?
Regardless of the probability distribution.
If one has any assignment of probabilities to an infinite series of mutually exclusive hypotheses H1, H2, …, then for every epsilon > 0 there is an N such that every hypothesis after the Nth has probability less than epsilon. In fact, there is an N such that the sum of the probabilities of all the hypotheses after the Nth is less than epsilon.
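A minimal sketch of that claim, assuming the probabilities are given by a computable function of the index and sum to 1 (the 1/n^2 prior below is just a stand-in):

```python
import math

def tail_cutoff(prob, eps):
    """Smallest N such that the total probability of all hypotheses after the Nth
    is below eps (and hence so is each individual later hypothesis)."""
    running, n = 0.0, 0
    while 1.0 - running >= eps:
        n += 1
        running += prob(n)
    return n

# Stand-in prior proportional to 1/n^2, normalized so it sums to 1.
prior = lambda n: (6.0 / math.pi ** 2) / n ** 2
print(tail_cutoff(prior, eps=0.01))   # after this N, the whole tail is under 1%
```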
But N could be 3^^^3, which does an injury to the term “early” in my book. E.g., I could swap the probabilities p(x3^^^3) and p(x1).
Indeed you could, but that problem is already present in the definition of Kolmogorov complexity. It’s only defined up to an arbitrary additive constant determined by (in one formulation) the choice of a universal Turing machine. The Kolmogorov complexity of a string is the size of the shortest input for that UTM that produces that string as output, but there’s nothing in the definition to prevent the UTM from having any finite set of arbitrary strings on speed-dial.
Kelly deals with this by looking at complexity from other angles. For example, a complex world can give you a long sequence of observations persuading you that it’s a simple world and then suddenly “change its mind”, but a simple world cannot pretend that it’s complex.
Why not? It would look almost exactly like the complex worlds imitating it, wouldn’t it?
Hmm… maybe I was reading your claim as stronger than you intended. I was imagining you were claiming that property would hold for any finite enumerated subset, which clearly isn’t what you meant.
If the sum of every term in a sequence after the Nth one is less than epsilon, then the sum of every term in any subsequence after the Nth one is also less than epsilon.
Right, but that isn’t what I meant—it is not necessarily the case that for every n, every hypothesis after the nth has probability less than that of the nth hypothesis. Obviously—which is why I should have noticed my confusion and not misread in the first place.
Yes, regardless of the distribution.
And yet another way is this: every computable enumeration of all programs is equivalent to a UTM, and therefore generates the same measure of Kolmogorov complexity, up to the usual additive constant.
Proof: Computability of the enumeration means that there is a program which takes any integer N and returns the Nth in that enumeration. Represent this program as a Turing machine and cascade it with a second Turing machine that executes the program returned by the first TM.
Is this a novel observation? Does it go any distance towards answering Shalizi’s question at the end of this note?
This assumes countable additivity. Otherwise, each basic event could have probability zero, though the whole has probability 1.
I’m inclined to use countable additivity, but I couldn’t give you arguments for it, other than by sketching models that violate it and pointing out that they are weird.
Yes, exactly.
This is the best way of putting it.
I agree with you that this is a good way of putting it (but don’t believe that ‘it’ is correct).
Your proof can’t be right because it doesn’t use the concept of “complexity” in any non-trivial manner. If we replace “complexity” with “flubbity”, so that there’s only a finite number of hypotheses for any given “flubbity”, your proof will still go through.
For some actual work on justifying Occam’s Razor, see Kevin T. Kelly’s Ockham Efficiency results. Kelly unpacks “complexity” in a nontrivial way that isn’t just description length.
ETA: see this earlier discussion about learning and simplicity, it’s very relevant to the topic of your interest.
Consider a creature with a prior of the type described in this post. Then there is some concept of “foobity”, such that hypotheses with higher “foobity” are assigned smaller weights. The creature will find that it follows Occam’s Razor, if it does not have a separate concept of “complexity” such that “complexity” != “foobity”. But why would it, unless there was some reason for that to be the case? “Typically” there would be no reason for the creature to have an explicit concept of complexity that’s different from the concept of complexity implicit in its prior.
Assuming that’s the case, its concept of complexity may still seem very strange and convoluted to us or to some other creature, but to that creature it will appear to be perfectly natural, since there’s nothing else to judge it by.
Could we just be such a creature? Intuitively, the answer seems to be no. Our concept of complexity seems to be natural in some absolute, objective sense, and not just relative to itself. But why is it so hard to pin that down?
Circular/anthropic arguments are seductive, but invariably turn out to be flawed because they predict less order and more narrowly-averted chaotic doom than we actually observe. Compared to our value system, which is genuinely a product of many evolutionary accidents, our concept of complexity is too simple because it can be captured (albeit imperfectly) by Turing machines. In other words, a creature using a randomly-evolved concept of “foobity” wouldn’t be able to approach it with simple math, as we do.
I think it’s a mistake to reach for the anthropic hammer whenever you see a confusing problem. The most extreme example was when Nesov produced a “proof” that particle physics is a Darwinian construct, after which (thankfully) the absurdity heuristic finally kicked in.
But what do you mean by “simple” math? Simple according to what, if not “foobity”?
ETA: I looked at Nesov’s comment about particle physics, and didn’t understand it. Can you explain?
Re ETA: Nesov said that particle physics are this way because you only care about the worlds where they are this way. Just like your explanation of probabilities. :-)
A correction (though I mixed that up in comments too): what we anticipate is not necessarily linked to what we care about. Particle physics is this way because we anticipate worlds in which it’s this way, but we may well care about other worlds in which it isn’t.
Anticipation is about what we can control (as evolution saw the possibility, based on the past in the same world), not what we want to happen. Since evolution is causal, we don’t anticipate acausal control, but we can care about acausal control.
The useful conclusion seems to be that the concept of anticipation (and hence, reality/particle physics) is not fundamental in the decision theory sense, it’s more like the concept of hunger: something we can feel, can have accurate theories about, but doesn’t answer questions about the nature of goodness.
Don’t know about you, but I anticipate acausal control, to a degree. I have a draft post titled “Taking UDT Seriously” featuring such shining examples as: if a bully attacks you, you should try to do maximum damage while disregarding any harm to yourself, because it’s good for you to be predicted as such a person. UDT is seriously scary when applied to daily life, even without superintelligences.
I don’t think UDT (or a variant of UDT that applies to humans that nobody has really formulated yet, because the original UDT assumed that one has access to one’s source code) implies this, because the difference between P(bully predicts me as causing a lot of damage | I try to cause maximum damage) and P(bully predicts me as causing a lot of damage | I don’t try to cause maximum damage) seems quite small (because the bully can’t see or predict my source code and also can’t do a very good job of simulating or predicting my decisions), while the negative consequences of trying to cause maximum damage seem quite high if the bully fails to be preemptively dissuaded (e.g., being arrested or sued or disciplined or retaliated against).
(Not sure if you still endorse this comment, 9 years later, but I sometimes see what I consider to be overly enthusiastic applications of UDT, and as the person most associated with UDT I feel an obligation to push against that.)
Can you post this in the discussion area?
You seem to be mixing up ambient control within a single possible world with assignment of probability measure to the set of possible worlds (which anticipation is all about). You control the bully by being expected (credibly threatening) to retaliate within a single possible world. Acausal control is about controlling one possible world from another, while ambient (logical) control is about deciding the way your possible world will turn out (what you discussed in the recent posts).
More generally, logical control can be used to determine an arbitrary concept, including that of utility of all possible worlds considered together, or of all mathematical structures. Acausal control is just a specific way in which logical control can happen.
Yep. I can’t seem to memorize the correct use of our new terminology (acausal/ambient/logical/etc), so I just use “acausal” as an informal umbrella term for all kinds of winning behavior that don’t seem to be recommended by CDT from the agent’s narrow point of view. Like one-boxing in Newcomb’s Problem, or being ready to fight in order to release yet-undiscovered pheromones or something.
“Correct” is too strong a descriptor, it’s mostly just me pushing standardization of terminology, based on how it seems to have been used in the past.
Sorry, I misparsed your comment and gave a wrong answer, which I then deleted.
Your original comment was trivially correct, and my reply missed the point. We can never justify our concept of complexity by thinking like that—linguistically—because this would be like trying to justify our prior with our prior, “a priori”. If my prior is based on complexity and Bob’s prior is based on foobity (religion, or whatever), we will find each other’s priors weird. So if you ask whether all imaginable creatures have to use our concept of complexity, the easy answer is no. Instead we look at the outside world and note that our brand of razor seems to work. When it doesn’t (religion, or whatever), we update it. Is there any other aspect to your question that I missed?
Let’s call our brand of razor together with the algorithm we use to update it (using what we see from the outside world) our “meta-razor”. Now is this “meta-razor” just a kind of “foobity”, i.e., an arbitrary notion that we just happen to have, or is there something objective about it?
I spent some time thinking about your question and cannot give an answer until I understand better what you mean by objective vs arbitrary.
The concept of complexity looks objective enough in the mathematical sense. Then, if I understand you correctly, you take a step back and say that mathematics itself (including logic, I presume?) is a random concept, so other beings could have wildly different “foomatics” that they find completely clear and intuitive. With the standards thus raised, what kind of argument could ever show you that something is “objective”? This isn’t even the problem of induction, this is… I’m at a loss for words. Why do you even bother with Tegmark’s multiverse then? Why not say instead that “existence” is a random insular human concept, and our crystalloid friends could have a completely different concept of “fooistence”? Where’s the ground floor?
Here’s a question to condense the issue somewhat. What do you think about Bayesian updating? Is it “objective” enough?
Perhaps asking that question wasn’t the best way to make my point. Let me try to be more explicit. Intuitively, “complexity” seems to be an absolute, objective concept. But all of the formalizations we have of it so far contain a relativized core. In Bayesian updating, it’s the prior. In Kolmogorov complexity, it’s the universal Turing machine. If we use “simple math”, it would be the language we use to talk about math.
This failure to pin down an objective notion of complexity causes me to question the intuition that complexity is objective. I’d probably change my mind if someone came up with a “reasonable” formalization that’s not “relative to something.”
Implementable on a machine during my lifetime. That’s got to be an objective property, at least I’m wondering how you will spin it to sound subjective :-)
Edit: whoops, sorry, this is wrong. Don’t bother answering.
Short computer programs, compared to the ones that would encode our concepts of “beauty” or “fairness”, say.
The fact that the proof uses only very weak properties of the notion of “complexity” does not show that the proof is invalid. In fact, it suggests (not with certainty) the opposite: that an even stronger result could be proven.
Thanks and apologies, I was being careless when I said the proof “couldn’t be right”. It may be formally right, but fail to explain why we should use Occam’s Razor instead of the Flubby Razor.
Yes a statement like “on average, X holds” needs to be quantitative in order to be useful in practice. Still, this is a valuable argument, especially as far as it clarifies what needs to be done to reach a quantitative refinement.
The proof does allow for different Razors depending on your specific definition of complexity. It’s true that this makes the statement fairly weak. But it is also in a way a virtue, because it corresponds with our actual use of the Razor. There are in fact different definitions of complexity and we use different ones in different contexts; we learn from experience which contexts require which Razors. For example, if you want to describe the path of a ball moving through the air, you expect to be able to describe it with some fairly simple mathematics, and you think this more likely than complicated mathematical descriptions.
On the other hand, if you see a precise human-shaped footprint in the mud in your yard, saying “a human did it” is very complicated mathematically, since the mathematical description would include the description of a human being. So in this way, “Wind and weather did it” should be simpler, in the mathematical way. Nonetheless, you use a different Razor and say that with this Razor, it is simpler to say that a human being did it.
The last paragraph feels wrong to me. This is the explanation for that feeling that came to mind first, but I’m not positive it’s my true objection:
If your only observation were the footprint, sure—but your observations also include that you live on a planet with 7 billion people who leave footprints like that—so despite the comparative algorithmic simplicity of the weather model, it doesn’t match the observations as well.
Yes, it is likely that a stronger result is possible. However, it is difficult to make it stronger while satisfying my conditions (true in all possible worlds; true according to every logically consistent assignment of priors).
Interestingly, flubbity will correlate with complexity, regardless of how you define it. This is for pretty much the same reason as the inverse correlation of complexity and probability.
I don’t understand your objection. Yes, it will apply to other attributes as well. I don’t see how that prevents it from applying to complexity.
If the objection is that I didn’t describe complexity in detail, this doesn’t matter, precisely because the proof will still go through regardless of how much or little is added.
It does prevent the proof from applying to complexity. Which hypothesis would you choose: the simpler one, or the less flubby one?
On average you should choose the simpler of two hypotheses compared according to simplicity, and again, on average you should choose the less flubby of two hypotheses compared according to flubbiness.
Or to put this in more real terms: the Razor will be true if it is understood to mean that you should choose the hypothesis that can be described by the shorter program, and it will also be true that you should choose the hypothesis that can be described by a shorter English description.
Yes, these can on particular occasions be opposed to one another, and you would have to ask which rule is better: choose the shorter English description, or choose the shorter program? My proof does not answer this question, but it doesn’t have to, because both rules measure some kind of complexity, and the Razor is true whether it is taken in one way or the other.
Flubbity may not have much to do with complexity. In fact it can be opposed to complexity, except in the limit for extremely complex/flubby hypotheses. For example, you may say that flubbity=1000000-complexity for complexity<1000000, and flubbity=complexity elsewhere. Your proof will go through just fine, but in our world (which probably doesn’t need such huge hypotheses) it will lead to the opposite of Occam’s Razor. You don’t always have the luxury of letting your parameter go to infinity.
By Occam’s razor?
The proof follows on average. Naturally you can construct artificial examples that make fun of it, but there can be no proof of the Razor which is not based on averages, since in fact, it happens on various occasions that the more complex hypothesis is more correct than the simpler hypothesis.
I don’t object to the formal correctness of the proof, but the statement it proves is way too weak. Ideally we’d want something that works for complexity but not flubbity. For any Occamian prior you care to build, I can take the first few hypotheses that comprise 99% of its weight, build a new prior that assigns them a weight of 1e-20 combined, and claim it’s just as good as yours by Occamian lights.
If we removed the words “on average” from the formulation of your theorem, we’d have a stronger and more useful statement. Kelly’s work shows an approach to proving it not just “on average”, but for all possible hypothesis lengths.
ETA: I apologize for not objecting to the formal side of things. I just read the proof once again and failed to understand what it even means by “on average”.
I started reading some of Kelly’s work, and it isn’t trying to prove that the less complex hypothesis is more likely to be true, but that by starting from it you converge on the truth more quickly. I’m sure this is right but it isn’t what I was looking for.
Yes, the statement is weak. But this is partly because I wanted a proof which would be 1) valid in all possible worlds ; 2) valid according to every logically consistent assignment of priors. It may be that even with these conditions, a stronger proof is possible. But I’m skeptical that a much stronger proof is possible, because it seems to be logically consistent for someone to say that he assigns a probability of 99% to a hypothesis that has a complexity of 1,000,000, and distributes the remaining 1% among the remaining hypotheses.
This is also why I said “on average.” I couldn’t remove the words “on average” and assert that a more complex statement is always less probable without imposing a condition on the choice of prior which does not seem to be logically necessary. The meaning of “on average” in the statement of the Razor is that in the limit, as the complexity tends to infinity, the probability necessarily tends to zero; given any probability x, say 0.000001 or whatever, there will be some complexity value z such that all statements of complexity equal to or greater than z have a probability less than x.
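In symbols, with xn again standing for the average prior probability of hypotheses of complexity n and P(h) for the prior probability of an individual hypothesis h, the claim being defended is just:

```latex
\forall x > 0 \;\exists z \;\forall h :\; \mathrm{complexity}(h) \ge z \;\Rightarrow\; P(h) < x,
\qquad \text{and equivalently for the averages} \qquad \lim_{n \to \infty} x_n = 0 .
```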
I will read the article you linked to.
Why do you want the theorem to hold for every logically consistent prior? This looks backwards. Occamian reasoning should show why some prior distributions work better than others, not say they’re all equally good. For example, the Solomonoff prior is one possible formalization of Occam’s Razor.
Because for every logically consistent prior, there should be a logically possible world where that prior works well. If there isn’t, and you can prove this to me, then I would exclude priors that don’t work well in any possible world.
I want it to apply to every possible world because if we understand the Razor in such a way that it doesn’t apply in every possible world, then the fact that Razor works well is a contingent fact. If this is the case there can’t be any conclusive proof of it, nor does it seem that there can be any ultimate reason why the Razor works well except “we happen to be in one of the possible worlds where it works well.” Yes, there could be many interpretations which are more practical in our actual world, but I was more interested in an interpretation which is necessary in principle.
This is even more backwards. There are logically possible worlds where an overseer god punishes everyone who uses Bayesian updating. Does this mean we should stop doing science? Looking for “non-contingent” facts and “ultimate” reasons strikes me as a very unfruitful area of research.
Different people have different interests.
How do you know when that happens?
My point is that if someone has a higher prior for the more complex hypothesis which turns out to be correct, you cannot object to his prior, saying “How did you know that you should use a higher prior?”, since people do not justify their priors. Otherwise they wouldn’t be priors.
A major use (if not the whole point) of Occam’s razor is to have a rational basis for priors.
If people don’t have to justify their priors, then why have a process for generating them at all?
If I create an encoding with ‘God’ as a low complexity explanation, would you say I am being rational?
But the point of my question above was that you find out that the more complex hypothesis is correct when you get evidence for it. Juggling your priors is not the way to do it. (In fact, it probably invites accidentally counting evidence twice.)
This is spot-on. Furthermore, such a lax definition indicates that certain hypotheses will have “probability” zero. If our language is English, the explanation, “PUJF;FDAS!;FDS?” could be assigned probability zero. While this does not guarantee that the set of possible explanations is finite, neither does the author prove that the set of nonzero possibilities is infinite, and if it is not this proof is largely useless.
Also, interestingly, the rules do not address the, “A wizard did it!” problem with complexity, though that is likely far beyond the attempted scope.
This really doesn’t seem to have formally proven anything, and so at least the title should not be “a proof of Occam’s razor”. And it doesn’t seem like you’ve done anything more interesting or useful here than, say, Solomonoff induction, which can already be seen as a formalization of Occam’s razor.
Maybe if you made your proof more rigorous and explained why you think the statement you’re trying to prove is relevant to anyone (or should properly be called Occam’s razor) then that would help.
To expand on cousin_it’s point, this is way too weak because it allows for arbitrary encodings, like a complexity value of 1 for “The lady down the street is a witch; she did it.”
And if you start with that complexity prior, it’s difficult to see how evidence would rescue you from it.
I don’t see how you can exclude that encoding without additional stipulations that seem arbitrary to me.
The fact that you assign something a complexity value of 1 doesn’t mean you assign it a high probability. I allow for the possibility that some more complex hypotheses will be more probable than some less complex hypotheses. And even if the hypothesis of complexity 1 is the most probable, it may only have a probability of 75% or something. You could easily be rescued from this by ordinary Bayesian evidence. In fact this will still be true even if you assign it a probability of 99.99%. In order to put yourself beyond the possibility of being rescued by evidence, you would have to try very hard.
It’s true that other hypotheses better predicting the data would still overcome it in time. But a rigged encoding would still allow for theories which add an epiphenomenal entity to count as less complex; and we definitely don’t want Occam’s Razor to do that. I know there’s no way to guarantee that, say, the universal Turing machine complexity is “fair”, but you can’t dodge the necessity of putting some restrictions on the encoding.
As I said to cousin it, the proof allows for different Razors depending on the specific definition of complexity. This is actually a good thing because we can learn from experience which definition to use in various contexts.
Of course experience doesn’t directly say whether or not epiphenomenal entities exist, but we do discover by experience that the “number of entities” understanding of the Razor often works well, that is, that hypotheses positing fewer entities are more probable than ones positing more. This would lead us to adopt a Razor that does not count the addition of epiphenomena as less complex.
Would you agree that even if such restrictions are desirable in practice, there is no way to add such restrictions without adding something which is not logically necessary? The original proof is intended to be logically necessary.
At which point, all the difficulty in deciding between theories is transferred to deciding between encodings. Yes we can learn over time which encodings work well in different domains, but at that point, you’re really just putting chunks of theory into your encoding choice.
There is a place for this kind of shorthand: I shouldn’t have to model people in detail to form a hypothesis explaining the presence of a building in a field. But this is because I know people exist from vast amounts of other data, so I have a justification for the shorthand of ‘a human did it’ in my encoding. It should have a high complexity cost, but it comes packaged with tons of evidence so in most cases it’s practically free.
I think the treatment of complexity is OK in this argument. However, I don’t think it proves Occam’s Razor. With the possible exception of set theory, the complexity of an explanation of any phenomenon is always finite. (Counterexample if you disagree?) There is no finite integer which is ‘almost’ infinite.
So, given xn—the average probability among explanations of complexity n—it’s quite possible for there to be a (finite) positive number of integers m > n for which xm > xn.
Yes, I agree with the possibility you mention. I don’t think that means it doesn’t prove Occam’s Razor because I don’t think the Razor means that every simpler hypothesis is more probable than every more complex hypothesis. Other things might affect your priors besides how complex the hypothesis is.
I don’t think you’ve stated the theorem you are trying to prove precisely. I think that would help.
As the complexity of your hypotheses tends to infinity, their probability tends to zero. Still, that doesn’t mean that each and every increase of complexity decreases the probability.
I upvoted the article, because I like this definition (i.e., it corresponds well to how I think of Occam’s Razor), and I think you’ve proven it. However, I agree with Oscar Cunningham that the statement at the top of your article is different, and you didn’t prove it. In particular, I don’t like the phrase ‘on average’.
I edited the article.
State that this is what you’re trying to prove at the top of your main post. People are downvoting you because you haven’t proved what they see as Occam’s razor.
I edited the article.
It seems like what you’re saying is: “If we have an infinite number of hypotheses, some of them must have infinitesimally small prior probability”. That doesn’t require a proof.
(I upvoted this article in spite of not learning much from it, since I love this kind of discussion. Why not read some papers about PAC learning, MDL, Algorithmic Info. Theory, or VC theory and compare/contrast for us the view of Occam implicit in each development?)
Clearly as complexity goes to infinity the probability goes to 0. (i.e. For any epsilon>0 there exists n such that all hypotheses with complexity>n have probability less than epsilon. (since otherwise there would be infinitely many hypotheses with probability >epsilon, taking 1/epsilon of them would produce a collection of hypotheses with probability>1))
So as the post says, “on average” Occam’s razor must be true. (As must any other Razor using a different measure of complexity, or flubbity, or whatever. However, message length seems to me to be the only “nice” measure to use on hypotheses.) But it doesn’t say that the simplest hypotheses have to be ranked in complexity order. Of course it would be elegant if they were.
Why should we favour the elegant explanation?
Occam’s razor of course!
Well taking this to be true, it seems like we could even construct an ordering where we enumerate algorithms based on something that maps to inverse complexity, so this would prove the opposite of Occam’s razor.
But I can’t think of any such ordering, and I’ve got a proof kicking around in my head that there can be no such ordering, so this hunch is probably wrong.
For any given inverse complexity, infinitely many algorithms have a lower inverse complexity. This seems to break any attempt to enumerate in that order, or assign decreasing nonzero probabilities that sum to 1.
Indeed there is no such ordering.
The part about there being sets of infinitely many mutually exclusive hypotheses needs to be marked as an assumption, since I can construct languages where it does not hold. Also, you need to strengthen this assumption by saying that every hypothesis is part of at least one such set.
You also need to somehow get from “the probability of a statement decreases monotonically with its complexity” to “the probability of a statement decreases exponentially with its complexity”.
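For instance (a made-up prior, purely to mark that gap): a prior proportional to 1/(n(n+1)) over an enumeration of hypotheses decreases monotonically and sums to 1, yet it falls off only polynomially, nothing like the exponential fall-off of a Solomonoff-style 2^-length prior.

```python
# Hypothetical prior over an enumeration H_1, H_2, ...: monotonically decreasing
# and summing to exactly 1 (the series telescopes), but only polynomially small.
prior = lambda n: 1.0 / (n * (n + 1))
print(sum(prior(n) for n in range(1, 1_000_000)))   # -> 0.999999, approaching 1
print(prior(10), prior(100), prior(1000))           # shrinks like 1/n^2, not 2^-n
```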
Maybe this has already been considered, but has anyone noted that even without introducing complexity at all, you can at least say that some programs are more likely than others, simply because there is no uniform distribution on a countably infinite set? No matter which probability measure you choose or how you label the set, there can be at most a finite number of programs with the same probability.
Hopefully there’s someone still here after 10 years XD.
Oh look, if we define the complexity as “the date when the hypothesis was published”, then I can say that the prior probability that our earth stands on top of a whale, on top of a turtle, on top of an elephant is the highest, because this hypothesis is the oldest. And Occam’s razor becomes “don’t propose new hypotheses”. Trinitrotrololol)
I find it funny, that it works even in continuous case: suppose that we have probability density defined in R^n (or any other set). Then whatever bijection F:R <--> R^n we apply, the integral of probability density on that path should converge, therefore p(F(x)) goes to zero faster than 1/x. :)
Also, look: suppose the “real” universe is a random point x from some infinite set X. Let’s say we are considering a finite set of hypotheses H. The probability that a random hypothesis h ∈ H is closest to x is 1/|H|. So the larger H is, the less likely it is that any particular point from it is the best description of our universe! Which gives us Occam’s razor in terms of accuracy, instead of correctness, and works for uncountable sets of universes.
And in this case it is almost surely impossible to describe universe in a finite amount of symbols.
I don’t think this proves anything at a practical level, given that it requires us to deal with hypotheses of arbitrarily high complexity. It breaks down when dealing with finite values.
Suppose there is a finite complexity value W, beyond which no more complex hypotheses can actually be imagined by any being in our universe. Physics implies that there is such a W. Nothing in your argument implies that z1<W, though.
As I said to cousin it, I wanted a proof which would be valid in all possible worlds and valid according to every logically consistent assignment of priors, and for this reason, yes, in some possible worlds things would be as you say.
x1, x2, … is not the set of probabilities of all our mutually exclusive hypotheses. Each xi is the average over an infinite number of hypotheses to explain some particular set of data. Each xi would thus be zero, just as the probability that a real-valued random number between 0 and 1 comes up exactly 0.5 is zero.
This makes the proof fail. It’s interesting, though. But I don’t think you can save it. Basically, you have an infinite set of mutually-exclusive hypotheses, and you want to label them with an infinite series that sums to 1. But you can’t relate an infinite convergent series to the order of your hypotheses, unless the complexity n of the hypothesis is a term in the formula for its probability—and, if it is, your approach will require that this formula decreases with n. So your approach requires presuming what you’re trying to prove.
I’m also bothered by the idea that hypotheses are mutually exclusive. Having an explanation of complexity n does not prevent having an explanation of complexity n+1. If your proof applies only to a set of hypotheses pre-selected for being mutually exclusive, this set could have a very different distribution than the set of hypotheses that a would-be user of Occam’s razor faces in the wild.
There does not appear to be any a priori ordering of probability by complexity value, and unless you assume this the conclusion is improperly drawn. If you assume it, the argument is circular.
If the issue I’m missing is that the conclusion should read:
Then that would explain my confusion. It does make the conclusion pretty weak, as it’s unclear if infinite possibilities even exist. If they don’t, the proof collapses.
I’d written out more reasoning, but I think this works best via counterexample. Consider:
A. .05, .04, .03, ..., 1, m, n,
B. .03, .04, .05, ..., n, n, n
n > m, and for all Xa with a not equal to n or m, Xa < .03.
I may misunderstand the system you’re using, but it looks to me like there is nothing preventing n and m from taking any arbitrary value. In other words, any hypothesis of any arbitrary complexity could conceivably have a higher average probability than X1, principally because our definition of complexity is completely undefined. At some point, hypotheses must start declining in probability, but that does not prevent there from being a spike at complexity level 1,000,000, and it does not prevent complexity level 5 being more likely than complexity level 1.
Assume … is identical in all cases, and each member of … < x1, x2, and x3. If I understand the system correctly, in this case, hypotheses of complexity 2 would have a higher probability than those with complexity 1, which contradicts the conclusion, which is that hypotheses with a greater complexity value have a lower prior probability. Because the definition of complexity is vacuous, this could start at any particular n.
In something more like natural language, simply because x1 is higher than some infinite subset of x’s, that does not mean it is higher than all X’s. Xn could be greater than X1 for any arbitrary value of n. I don’t see anything that says x1 must be the greatest x, and if it is not, the conclusion appears false. There is some level of complexity that is, on average, better than every other, but there is no particular reason it must be the lowest level of complexity. Z1 could be equal to just about any natural number.
I would be totally unsurprised if I missed something here, so please correct me if I have misunderstood the proof.
Hmmm… I’d say that the problem is formulated in a way that allows for some maths, but in the process loses some of its essence.
For example, the ‘proof’ only talks about the length of the hypothesis, because that can be easily quantified. However, that does not rule out very short pseudo-explanations like ‘god did it’ or ‘it’s magic’. To rule out those, in practice one would need to introduce some non-mathematical language about what language is acceptable—but that would make it very hard to reason about it mathematically.
The usual solution to this problem is actually just the opposite: introduce an even more mathematical language, which avoids bringing in pre-cached complex-but-simple-seeming notions like “god” and “magic” so that the message length is more meaningful. Turing machine instructions are a popular choice.
More on this here.
As some commentators didn’t like or couldn’t understand the original phrasing of the theorem, I have added a phrase interpreting this. I didn’t remove the phrase because it would make the previous comments difficult to understand.
Occams’s Razor is based on empirical observations.
You can’t prove it—indeed, you can easily imagine worlds where it is not true.
Really?
Define the “simplicity” of a hypothesis as 1/”complexity”. If you can construct a probability distribution based on the “simplicity” of a hypothesis, then, by the same argument as above, you can prove that you should prefer the hypothesis with less “simplicity”.
There are a finite number of hypotheses less complicated than any given hypothesis, and an infinite number of hypotheses more complicated than any given hypothesis. If there were only a finite number of hypotheses more complicated than any given hypothesis, the theorem would fail.
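To spell out why the reversal fails: if the average probability were non-decreasing in complexity, every term would be at least x1 > 0, and summing over the infinitely many complexity classes would give

```latex
\sum_{n=1}^{\infty} x_n \;\ge\; \sum_{n=1}^{\infty} x_1 \;=\; \infty ,
```

contradicting the fact that this sum is at most 1.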
I cannot help but get the impression that this conclusion relies on doing nasty things with infinity. In particular it seems to take all of the following assumptions:
0 can not be a probability.
There can be an infinite number of probabilities in a probability distribution.
All the probabilities in a distribution sum to 1.
… and then play them off against each other. If I was one of those high complexity theorems I’d be feeling fairly violated right about now. In fact I can recall someone using similar reasoning in a discussion with Eliezer (it involved throwing darts at a continuous real valued dart board). It ended with Eliezer questioning “0 can not be a probability” in the case of infinite sets. I don’t recall if he also questioned infinite set atheism.
Those things aren’t opposed to one another. It is perfectly possible for me to choose an integer with such a probability distribution: 1⁄2 for the number 1; 1⁄4 for the number 2; 1⁄8 for the number 3; and so on. The probabilities will sum to 1, there will be an infinite number of probabilities, and none will be zero. It is perfectly reasonable.
Also, it will not fundamentally change the theorem if some of the probabilities are allowed to be zero.
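A quick check of the distribution described above:

```python
# P(n) = 2^(-n) for n = 1, 2, 3, ...: infinitely many outcomes, none zero, summing to 1.
probs = [2.0 ** -n for n in range(1, 60)]
print(sum(probs))          # the partial sum is already 1 to within floating-point error
print(min(probs) > 0)      # every listed term is strictly positive
```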
That is a good point. My objection is clearly incompletely expressed. It is my impression that your reasoning relied on forcing the probability distribution to follow tricks like the above in order to rule out “The distribution is just uniform. That’s it. They’re all the freaking same probability.” That is, I expected you to forbid a distribution made up of 1/infinity while insisting that the distribution is infinitely large by saying “but zero isn’t a probability”.
Basically, if you say there are infinitely many options then I say each option is allowed to be 1/infinity and that infinity times 1 divided by infinity is 1. Yes, it looks like adding together lots of zeros to get one, but hey, if you mess with infinity then expect that sort of thing. I’m sure there are special mathematical ways to refer to the “1/infinity” kind of zero but I just say that if you can have infinitely many things you can have infinitely many infinitely small identical things.
Really? That is interesting. That’s an even stronger claim than the one you made in the post. You are confident that it holds even when you remove the “since the probabilities can not be zero” part?
You can have a uniform distribution over a continuous space, but not over integers. The reason is that probability measures are supposed to be countably additive, so if each integer has weight 0, they all together must have weight 0. To make a uniform distribution over infinitely many hypotheses, you need them to be an uncountable set (e.g. parameterize them somehow by real numbers between 0 and 1). This is exactly what happens when your hypotheses talk about an unknown real parameter: each individual value has probability 0, but an interval of values has positive probability that you obtain by integrating the prior over that interval. If you can extend this idea to hypotheses indexed by complexity, this will be a Major Breakthrough, but I think it’s very hard.
I think it’s worth adding that while you can have a uniform distribution over some continuous spaces, like the real numbers between 0 and 1, you can’t have a uniform distribution over all real numbers.
How can you do that? I thought 0-1 had the same cardinality as all reals, and that the cardinality of all reals is too high to have a uniform distribution.
I’m completely flummoxed by the level of discussion here in the comments to Unknowns’s post. When I wrote a post on logic and most commenters confused truth and provability, that was understandable because not everyone can be expected to know mathematical logic. But here we see people who don’t know how to sum or reorder infinite series, don’t know what a uniform distribution is, and talk about “the 1/infinity kind of zero”. This is a rude wakeup call. If we want to discuss issues like Bayesianism, quantum mechanics or decision theory, we need to take every chance to fill the gaps in our understanding of math.
To answer your question: [0,1] does have the same cardinality as all reals, so in the set-theoretic sense they’re equivalent. But they are more than just sets: they come equipped with an additional structure, a “measure”. A probability distribution can only be defined as uniform with regard to some measure. The canonical measure of the whole [0,1] is 1, so you can set up a uniform distribution that says the probability of each measurable subset is the measure of that subset (alternatively, the integral of the constant function f(x)=1 over that subset). But the canonical measure of the whole real line is infinite, so you cannot build a uniform distribution with respect to that.
After you’re comfortable with the above idea, we can add another wrinkle: even though we cannot set up a uniform distribution over all reals, we can in some situations use a uniform prior over all reals. Such things are called improper priors and rely on the lucky fact that the arithmetic of Bayesian updating doesn’t require the integral of the prior to be 1, so in well-behaved situations even “improper” prior distributions can always give rise to “proper” posterior distributions that integrate to 1.
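A small numerical sketch of that last point (my own toy example, not anything from the thread): combine an improper flat prior over the whole real line with a normal likelihood for a single observation, and the posterior comes out proper.

```python
import numpy as np

# Improper flat prior on the real line, normal likelihood: observation x0 = 2.0,
# known sigma = 1.0. The prior has no finite integral, but the posterior does.
x0, sigma = 2.0, 1.0
theta = np.linspace(-10.0, 10.0, 200001)
dtheta = theta[1] - theta[0]
unnormalized = np.exp(-0.5 * ((x0 - theta) / sigma) ** 2)   # flat prior x likelihood
posterior = unnormalized / (unnormalized.sum() * dtheta)    # normalize numerically
print(posterior.sum() * dtheta)        # ~= 1.0: a proper probability density
print(theta[posterior.argmax()])       # peaks at the observation, ~= 2.0
```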
Yeah, it might help to make a list of “math recommended for reading and posting on Less Wrong.” Unfortunately, the entire set is likely large enough that even many physics majors won’t have all of it (lots of physics people don’t take logic or model theory classes). At this point the list of math topics seems to include Lebesgue integration, Gödel’s theorems and basic model theory, the basics of continuous and discrete probability spaces, and a little bit of theoretical compsci ranging over a lot of topics (both computability theory and complexity theory seem relevant). Some of the QM posts also require a bit of linear algebra to actually grok well, but I suspect that anyone who meets most of the rest of the list will have that already. Am I missing any topics?
Not all LW participants need to know advanced math. Human rationality is a broad topic, and someone like Yvain or Alicorn can contribute a lot without engaging the math side of things. What I care about is the signal-to-noise ratio in math discussion threads. In other words, I’m okay with regular ignorance but hate vocal ignorance with a fiery passion.
I propose a simple heuristic: if you see others using unfamiliar math, look it up before commenting. If it relies on other concepts that you don’t understand, look them up too, and so on. Yes, it might take you days of frantic digging. It sometimes takes me days full-time, even though I have a math degree. Time spent learning more math is always better invested than time spent in novice-level discussions.
That seems valid, but part of the trouble also seems to be people thinking they understand math that they don’t after reading popularizations. I’m not sure how to deal with that other than just having those individuals read the actual math.
We need merely a norm to tell people to stop, not a magic way of explaining math faster than possible. Also, more realistic timeline for grokking (as opposed to parsing) deeper concepts is months, not days, and that’s if you are smart enough in the first place.
Agreed. I have no idea either.
Well, if someone posts something wrong, call them out on it.
Empirically that doesn’t seem to help much. See for example the comment thread in cousin_it’s last top-level post.
Cousin_it and JoshuaZ, this sounds as though it could be a good topic (or group of topics) for a top-level post.
Do you mean a set of posts on telling when you don’t know enough or a set of posts on the math people should know?
Either or both would be useful, but I was thinking about the latter.
In getting to some of those things, there are some more basic subjects that would have to be mastered:
Algebra: giving names to unknown numerical quantities and then reasoning about them the way one would with actual numbers.
Calculus: the relationship between a rate of change and a total amount, and the basic differential equations of physics, e.g. Newtonian mechanics, the diffusion equation, etc.
If this all seems like a lot, many people spend until well into their twenties in school. Think where they would get to if all that time had been usefully spent!
Why is all that physics necessary? I’m not seeing it.
Practical examples. Not many people are going to plough through abstract mathematics without them.
Causal networks, axiomatizations of expected utility.
Gosh, now I don’t know whether to feel bad or not for asking that question.
But I guess ‘no, it’s not just cardinality that matters but measure’ is a good answer. Is there any quick easy explanation of measure and its use in probability?
(I have yet to learn anything from Wikipedia on advanced math topics. You don’t hear bad things about Wikipedia’s math topics, but as Bjarne said of C++, complaints mean there are users.)
Don’t feel bad, you’re actually a hero. The four levels of depravity, by cousin_it:
Ask an illiterate question.
Assert an illiterate statement.
Assert and violently defend an illiterate statement.
Assert and violently defend an illiterate statement, becoming offended and switching to ad hominems when you begin to lose.
I’m not sure if other people can successfully learn math topics from Wikipedia because I’m atypical and other-optimizing is hard. Anyway, here’s how you tell whether you actually understand a math topic: you should be able to solve simple problems. For example, someone who can’t find Nash equilibria in simple 2x2 games is unqualified to talk about the Prisoner’s Dilemma, and someone who can’t solve the different Monty Hall variants is unqualified to talk about Bayesianism and priors.
You should post this hierarchy somewhere more permanent. It seems… useful.
Here’s a non-wiki explanation of measure. http://www.ams.org/bookstore/pspdf/stml-48-prev.pdf It’s a generalization of the concept of length.
Complaints mean there are annoyed users, yes. :-)
Also, I second this observation of Wikipedia’s math pages not being a good learning resource. They’re pretty good as reference and for refreshers on stuff you’ve already learned, but it’s no substitute for a decent textbook and/or instructor.
Yes, it’s uncountable, so the probability of any particular real number is zero. I’m not going to go into detail about what I mean by a probability distribution here (though I’ll note that cousin_it used the word “integrate”). To uniformly pick a number x from 0 to 1, use an infinite sequence of coin flips to represent it in binary. Infinity is impractical, but for natural questions, like: is x between 1⁄5 and 1/sqrt(2), you will almost surely need only finitely many flips (and your expected number is also finite). And the probability that the answer is yes is 1/sqrt(2)-1/5; that is the sense in which it is uniform.
You can transfer this distribution to the set of all real numbers, eg, by logit or arctan, but it won’t be uniform in the same sense. One can satisfy that uniformity by a measure which not a probability measure.
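A runnable sketch of that construction (the interval endpoints 1/5 and 1/sqrt(2) are the ones from the comment; everything else is just one way to code it up):

```python
import random

def in_interval(lo=0.2, hi=2 ** -0.5):
    """Flip binary digits of x until it is settled whether x lies in [lo, hi]."""
    left, right = 0.0, 1.0            # current dyadic interval known to contain x
    while not (lo <= left and right <= hi) and not (right <= lo or hi <= left):
        mid = (left + right) / 2
        if random.random() < 0.5:     # one fair coin flip = one binary digit of x
            right = mid               # digit 0: x is in the lower half
        else:
            left = mid                # digit 1: x is in the upper half
    return lo <= left and right <= hi

trials = 100_000
print(sum(in_interval() for _ in range(trials)) / trials)   # ~= 1/sqrt(2) - 1/5 ~= 0.507
```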
It won’t change “fundamentally” i.e. in any important way, but it will slightly change the meaning of the probability decreasing on average, because for example, the probability of all hypotheses of complexity 17 might be 0, and if this is the case then there is no greater complexity which will give a lower probability for this particular value. But in any case either the terms will sum to 1 after a finite number of values, in which case the probability will drop permanently to 0 after that, or it will not, in which case the theorem will continue to hold as applied to all the hypotheses that have some positive probability.
Just because there are infinitely many options doesn’t mean that it’s reasonable to assign a probability of 1/infinity. These options are claims about reality that you make with a language; saying that their probability is such a value is much the same as if I said that the probability that you are right in this disagreement is 1/infinity. If someone can say this in the context of my argument, he can say it any context, and this would not be standard probability theory.
Is the claim “you can have probability distributions with infinitely many options and no such distribution can be uniform” a premise or conclusion of your argument?
I don’t agree with the above but even if I did I would assert that it is trivial to reverse the order of ranking such that the arbitrary limitations on the probability distribution lead you to completely the opposite conclusion.
You can’t reverse the order, because there is a least element but no greatest element. So they can’t be swapped round.
The time taken to reverse the ordering of an ordered list depends on the size of the list. Yes, if the size is infinite you cannot reverse it. Or print it out. Or sum it. Or find the median. Infinity messes things up. It certainly doesn’t lead us to assuming:
“Things can be considered in order of complexity but not in order of simplicity (1/compexity) and therefore the most complex things must be lower probability than simple things.”
I say this is just demonstrating “things you shouldn’t do with infinity”, not proving Occams Razor.
That is not a conclusion. By the definitions given, there is no “most complex thing”; rather, for any given element you pick out that you might think is the most complex thing, there are infinitely many that are more complex.
Infinity is not magic.
The claim you mention is a conclusion from standard probability theory.
The order can’t be reversed, just as you can’t reverse the order of the natural numbers to make them go from high to low. Yes, you could start at a googol and count down to 1, but then after that you would be back to counting up again. In the same way, every possible ordering of hypotheses will sooner or later start ordering them from simpler to more complex.
I voted the post up because it is interesting but I think it is incorrect. It doesn’t give me reason to believe that Occams’s Razor is to be preferred over the reverse distribution.
For Chaos!
Alternately, if you prefer, order according to 1/complexity. The numbers wouldn’t be integers but they would still be a finite number at each value and could be treated the same way.
Whaaaaa? Your proof is wrong. Choosing y(n)=0 and z(n)=0 satisfies your criteria, but doesn’t lead to your conclusion.
Did I leave something out of the reversal or did I reverse one too many things? (I think y and j need different indices too.)
No, the statement you’re proving is also wrong. When you’re summing an infinite series, the directions “toward zero” and “toward infinity” aren’t symmetric: the terms of the series must become arbitrarily small as n goes to infinity, but they don’t have to become small as n goes to zero. And it’s impossible to enumerate hypotheses (using integers starting from 1) in order of decreasing complexity.
No, but it is harder to describe.
It’s impossible to enumerate infinitely many hypotheses no matter which way you try to do it. You just realise that it is impossible quicker if you try to do it backwards.
I am interested in whether you disagree with just my explanation (reasonable enough, it isn’t that great) or actually disagree with the principle:
For any given ordered probability distribution another probability distribution can be described with the values reversed
If you disagree with that, then the inferential difference between us would seem to be insurmountable.
This is supposed to be the point where I throw up my hands and angrily tell you to study some math.
For what it’s worth, I’m not sure what exactly your principle means because it hasn’t been stated in math terms, but I can imagine only one math idea that could correspond to your vague statement, and that idea is self-evidently wrong. Namely, if you have a real-valued function defined on the positive integers and that function is strictly decreasing, you cannot reorder the values to make them strictly increasing. If you think that you can, I advise you to try building an example. For example, take f(n) = 1/n, which is strictly decreasing. Try to build a strictly increasing function g(m) such that for each m there’s a unique n (and vice versa) satisfying g(m)=f(n). Both m and n range over the positive integers.
I didn’t throw my hands up angrily, I more flicked them out dismissively. I’m not going to learn anything further from this conversation either.
I can construct real-valued functions of the kind you mention easily enough; what I cannot do is construct such a function that is computable in finite time. You want me to draw a conclusion from this that seems obviously absurd.
Such a g(m) cannot exist. Here’s a proof:
Assume that g(m) exists. Then there exists an m0 such that g(m0)=f(1)=1. Take any m1>m0. Then g(m1)>1 (because g is strictly increasing). But there also exists an n1 such that g(m1)=f(n1). This cannot be, because f(n) is always less than or equal to 1 when n is a positive integer. Contradiction, QED.
If you have a proof that g(m) exists (computable or uncomputable, no matter), we have a difficult situation on our hands. Could you exhibit your proof now?
Seriously, what you are saying seems to be trivially wrong. You should make it more explicit either way: give an example of what you mean (construct the real valued function, however uncomputable), or engage cousin_it’s discussion.
The only thing I expect to see down that path is more argument on what things are and are not permitted to do with infinity.
If I was to continue discussion on the subject of Occam’s Razor it would be in the form of a new post. Probably involving either a stack of papers and a dart or an extremely large game of spin the bottle. The last one sounds more fun.
You are not acting reasonably in this thread, which is surprising since elsewhere you usually do.
What’s the reverse of the sequence 1,2,3,4...?
I had a different formulation, that suggested you stick with the first hypothesis you picked.
Definitely better.