Occam’s Razor
The more complex an explanation is, the more evidence you need just to find it in belief-space. (In Traditional Rationality this is often phrased misleadingly, as “The more complex a proposition is, the more evidence is required to argue for it.”) How can we measure the complexity of an explanation? How can we determine how much evidence is required?
Occam’s Razor is often phrased as “The simplest explanation that fits the facts.” Robert Heinlein replied that the simplest explanation is “The lady down the street is a witch; she did it.”
One observes that the length of an English sentence is not a good way to measure “complexity.” And “fitting” the facts by merely failing to prohibit them is insufficient.
Why, exactly, is the length of an English sentence a poor measure of complexity? Because when you speak a sentence aloud, you are using labels for concepts that the listener shares—the receiver has already stored the complexity in them. Suppose we abbreviated Heinlein’s whole sentence as “Tldtsiawsdi!” so that the entire explanation can be conveyed in one word; better yet, we’ll give it a short arbitrary label like “Fnord!” Does this reduce the complexity? No, because you have to tell the listener in advance that “Tldtsiawsdi!” stands for “The lady down the street is a witch; she did it.” “Witch,” itself, is a label for some extraordinary assertions—just because we all know what it means doesn’t mean the concept is simple.
An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, “Maybe a really powerful agent was angry and threw a lightning bolt.” The human brain is the most complex artifact in the known universe. If anger seems simple, it’s because we don’t see all the neural circuitry that’s implementing the emotion. (Imagine trying to explain why Saturday Night Live is funny, to an alien species with no sense of humor. But don’t feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.
To a human, Maxwell’s equations take much longer to explain than Thor. Humans don’t have a built-in vocabulary for calculus the way we have a built-in vocabulary for anger. You’ve got to explain your language, and the language behind the language, and the very concept of mathematics, before you can start on electricity.
And yet it seems that there should be some sense in which Maxwell’s equations are simpler than a human brain, or Thor the thunder-agent.
There is. It’s enormously easier (as it turns out) to write a computer program that simulates Maxwell’s equations, compared to a computer program that simulates an intelligent emotional mind like Thor.
The formalism of Solomonoff induction measures the “complexity of a description” by the length of the shortest computer program which produces that description as an output. To talk about the “shortest computer program” that does something, you need to specify a space of computer programs, which requires a language and interpreter. Solomonoff induction uses Turing machines, or rather, bitstrings that specify Turing machines. What if you don’t like Turing machines? Then there’s only a constant complexity penalty to design your own universal Turing machine that interprets whatever code you give it in whatever programming language you like. Different inductive formalisms are penalized by a worst-case constant factor relative to each other, corresponding to the size of a universal interpreter for that formalism.
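Kolmogorov complexity itself is uncomputable, but a crude intuition pump (not part of Solomonoff’s formalism, and only a rough stand-in) is to run the data through an off-the-shelf compressor: the compressed size is an upper bound on the length of a program that prints the string. A minimal Python sketch:

```python
# Crude illustration: compressed size as an upper bound on description length.
# (Real Kolmogorov complexity is uncomputable; zlib is only a stand-in.)
import zlib
import random

patterned = ("01" * 500).encode()                          # highly regular data
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(1000))  # incompressible data

print(len(patterned), len(zlib.compress(patterned)))  # 1000 bytes -> very short
print(len(noisy), len(zlib.compress(noisy)))          # 1000 bytes -> barely shrinks
```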
In the better (in my humble opinion) versions of Solomonoff induction, the computer program does not produce a deterministic prediction, but assigns probabilities to strings. For example, we could write a program to explain a fair coin by writing a program that assigns equal probabilities to all 2^N strings of length N. This is Solomonoff induction’s approach to fitting the observed data. The higher the probability a program assigns to the observed data, the better that program fits the data. And probabilities must sum to 1, so for a program to better “fit” one possibility, it must steal probability mass from some other possibility which will then “fit” much more poorly. There is no superfair coin that assigns 100% probability to heads and 100% probability to tails.
How do we trade off the fit to the data, against the complexity of the program? If you ignore complexity penalties, and think only about fit, then you will always prefer programs that claim to deterministically predict the data, assign it 100% probability. If the coin shows HTTHHT, then the program that claims that the coin was fixed to show HTTHHT fits the observed data 64 times better than the program which claims the coin is fair. Conversely, if you ignore fit, and consider only complexity, then the “fair coin” hypothesis will always seem simpler than any other hypothesis. Even if the coin turns up HTHHTHHHTHHHHTHHHHHT . . .
Indeed, the fair coin is simpler and it fits this data exactly as well as it fits any other string of 20 coinflips—no more, no less—but we see another hypothesis, seeming not too complicated, that fits the data much better.
If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should cost at least a “factor of two gain” in the fit. If you try to design a computer program that explicitly stores an outcome like HTTHHT, the six bits that you lose in complexity must destroy all plausibility gained by a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.
Unless your program is being smart, and compressing the data, it should do no good just to move one bit from the data into the program description.
The way Solomonoff induction works to predict sequences is that you sum up over all allowed computer programs—if every program is allowed, Solomonoff induction becomes uncomputable—with each program having a prior probability of 1⁄2 to the power of its code length in bits, and each program is further weighted by its fit to all data observed so far. This gives you a weighted mixture of experts that can predict future bits.
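To make the recipe concrete, here is a minimal Python sketch of such a weighted mixture. It is emphatically not real Solomonoff induction: the three “programs” are hand-picked hypotheses, and their code lengths in bits are invented for illustration rather than measured on any actual machine.

```python
# Toy mixture of experts: prior 2^-(code length), weighted by fit, normalized.
# The hypotheses and their "code lengths" are invented for illustration only.
def fair(seq):                        # assigns 2^-N to any length-N sequence
    return 0.5 ** len(seq)

def alternating(seq):                 # probability 1 to "HTHT...", else 0
    return 1.0 if seq == ("HT" * len(seq))[:len(seq)] else 0.0

def all_heads(seq):                   # probability 1 to "HHH...", else 0
    return 1.0 if seq == "H" * len(seq) else 0.0

hypotheses = [("fair coin", 5, fair),
              ("alternating", 9, alternating),
              ("all heads", 8, all_heads)]

data = "HTHTHTHTHT"
weights = {name: 2.0 ** -bits * fn(data) for name, bits, fn in hypotheses}
total = sum(weights.values())
for name, w in weights.items():
    print(name, w / total)            # posterior weight of each "expert"

# Mixture prediction that the next flip comes up H:
p_next_H = sum(2.0 ** -bits * fn(data + "H")
               for _, bits, fn in hypotheses) / total
print("P(next = H) =", round(p_next_H, 3))   # ~0.99, driven by "alternating"
```

Real Solomonoff induction replaces these three experts with every allowed program, which is exactly what makes it uncomputable.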
The Minimum Message Length formalism is nearly equivalent to Solomonoff induction. You send a string describing a code, and then you send a string describing the data in that code. Whichever explanation leads to the shortest total message is the best. If you think of the set of allowable codes as a space of computer programs, and the code description language as a universal machine, then Minimum Message Length is nearly equivalent to Solomonoff induction.1
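As a toy illustration of the two-part message (all code lengths below are invented, not produced by any real MML procedure), compare three ways of transmitting the sequence 0101010101:

```python
# Total message = bits to describe the hypothesis + bits to encode the data
# given that hypothesis. All lengths here are invented for illustration.
data = "0101010101"

# "The bits are random": cheap to state, but every bit must then be sent.
random_hyp  = 5 + len(data)      # 15 bits
# "Alternate 0 and 1, starting with 0": costlier to state, data is then free.
pattern_hyp = 12 + 0             # 12 bits
# "A witch did it": a prologue that constrains nothing, so the data must
# still be sent in full -- strictly worse than saying nothing.
witch_hyp   = 20 + len(data)     # 30 bits

print(min([("random", random_hyp), ("pattern", pattern_hyp),
           ("witch", witch_hyp)], key=lambda t: t[1]))   # ('pattern', 12)
```

The “witch” message comes out strictly longer than just sending the raw bits, which is the point made about witchcraft below.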
This lets us see clearly the problem with using “The lady down the street is a witch; she did it” to explain the pattern in the sequence 0101010101. If you’re sending a message to a friend, trying to describe the sequence you observed, you would have to say: “The lady down the street is a witch; she made the sequence come out 0101010101.” Your accusation of witchcraft wouldn’t let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused.
Witchcraft may fit our observations in the sense of qualitatively permitting them; but this is because witchcraft permits everything, like saying “Phlogiston!” So, even after you say “witch,” you still have to describe all the observed data in full detail. You have not compressed the total length of the message describing your observations by transmitting the message about witchcraft; you have simply added a useless prologue, increasing the total length.
The real sneakiness was concealed in the word “it” of “A witch did it.” A witch did what?
Of course, thanks to hindsight bias and anchoring and fake explanations and fake causality and positive bias and motivated cognition, it may seem all too obvious that if a woman is a witch, of course she would make the coin come up 0101010101. But I’ll get to that soon enough. . .
1 Nearly, because it chooses the shortest program, rather than summing up over all programs.
The Vapnik–Chervonenkis dimension also offers a way of filling in the details of the concept of “simple” appropriate to Occam’s Razor. I’ve read about it in the context of statistical learning theory, specifically “probably approximately correct learning”.
Having successfully tuned the parameters of your model to fit the data, how likely is it to fit new data, that is, how well does it generalise? The VC dimension comes with formulae that tell you. I’ve not been able to follow the field, but I suspect that VC dimension leads to worst-case estimates whose usefulness is harmed by their pessimism.
Great post!
“Your accusation of witchcraft wouldn’t let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused.”
My model of witches, if I had one, would produce a given simple sequence like 01010101 with greater probability than a given random sequence like 00011011. Wouldn’t yours? I might agree if you said “in nearly full detail”.
Steven, that means you have to transmit the accusation of witchcraft, followed by a computer program, followed by the coded data. Why not just transmit the computer program followed by the coded data? I don’t expect my own environment to be random noise, but that has nothing to do with witchcraft...
Alan, I agree that VC dimension is an important conceptually different way of thinking about “complexity”. One of its primary selling points is that, for example, it doesn’t attach infinite complexity to a model class that contains one real-valued parameter, if that model class isn’t very flexible (i.e., it says only “the data points are greater than R”). But VC complexity doesn’t plug into standard probability theory as easily as Solomonoff induction.
In Solomonoff induction it is important to use a two-tape Turing machine where one tape is for the program and one is for the input and work space. The program tape is an infinite random string, but the program length is defined to be the number of bits that the Turing machine actually reads during its execution. This way the set of possible programs becomes a prefix free set. It follows that the prior probabilities will add up to one when you weight by 2^(-l) where l is program length. (I believe this was realized by Leonid Levin. In Solomonoff’s original scheme the prior probabilities did not add to one.) This also allows the beautiful interpretation that the program tape is assigned by independent coin flips for each bit, and the 2^-l weighting arises naturally rather than as an artificial assumption. I believe this is discussed in the information theory book by Cover and Thomas.
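A quick sanity check of that construction, sketched in Python with an example prefix-free set (the specific code words are arbitrary):

```python
# No code word is a prefix of another, so by Kraft's inequality the
# 2^-(length) weights sum to at most 1 and can serve as prior probabilities.
codes = ["0", "10", "110", "1110", "1111"]

def is_prefix_free(codes):
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

print(is_prefix_free(codes))                 # True
print(sum(2.0 ** -len(c) for c in codes))    # 1.0 (<= 1 in general)
```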
Eliezer,
“I don’t expect my own environment to be random noise, but that has nothing to do with witchcraft...”
I think I misinterpreted the math and now see what you’re getting at. Would it be an accurate translation to human language to say, “a sequence like 10101010 may favor witchcraft over the hypothesis that nothing weird is going on (i.e. the coinflips are random), but it will never favor witchcraft over the simpler hypothesis that something weird is going on that isn’t witchcraft”?
I find it awkward to think of “witchcraft” as just a content-free word; what “witchcraft” means to me is something like the possibility that reality includes human-mind-like things with personalities and with preferences that they achieve through unknown nonstandard causal means. If you coded that up, it would probably no longer be content-free; it would allow shortening the rest of the program generating the sequences in some cases and require lengthening it in some other cases. In all realistic cases the resulting program would still be longer than necessary.
Good comments, all!
Steven, yes. Stephen, also yes.
Eli, you said:
An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, “Maybe a really powerful agent was angry and threw a lightning bolt.” The human brain is the most complex artifact in the known universe. If anger seems simple, it’s because we don’t see all the neural circuitry that’s implementing the emotion. (Imagine trying to explain why Saturday Night Live is funny, to an alien species with no sense of humor. But don’t feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.
I think it’s worth noting that Norse tribesfolk already knew about human beings, so whatever model of the universe they made had to include angry agents in it somewhere.
I agree. I feel like the post is poking a bit of fun at hokey religion, and in so doing falls into an error. The Norse would do quite badly in life if they switched to a prior based on description lengths in Turing machines rather than a description length in their own language, because their language embodies useful bias concerning their environment. Similarly, English description lengths contain useful bias for our environment. The formalism of Solomonoff induction does not tell us which universal language to use, and English is a fine choice. The “thunder god” theory is not bad because of Occam’s razor, but because it doesn’t hold up when we investigate empirically! Similarly, if the Norse believed that earthquakes were caused by giant animals moving under the earth, it would not be such a bad theory given what evidence they had (even though animals are complex from a Turing-machine perspective); animals caused many things in their environment. We just know it is wrong today, based on what we know now.
What you are talking about in terms of Solomonoff induction is usually called algorithmic information theory, and the shortest-program-to-produce-a-bit-string is usually called Kolmogorov-Chaitin information. I am sure you know this. Which raises the question: why didn’t you mention this? I agree, it is the neatest way to think about Occam’s razor. I am not sure why some are raising PAC theory and VC-dimension. I don’t quite see how they illuminate Occam. Minimalist inductive learning is hardly the simplest “explanation” in the Occam sense, and is actually closer to Shannon entropy in spirit, in being more of a raw measure. Gregory Chaitin’s ‘Meta Math: The Search for Omega’, which I did a review summary of, is a pretty neat look at this stuff.
Venkat: I think there is a very good reason to mention PAC learning. Namely, Kolmogorov complexity is uncomputable, so Solomonoff induction is not possible even in principle. Thus one must use approximate methods instead such as PAC learning.
Occam’s razor is not conclusive and it’s not science. It is not unscientific but I would say that it fits into the category of philosophy. In science you do not get two theories, take the facts you know, and then conclude based on the simplest theory. If you’re doing this, you need to do better experiments to determine the facts. Occam’s razor can be a useful heuristic to suggest what experiments should be done. Just like mathematical elegance, Occam’s razor suggests that something is on the right track but it is not decisive. To look back at the facts and then interpret them through Occam’s razor is just an exercise in hindsight bias.
Your analogy with Norse tribesfolk reminds me of the NRA slogan, “Guns don’t kill people, people kill people”. There are many different levels of causation. The gun can be said to be the secondary cause of why someone died. The person pulling the trigger would be the primary cause. The secondary cause of thunder is nature but the first cause that brought things into existence and created the system is God. Nature cannot be its own cause.
The rest of what you wrote sounds like you’re pulling numbers out of your arse. The last sentence should be read in your best Norse tribesfolk accent.
Science is just a method of filtering hypotheses. Which is exactly what Occam’s razor is. Occam’s razor is not a philosophy, it is a statistical prediction. To claim that Occam’s razor is not a science would be to claim that statistics is not a science.
Example: You leave a bowl with milk in it out overnight, you wake up in the morning and it’s gone. Two possible theories: one, your cat drank it; or two, someone broke into your house, drank it, then left.
Well, we know that cats like milk, and you have a cat, so you know the probability of there being a cat is 1:1, and you also know your cat likes to steal food when you’re sleeping, so based on past experience you might say the probability of the cat stealing the milk is 1:2, so you have two high probabilities. But when we consider the burglar hypothesis, we know that it’s extremely rare for someone to break into our house, thus the probability of that situation, while physically possible, is very low, say 1 in 10,000. We know that burglars tend to break into houses to steal expensive things, not milk from a bowl, thus the probability of that happening is, say, 1 in a million.
This is Occam’s razor at work: it’s (1⁄1)·(1⁄2) vs. (1⁄10,000)·(1⁄1,000,000). It’s statistics, and it’s science. Nothing I described here would be inaccessible to experimentation and control groups.
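A minimal sketch of that arithmetic (using the commenter’s rough guesses above, not measured frequencies):

```python
# The numbers are the commenter's guesses, not measured frequencies.
p_cat     = (1 / 1) * (1 / 2)                # cat exists * cat steals food at night
p_burglar = (1 / 10_000) * (1 / 1_000_000)   # break-in happens * burglar wants the milk

print(p_cat / p_burglar)                     # odds of ~5,000,000,000 : 1 for the cat
```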
I think that the God reference and foul language in Cure_of_Ars’s comment have misdirected an important criticism of this article, which I for one would like to hear your responses to. So, for those who downvoted and dismissed the criticism along with his comment: I would like to hear your thoughts and have it explained to me, because to me it is not obvious that his first paragraph has no point.
But to clarify, I’d restate my open questions on the subject which were partly described by his comment.
The original formulation of this principle is: “Entities should not be multiplied without necessity.” This formulation is not that clear to me; what I can understand from it is that one shouldn’t add complexity to a theory unless one has to.
A clear example where Occam’s razor may be used as intended is the following: assume I have a program that takes a single number as an input and returns a number. Now, if we observe the following sequence: f(1) = 2, f(4) = 16 and f(10) = 1024, we might be tempted to say f(x) = 2^x. But this is not the only option; we could have: f(x) = {x > 0 → 2^x, x ≤ 0 → 10239999999} or even f(x) = {1 → 2, 4 → 16, 10 → 1024, [ANY OTHER INPUT TO OUTPUT]}.
Since these examples all make the same predictions in all experimental tests so far, it follows we should choose the simplest one, being 2^x [and if more experimental tests were to follow, we could have chosen in advance similarly complex alternatives that would have predicted the correct observations for those tests just as well as 2^x. In fact, we can only ever make a finite number of experimental tests, and as such there are infinitely many hypotheses that would correctly predict these tests while carrying an additional, useless layer of complexity.]
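A minimal sketch of the same point, restating the hypothetical f’s above in Python: every test performed so far is passed by both candidates, so only a simplicity preference can separate them.

```python
observed = {1: 2, 4: 16, 10: 1024}            # all experimental results so far

f_simple = lambda x: 2 ** x
f_gerrymandered = lambda x: 2 ** x if x > 0 else 10239999999

print(all(f_simple(x) == y for x, y in observed.items()))          # True
print(all(f_gerrymandered(x) == y for x, y in observed.items()))   # True
print(f_simple(-1) == f_gerrymandered(-1))                         # False: they differ off the data
```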
What exactly entities mean, or how multiplication of them is defined, I could only guess based on my understanding of these concepts and the popular interpretations of this principle, such as: “Occam’s razor says that when presented with competing hypotheses that make the same predictions, one should select the solution with the fewest assumptions”
In any case, I sense (after reading multiple sources that emphasize this) that there is an emphasis here that isn’t properly addressed in this article and skipped over in these replies, and it is that Occam’s razor is not meant to be a way of choosing between hypotheses that make different predictions.
In the article, the question of how to weigh simplicity against precision arises: if we have two theories, T1 and T2, which differ in precision (say T1 has a 90% success rate while T2 has 82%) and in complexity (T1 being more complex than T2), how can we decide between the two?
From my understanding, and this is where I would like to hear your thoughts, this question cannot be solved by Occam’s razor. That being said, I think this question is even more interesting and important than the one that Occam’s razor attempts at solving. And to answer that question, it appears that Occam’s razor has been generalized, to something like: “The explanation requiring the fewest assumptions is most likely to be correct.” These generalizations are even given a different name (the law of parsimony, or the rule of simplicity) to stress they are not the same as Occam’s razor.
But that is neither the original purpose of the principle, nor is it a proven fact. The following quote stresses this issue: “The principle of simplicity works as a heuristic rule of thumb, but some people quote it as if it were an axiom of physics, which it is not. [...] The law of parsimony is no substitute for insight, logic and the scientific method. It should never be relied upon to make or defend a conclusion. As arbiters of correctness, only logical consistency and empirical evidence are absolute.”
A usage of this principle that does appeal to my logic is to get rid of hypothetical absurdities, esp. if they cannot be tested using the scientific method. This has been done in the field of physics, and this quote illustrates my point:
“In physics we use the razor to shave away metaphysical concepts. [...] The principle has also been used to justify uncertainty in quantum mechanics. Heisenberg deduced his uncertainty principle from the quantum nature of light and the effect of measurement.
Stephen Hawking writes in A Brief History of Time:
“We could still imagine that there is a set of laws that determines events completely for some supernatural being, who could observe the present state of the universe without disturbing it. However, such models of the universe are not of much interest to us mortals. It seems better to employ the principle known as Occam’s razor and cut out all the features of the theory that cannot be observed.””
My point here is not to disagree with the rule of simplicity (and surely not with the original razor) but to stress why it is somewhat philosophical (after all, it was invented in the 14th century, well before the scientific method), or at least that it isn’t proven that this law is right for all cases; there are strong cases in history that support it, but that is not the same as being proven.
I think that this law is a very good heuristic. Especially when we try to locate our belief in belief-space. But I believe this razor is wielded with less care than it should be—please let me know if and why you disagree.
Additionally, I do not think I have gained a practical tool to evaluate precision vs. simplicity. Solomonoff induction seems practically impossible to use in real life, especially when evaluating theories outside the laboratory (in our actual lives!). I do understand it’s a very hard problem, but Rationality’s purpose is all about using our brains, with all their weaknesses and biases, to the best of our abilities, in order to have the maximum chance of reaching Truth. This calls for practical tools, however imperfect they may be (hopefully as close to perfect as possible), to deal with these kinds of problems in our private lives. I do not think that Solomonoff induction is such a tool, and I do think we could use some heuristic to help us in this task.
To dudeicus: one cannot argue for a theory by giving an example of it and then conclude by saying “if it were tested with proper research, it would be proven.” That is not the scientific method at work. What I do take from your comment is only that this has not been formally proven—thus relating to the philosophy discussion again.
In science you do not get two theories
You’re right—there are an infinite number of theories consistent with any set of observations. Any set. All observed facts are technically consistent with the prediction that gravity will reverse in one hour, but nobody believes that because of… Occam’s Razor!
I don’t think this is what’s actually going on in the brains of most humans.
Suppose there were ten random people who each told you that gravity would be suddenly reversing soon, but each one predicted a different month. For simplicity, person 1 predicts the gravity reversal will come in 1 month, person 2 predicts it will come in 2 months, etc.
Now you wait a month, and there’s no gravity reversal, so clearly person 1 is wrong. You wait another month, and clearly person 2 is wrong. Then person 3 is proved wrong, as is person 4 and then 5 and then 6 and 7 and 8 and 9. And so when you approach the 10-month mark, you probably aren’t going to be expecting a gravity-reversal.
Now, do you not suspect the gravity-reversal at month ten simply because it’s not as simple as saying “there will never be a gravity reversal,” or is your dismissal substantially motivated by the fact that the claim type-matches nine other claims that have already been disproven? I think that in practice most people end up adopting the latter approach.
The rest of what you wrote sounds like you’re pulling numbers out of your arse.
Cure of Ars, I should prefer it if you no longer commented on my posts. There may be a place on Overcoming Bias for Catholics; but none for those who despise math they don’t understand.
MIT Press has just published Peter Grünwald’s The Minimum Description Length Principle. His Preface, Chapter 1, and Chapter 17 are available at that link. Chapter 17 is a comparison of different conceptions of induction.
I don’t know this area well enough to judge Peter’s work, but it is certainly informative. Many of his points echo Eliezer’s. If you find this topic interesting, Peter’s book is definitely worth checking out.
“Different inductive formalisms are penalized by a worst-case constant factor relative to each other”
You mean a constant term; it’s additive, not multiplicative.
That depends on whether you’re thinking of the length or the probability. Since the length is the log-probability, it works out: an additive penalty of c bits on the length is a multiplicative factor of 2^−c on the probability, because 2^−(ℓ+c) = 2^−ℓ ⋅ 2^−c.
Occam’s razor actually suggests that entities are not to be multiplied without necessity.
Unfortunately, most people happily bastardize Occam’s Razor, abusing it to suggest the simpler explanation is usually the better one.
First off, define simple. Simple how? Can you objectively define simplicity? (It’s not easily done.) Second, explanations must fit the facts. Third, this is a heuristic-based argument, not a logical proof of something. (This same argument was used against Boltzmann and his idea of the atom, but Boltzmann was right.) Fourth, what does “usually” mean anyway? Define that objectively. Black swan events are seemingly impossible, yet they happen much more regularly than people imagine (because they follow power laws/fractal statistics, not the bell-curve picture of reality we often think in, where the past gives us some sense of what to expect).
Consequently, I don’t consider an offhand mention of Occam’s razor as a compelling argument. I would stop shaking your head and reconsider what it is you think you know.
Several of these points are explicitly addressed in the article.
shanerg is right, Occam’s razor is not “The simplest answer is usually the right one.” It is, “do not suggest entities for which there is no need”.
The former is a common misrepresentation of Occam’s razor; it is extremely vague, has too many hidden assumptions, and I think it shouldn’t be used. Now, I do agree with everything that was written in the article, but everything in the article was the underlying explanation for why Occam’s razor is true, which, simply put, has to do with statistics. I was disappointed, though, that this article about Occam’s razor didn’t actually state Occam’s razor in it.
I’m sure we could have a fruitful discussion about the proper form of Occam’s Razor, generally speaking it is taken slightly differently than the precise wording attributed to William of Occam.
However, shanerg’s post includes several questions answered explicitly and prominently in the post to which ey is responding. Based on this, I expected that a lengthy philosophical response would be wasted.
Coming back to this post, I finally noticed that “emotional” is a necessary word in the quoted sentence. If we leave it out, the sentence might just become false! That is, if you believe there’s some sort of simple mathematical “key” to intelligence (my impression is that you do believe that), then you also ought to believe that the Solomonoff prior makes an intelligent god quite probable a priori. Maybe even more probable than the currently most elegant known formulations of physical laws, which include a whole zoo of elementary particles, etc. Of course, if we take into account the evidence we’ve seen so far, it looks like our universe is based on physics rather than a “god”.
What sort of specification for Thor are you thinking of that could possibly be simpler than Maxwell’s equations? A description of macroscopic electrical phenomena is more complex, as is “a being that wants to simulate Maxwell’s equations.”
If you’re thinking of comparing all “god-like” hypotheses to Maxwell’s equations, sure. But that comparison is a bit false—you should really be comparing all “god-like” hypotheses to all “natural law-like” hypotheses, in which case I confidently predict that the “natural law-like” hypotheses will win handily.
Yeah, I agree. The shortest god-programs are probably longer than the shortest physics-programs, just not “enormously” longer.
Probably enormously longer, if you want it to produce a god that would cause the world to act as if basic EM held.
i.e., you don’t just need a mind, you need to specify the sort of mind that would want to cause the world to be a specific way...
Is there a nice way to quantify how fast these theoretical priors drop off with the length of something? By how much should I favor a simple explanation X over an only moderately more complicated explanation Y?
Interesting question. If you have a countable infinity of mutually exclusive explanations (e.g. they are all finite strings using letters from some finite alphabet), then your only constraint is that the infinite sum of all their prior probabilities must converge to 1. Otherwise you’re free to choose. You could make the convergence really fast (say, by making the prior of a hypothesis inversely proportional to the exponent of the exponent of its length), or slower if you wish to. A very natural and popular choice is restricting the hypotheses to form a “prefix-free set” (no hypothesis can begin with another shorter hypothesis) and then assigning every hypothesis of N bits a prior of 2^-N, which makes the sum converge by Kraft’s inequality.
What is the reasoning behind using a prefix-free set?
Apart from giving a simple formula for the prior, it comes in handy in other theoretical constructions. For example, if you have a “universal Turing machine” (a computer than can execute arbitrary programs) and feed it an infinite input stream of bits, perhaps coming from a random source because you intend to “execute a random program”… then it needs to know where the program ends. You could introduce an end-of-program marker, but a more general solution is to make valid programs form a prefix-free set, so that when the machine has finished reading a valid program, it knows that reading more bits won’t result in a longer but still valid program. (Note that adding an end-of-program marker is one of the ways to make your set of programs prefix-free!)
Overall this is a nice example of an idea that “just smells good” to a mathematician’s intuition.
Ah! I must have had a brain-stnank—this makes total sense in math / theoretical CS terms, I was substituting an incorrect interpretation of “hypothesis” when reading the comment out of context. Thanks :)
And, in particular, we’re looking at god-programs that produce the output we’ve observed, which seems to cut out a lot of them (and specifically a lot of simple ones).
Occam’s Razor is “entities must not be multiplied beyond necessity” (entia non sunt multiplicanda praeter necessitatem)
NOT “The simplest explanation that fits the facts.”
Now that’s just a matter of definition. I think both are true. I think there are problems with both. The problem with Occam’s razor is that, yes, it’s true; however, it doesn’t cover all the bases. There is a deeper underlying principle that makes Occam’s razor true, which is the one you described in the article. However, summing up your article as “The simplest explanation that fits the facts” is also misleading, in that, while it does seem to cover all the bases, it only does so under a very specific definition of simple which really doesn’t fit with everyday language.
Example: Stonehenge. Let me suggest two theories: 1. it was built by ancient humans, 2. it fell together through purely random geological processes. Both theories fit the facts, and we know that both are physically possible (yes, 2. is vastly less probable, I’ll get to that in a second). Occam’s razor suggests 2. as the answer, and “The simplest explanation” appears to be 2. also. Both seem to be failing. The real underlying principle as to why Occam’s razor is true is statistics, not simplicity. Now don’t get me wrong, I understand why “The simplest explanation that fits the facts” actually points to 1., but then you have to go through this long process of what you actually mean by simplest, which basically just ends up being a long explanation of how “simple” actually means “probable”.
Anyways, I’m just arguing over semantics; I do in fact agree with everything you said. I just wish there were no Occam’s razor; it should just be “The theory which is the most statistically probable is usually the right one.” This is what people actually mean to say when they say “The simplest explanation that fits the facts.”
The form you list it in is the historical form of Occam’s Razor, but it isn’t the form that the Razor has been applied in for a fairly long time. Among other problems, it is hard to define what one means by distinct entities. And we really do want to prefer simpler explanations to more complicated ones. Indeed, the most general form of the razor doesn’t even need to have an explanatory element (in general I prefer a low-degree polynomial for interpolating some data over a high-degree polynomial, even if I have no explanation attached for why I should expect the actual phenomenon to fit a linear or quadratic polynomial).
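A minimal numpy sketch of that preference (the data, noise level, and evaluation point are invented for illustration):

```python
# Fit the same 10 noisy, roughly linear points with 2 parameters and with 10.
# The degree-9 fit matches the training points (noise included) and usually
# generalizes far worse just outside the training range.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = 2 * x + 1 + rng.normal(scale=0.1, size=10)

lin  = np.polyfit(x, y, 1)    # low-degree model: 2 coefficients
wild = np.polyfit(x, y, 9)    # interpolates every point: 10 coefficients

x_new = 1.1                   # just outside the observed range
print(np.polyval(lin, x_new))   # close to the "true" 2*1.1 + 1 = 3.2
print(np.polyval(wild, x_new))  # typically lands nowhere near 3.2
```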
I may be missing something here -- Occam’s Razor is “entities must not be multiplied beyond necessity” (entia non sunt multiplicanda praeter necessitatem)
-- but isn’t the post using the first definition anyway? So even if he explicitly wrote the second definition instead of the first, he was clearly aware of the first since that’s what corresponds with his argument.
In statistics, generally the model that has the fewest variables and is the most statistically probable is the one used. See things like AIC or the Bayesian Information Criterion on how to choose a good model. This means that Occam’s razor is accurate. Given that it is possible to keep adding variables to a model and get a perfect fit, yet have the model be blown apart by the addition of one observation that is not otherwise influential, then, unless you are defining probability to include an information criterion, your formulation is less useful.
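For reference, the standard forms of those criteria, with k fitted parameters, n observations, and maximized likelihood L̂ (lower is better in both):

AIC = 2k − 2 ln L̂
BIC = k ln(n) − 2 ln L̂

In both, the −2 ln L̂ term rewards fit and the k term penalizes complexity, which is the same trade-off the post describes in bits.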
I think replacing witchcraft with godhood is also a common mistake
What I don’t understand is the insistence that Occam’s Razor applies only to explanations you address to God. Otherwise, how do you avoid the observation that the simplicity of an explanation is a function of whom you are explaining it to? In the post, you actually touch on the issue, only to observe that there are difficulties interpreting Occam’s Razor in the frame of explaining things to humans (in their own natural language), so let’s transpose to a situation where humans are completely removed from the picture. Curiously enough, where the same issue occurs in the context of machine languages it is quickly “solved”. Makes one wonder what Occam—who had no access to Turing machines—himself had in mind.
Also, if you deal in practice with shortening the code length of actual programs, at some point you have exploited all the low-hanging fruit; further progress can come only after a moment of contemplation makes you observe that distinct paths of control through the code have “something in common” that you may try to enhance to the point where you can factor it out. This “enhancing” follows from the quest for minimal “complexity”, but it drives you to do locally, on the code, just the contrary of what you did during the “low-hanging fruit” phase: you “complexify” rather than “simplify” two distinct areas of the code to make them resemble each other (and the target result emerges during the process, fun). What I mean to say, I guess, is that even the frame proposed by Chaitin-Kolmogorov complexity gives only fake reasons to neglect bias (from shared background or the equivalent).
“each program is further weighted by its fit to all data observed so far. This gives you a weighted mixture of experts that can predict future bits.”
I don’t see it explained anywhere what algorithm is used to weight the experts for this measure. Does it matter? And how are the “fit” probabilities and “complexity” probabilities combined? Multiply and normalize?
Bayes’ theorem: weight each program by its prior (2^−length) times the probability it assigned to the data observed so far, then normalize.
What I find fascinating is that Solomonoff Induction (and the related concepts from Kolmogorov complexity) very elegantly solves the classical philosophical problem of induction, as well as resolving a lot of other problems:
1. What is the correct “prior” in Bayesian inference, and isn’t the choice of prior all subjective?
2. What does Occam’s razor really mean, and what is a “simple” theory?
3. Why do physicists insist that their theories are “simple” when only they can understand them?
Despite this, it is almost unheard of in the general philosophical (analytic philosophy) community. I’ve read literally dozens of top-grade philosophers discussing these topics, with the implication that these are still big unsolved problems, and in complete ignorance that there is a very rich mathematical theory in this area. And the theory’s not exactly new either… dates back to the 1960s.
Anyone got an explanation for the disconnect?
Philosophers don’t read those things. If that explanation seems lacking, I feel like referring to Feynman.
Possibly because Solomonoff induction isn’t very suitable for answering the kinds of questions philosophers want answered, questions of fundamental ontology. It can tell you what programme would generate observed data, but it doesn’t tell you what the programme is running on: the laws of physics, God’s mind, or a giant simulation. OTOH, traditional Occam’s razor can exclude a range of ontological hypotheses.
There is also the problem that there is no absolute measure of the complexity of a programme: a programming language is still a language, and some languages can express some things more concisely than others, as explained in kokotajlod’s other comment. http://lesswrong.com/lw/jhm/understanding_and_justifying_solomonoff_induction/ady8
I don’t think Solomonoff Induction solves any of those three things. I really hope it does, and I can see how it kinda goes half of the way there to solving them, but I just don’t see it going all the way yet. (Mostly I’m concerned with #1. The other two I’m less sure about, but they are also less important.)
I don’t know why the philosophical community seems to be ignoring Solomonoff Induction etc. though. It does seem relevant. Maybe the philosophers are just more cynical than we are about Solomonoff Induction’s chances of eventually being able to solve 1, 2, and 3.
I found this paragraph confusing. How about
Does that mean the same thing?
Upcoming formal philosophy conference on the foundations of Occam’s razor here. Abstracts included.
I’ll be there!
I don’t think it’s quite necessary for people to even be consciously aware of Occam’s Razor. The right predictions will eventually win out because there will exist an economic profit somewhere which will be exploited. If you can think of an area which is overrun with market inefficiencies due to something related to this post, please let me know and I will be sure to grab whatever I can of the economic profits while they last.
OK, I am coming in way late, but I can tell you that all of you are wrong. Occam’s razor is based on observations of human behavior over a long period of time. Some humans want to attribute mystical or supernatural significance to any event out of the ordinary that occurs in their lives. Advanced brains like yourselves seek to apply equations and theorems that reduce life events to an equation. Sometimes shit happens that just can’t or won’t fit into anyone’s strongly held beliefs or theories. Live life, stop trying to use math or what the fuck ever to explain it!
Hello,
I need some help understanding the article after “Unless your program is being smart, and compressing the data, it should do no good just to move one bit from the data into the program description.”
How is the connection being made from complexity and fit to data and program description?
Thanks in advance! :)
Complexity, as defined in Solomonoff Induction, means program description—that is, code length in bits.
Sidenote: thank you for reminding me that Eliezer was talking about better versions of SI in 2007, before starting his quantum mechanics sequence.
I found a reference to a very nice overview for the mathematical motivations of Occam’s Razor on wikipedia.
It’s Chapter 28: Model Comparison and Occam’s Razor; from (page 355 of) Information Theory, Inference, and Learning Algorithms (legally free to read pdf) by David J. C. MacKay.
The Solomonoff Induction stuff went over my head, but this overview’s talk of trade-offs between communicating increasing numbers of model parameters vs. having to communicate fewer residuals (i.e., offsets from the real data) was very informative.
My own way of thinking of Occam’s Razor is through model selection. Suppose you have two competing statements H1 (the witch did it) and H2 (it was chance, or possibly something other than a witch caused it; H2 = ¬H1) and some observations D (the sequence came up 0101010101). Then the preferred statement is whichever is more probable, calculated as
p(Hi|D) = p(D|Hi) p(Hi) / p(D);
this is simply Bayes’ rule, where
p(D|Hi) = ∫ p(D|θ, Hi) p(θ|Hi) dθ
and the model is parametrized by some parameters θ.
Now all this is just the mathematical way of saying that a hypothesis with more parameters (or, more specifically, more possible values that it predicts) will not be as strong as a statement that predicts a smaller set of outcomes.
In the witch example this would be:
The way I stated the hypotheses, p(D|H1) = p(D|H2) ⋅ (fraction of outcomes that look like a pattern).
Now what remains is to estimate the priors and the fraction of outcomes that look like a pattern. We can skip p(D) as we are interested in p(H1|D) : p(H2|D).
Now, comparing the number of conditionals in the hypotheses and how surprised I am by them, I would roughly estimate the ratio of the priors as something like 2^100 in favor of chance, as the witch hypothesis goes against many of my beliefs about the world formed over many years, it includes weird choices of living for this hypothetical alien entity, it picks me out as a possible agent among many in the neighborhood, and it singles out an arbitrary action of mine and an arbitrary set of outcomes.
For the sake of completeness: the fraction of outcomes that look like a pattern is kind of hard to estimate exactly. However, my way of thinking about it is: how soon in the sequence would I postulate the specific sequence that it ended up as? After 0101, I think that 0101010101 is the most obvious pattern to continue with. So roughly this is six bits of evidence.
In conclusion, I would say that the probability of the witch hypothesis is lacking around 94 bits of evidence for me to believe it as much as the chance hypothesis.
The downside of this approach relative to Solomonoff induction and minimum message length is that it is clunkier to use and it might be easy to forget to include conditionals or complexity in the priors, the same way they can be lost in the English language. The upside is that as a model it is simpler, less ad hoc, and builds directly on the product rule of probability and the fact that probabilities sum to one, and should thus be preferred by Occam’s Razor ;)