Intuitive Explanation of Solomonoff Induction
Update: Alex Altair has now finished this tutorial!
A while ago I began writing a Solomonoff Induction tutorial, but now it’s one of those Less Wrong articles I probably will never have time to write.
But, maybe a few of you can take what I’ve started, team up, and finish the thing. I’m basically just following along with the Solomonoff induction tutorials from Shane Legg (2008, 2004), but making them simpler and readable by a wider audience. I think this would be a valuable thing for Less Wrong to have, and one that would draw even more traffic to our community, but I don’t have time to write it myself anymore!
Who wants to be a hero and finish this thing? The original markdown source is available here.
This is a tutorial for beginners. Those familiar with Solomonoff induction may prefer to read Rathmanner & Hutter (2011).
People disagree about things. Some say that vaccines cause autism; others say they don’t. Some say everything is physical; others believe in a god. Some say that complicated financial derivatives are essential to a modern competitive economy; others think a nation’s economy will do better without them. It’s hard to know what is true.
And it’s hard to know how to figure out what is true. Some think you should assume the things you are most certain about and then deduce all other beliefs from your original beliefs. Others think you should accept at face value the most intuitive explanations of your personal experience. Others think you should generally agree with the scientific consensus until it is disproven.
Wouldn’t it be nice if finding out what’s true was like baking a cake? What if there was a recipe for finding out what is true? All you’d have to do is follow the written instruction exactly, and after the last instruction you’d inevitably find yourself with some sweet, tasty truth!As it turns out, there is an exact recipe for finding truth. It was discovered in the 1960s.
The problem is that you don’t have time to follow the recipe. To find the truth to even a simple question using this recipe would require you to follow one step after another until long after the heat death of the universe, and you can’t do that.
But we can find shortcuts. Suppose you know the exact recipe for baking a cake requires you to count out one molecule of H2O at a time until you have exactly 0.5 cups of water. If you did that, you might not finish before the heat death of the universe. But you could approximate that part of the recipe by measuring out something very close to 0.5 cups of water, and you’d probably still end up with a pretty good cake.
Similarly, once we know the exact recipe for finding truth, we can try to approximate it in a way that allows us to finish all the steps sometime before the heat death of the universe.
This tutorial explains the exact recipe for finding truth, Solomonoff induction, and also explains some attempts to approximate it in ways that allow us to figure out now what is (probably) true. Fear not; I shall not assume you know anything beyond grade-school math.
Like my Crash Course in the Neuroscience of Human Motivation, this tutorial is long. You may not have time for it; that’s fine. But if you do read it, I recommend you read it in sections.
Contents:
Algorithms
At an early age you learned a set of precisely-defined steps — a ‘recipe’ or, more formally, an algorithm — that you could use to find the largest number in an unsorted list of numbers like this:
21, 18, 4, 19, 55, 12, 30
The algorithm you learned probably looked something like this:
Note the leftmost item as the largest you’ve seen in this list so far. If this is the only item on the list, output it as the largest number on the list. Otherwise, proceed to step 2.
Look at the next item. If it is larger than the largest item noted so far, note it as the largest you’ve seen in this list so far. Proceed to step 3.
If you have not reached the end of the list, return to step 2. Otherwise, output the last noted item as the largest number in the list.
Other algorithms could be used to solve the same problem. For example, you could work your way from right to left instead of from left to right. But the point is that if you follow this algorithm exactly, and you have enough time to complete the task, you can’t fail to solve the problem. You can’t get confused about what one of the steps means or what the next step is. Every instruction tells you exactly what to do next, all the way through to the answer.
You probably learned other algorithms, too, like how to find the greatest common divisor of any two integers (see image on right).
But not just any set of instructions is a precisely-defined algorithm. Sometimes, instructions are unclear or incomplete. Consider the following instructions based on an article about the scientific method:
Make an observation.
Form a hypothesis that explains the observation.
Conduct an experiment that will test the hypothesis.
If the experimental results disconfirm the hypothesis, return to step #2 and form a hypothesis not yet used. If the experimental results confirm the hypothesis, provisionally accept the hypothesis.
This is not an algorithm.
First, many of the terms are not clearly defined. What counts as an observation? What counts as a hypothesis? What would a hypothesis need to be like in order to ‘explain’ the observation? What counts as an experiment that will ‘test’ the hypothesis? What does it mean for experimental results to ‘confirm’ or ‘disconfirm’ a hypothesis?
Second, the instructions may be incomplete. What do we do if we reach step 4 and the experimental results neither ‘confirm’ nor ‘disconfirm’ the hypothesis under consideration, but instead are in some sense ‘neutral’ toward the hypothesis? These instructions don’t tell us what to do in that case.
An algorithm is a well-defined procedure that takes some value or values as input and, after a finite series of steps, generates some value or values as output.
For example, the ‘find the largest number’ algorithm above could take the input {21, 18, 4, 19, 55, 12, 30} and would, after 13 steps, produce the following output: {55}. Or it could take the input {34} and, after 1 step, produce the output: {34}.
What we’re looking for is a precise algorithm that will produce truth as its output.
Induction
Whether we are a detective trying to catch a thief, a scientist trying to discover a new physical law, or a businessman attempting to understand a recent change in demand, we are all in the process of collecting information and trying to infer the underlying causes.a
The problem of inference is this: We have a collection of observations (data), and we have a collection of hypotheses about the underlying causes of those observations. We’d like to know which hypothesis is correct, so we can use that knowledge to predict future events.
Suppose your data concern a large set of stock market price changes and other events in the world. You’d like to know the processes responsible for the stock market price changes, because then you can predict what the stock market will do in the future, and make some money.
Or, suppose you are a parent. You come home from work to find a chair propped against the refrigerator, with the cookie jar atop the fridge a bit emptier than before. One hypothesis that leaps to mind is that your young daughter used the chair to reach the cookies. However, many other hypothesis explain the data. Perhaps a very short thief broke into your home and stole some cookies. Perhaps your daughter put the chair in front of the fridge because the fridge door is broken and no longer stays shut, and you forgot that your friend ate a few cookies when he visited last night. Perhaps you moved the chair and ate the cookies yourself while sleepwalking the night before.All these hypotheses are possible, but intuitively it seems like some hypotheses are more likely than others. If you’ve seen your daughter access the cookies this way before but have never been burgled, then the ‘daughter hypothesis’ seems more plausible. If some expensive things from your bedroom and living room are missing and there is hateful graffiti on your door at the eye level of a very short person, then then ‘short thief’ hypothesis seems more plausible than before. If you suddenly remember that your friend ate a few cookies and broke the fridge door last night, the ‘broken fridge door’ hypothesis gains credibility. If you’ve never been burgled and your daughter is out of town and you have a habit of moving and eating things while sleepwalking, the ‘sleepwalking’ hypothesis is less bizarre.
But the weight you give to each hypothesis depends greatly on your prior knowledge. What if you had just been hit on the head and lost all past memories, and for some reason the most urgent thing you wanted to do was to solve the mystery of the chair in front of the fridge door? Then how would weigh the likelihood of the available hypotheses?
Occam’s Razor
Consider a different inductive problem. A computer program outputs the following sequence of numbers:
1, 3, 5, 7
Which number comes next? If you guess correctly, you’ll win $500.
In order to predict the next number in the sequence, you make a hypothesis about the process the computer is using to generate these numbers. One obvious hypothesis is that it is simply listing all the odd numbers in ascending order from 1. If that’s true, you should guess 9 to win the $500.
But perhaps the computer is using a different algorithm to generate the numbers. Suppose that n is the step in the sequence, so that n=1 when it generated ‘1’, n=2 when it generated ‘3’, and so on. Maybe the computer used this equation to calculate each number in the sequence:
2n − 1 + (n − 1)(n − 2)(n − 3)(n − 4)
If so, the next number in the sequence will be 33. (Go ahead, check the calculations.) But doesn’t the first hypothesis seem more likely?
The principle behind this intuition, which goes back to William of Occam, could be stated:
Among all hypotheses consistent with the observations, the simplest is the most likely.
The principle is called Occam’s razor because it ‘shaves away’ unnecessary assumptions.
For example, think about the case of the missing cookies again. In most cases, the ‘daughter’ hypothesis seems to make fewer unnecessary assumptions than the ‘short thief’ hypothesis does. You already know you have a daughter that likes cookies and knows how to move chairs to reach cookies. But in order for the short thief hypothesis to be plausible, you have to assume that (1) a thief found a way to break into your home, that (2) the thief wanted cookies more than anything else from your home, that (3) the thief was, unusually, too short to reach the top of the fridge without the help of a chair, and many other unnecessary assumptions.
Occam’s razor sounds right, but can it be made more precise, and can it be justified? We will return to those questions later. For now, let us consider another important part of induction, that of updating our beliefs in response to new evidence.
Updating Beliefs
You’re a soldier in combat, crouching in a trench. You know for sure there is just one enemy soldier left on the battlefield, about 400 yards away. You also know that if the remaining enemy is a regular army troop, there’s only a small chance he could hit you with one shot from that distance. But if the remaining enemy is a sniper, then there’s a very good chance he can hit you with one shot from that distance. But snipers are rare, so it’s probably just a regular army troop.You peek your head out of the trench, trying to get a better look.
Bam! A bullet glance off your helmet and you duck down again.
“Okay,” you think. “I know snipers are rare, but that guy just hit me with a bullet from 400 yards away. I suppose it might still be a regular army troop, but there’s a seriously good chance it’s a sniper, since he hit me from that far away.”
After another minute, you dare to take another look, and peek your head out of the trench again.
Bam! Another bullet glances off your helmet! You duck down again.
“Damn,” you think. “It’s definitely a sniper. No matter how rare snipers are, there’s no way that guy just hit me twice in a row from that distance if he’s a regular army troop. He’s gotta be a sniper. I’d better call for support.”
This is an example of updating beliefs in response to evidence, and we do it all the time.
You start with some prior beliefs, and all of them are uncertain. You are 99.99% certain the Earth revolves around the sun, 90% confident your best friend will attend your birthday party, and 40% sure that the song on the radio you’re listening to was played by The Turtles.
Then, you encounter new evidence — new observations — and update your beliefs in response.
Suppose you start out 85% confident the one remaining enemy soldier is not a sniper, leaving only 15% credence to the hypothesis that he is a sniper. But then, a bullet glances off your helmet — an event far more likely if the enemy soldier is a sniper than if he is not. So now you’re only 40% confident he’s not a sniper, and 60% confident he is. Another bullet glances off your helmet, and you update again. Now you’re only 2% confident he’s not a sniper, and 98% confident he is a sniper.
Now, I’m about to show you some math, but don’t run away. The math isn’t supposed to make sense if you haven’t studied it. Don’t worry; I’m going to explain it all.
There’s a theorem in probability theory that tells you how likely one observation is given some other observations. It’s called Bayes’ Theorem.
At this point you may want to take a break and read either tutorial #1, tutorial #2, tutorial #3, or tutorial #4 on Bayes’ Theorem. I’ll only say a little bit more about Bayes’ Theorem in this tutorial.
The short form of Bayes’ Theorem looks like this:
Let’s unpack this. The H refers to some hypothesis, and the E responds to some evidence. p(x) refers to the probability of x. The pipe symbol | means ‘given’ or ‘assuming’. So, Bayes’ Theorem reads:
[The probability of hypothesis H given evidence E] is equal to [the probability of evidence E given hypothesis H] times [the probability of hypothesis H] divided by [the probability of evidence E].
You can see how this tells us what we’d like to know. We want to know the probability of some hypothesis — some model of the world that, if correct, will allow us to successfully predict the future — given the evidence that we have. And that’s what Bayes’ Theorem tells us. Now we just plug the numbers in, and solve for p(H|E).
Of course, it’s not easy to “just plug the numbers in.” You aren’t an all-knowing god. You don’t know exactly how likely it is that the enemy soldier would hit your helmet if he’s a sniper, compared to how likely that is if he’s not a sniper. But you can do your best. With enough evidence, it will become overwhelmingly clear which hypothesis is correct.
At this point, I’ll let the tutorials on Bayes’ Theorem above do the heavy lifting on how to update beliefs. Let me get back to the part of induction those tutorials don’t explain: the choice of priors.
The Problem of Priors
In the example above where you’re a soldier in combat, I gave you your starting probabilities: 85% confidence that the enemy soldier was a sniper, and 15% confidence he was not. But what if you don’t know your “priors”? What then?
When using Bayes’ Theorem to calculate your probabilities, your choice of prior can influence your results greatly. But how can we determine the probability of a hypothesis before seeing any data? What does that even mean?
Legg (2008) explains:
Bayesians respond to this in a number of ways. Firstly, they point out that the problem is generally small, in the sense that with a reasonable prior and quantity of data, the posterior distribution P(h|D) depends almost entirely on D rather than the chosen prior P(h). In fact on any sizable data set, not only does the choice of prior not especially matter, but Bayesian and classical statistical methods typically produce similar results, as one would expect. It is only with relatively small data sets or complex models that the choice of prior becomes an issue.
If classical statistical methods could avoid the problem of prior bias when dealing with small data sets then this would be a significant argument in their favour. However Bayesians argue that all systems of inductive inference that obey some basic consistency principles define, either explicitly or implicitly, a prior distribution over hypotheses. Thus, methods from classical statistics make assumptions that are in effect equivalent to defining a prior. The difference is that in Bayesian statistics these assumptions take the form of an explicit prior distribution. In other words, it is not that prior bias in Bayesian statistics is necessarily any better or worse than in classical statistics, it is simply more transparent.
But this doesn’t solve our problem. It merely points out that other statistical techniques don’t fare any better.
What we’d like is to reduce the potential for abusing one’s opportunity to select priors, and to use agreed-upon priors when possible. Thus, many standard “prior distributions” have been developed. Generally, they distribute probability equally across hypotheses.
But to solve the problem of priors once and for all, we’d like to have an acceptable, universal prior distribution, so that there’s no fuzziness about the process of induction. We need a recipe, and algorithm, for finding out the truth. And there can’t be any fuzziness in an algorithm.
Binary Sequence Prediction
[intro to binary sequence prediction]
Solomonoff and Kolmogorov
[summarize the contributions of Solomonoff and Kolmogorov]
Solomonoff Lightsaber
[explain the result: Solomonoff Induction]
Approximations
[survey a few of the approximations of Solomonoff Induction in use today]
Philosophical Implications
[survey of some philosophical implications of unviersal induction]
Notes
a This quote and some of the examples to follow are from Legg (2008).
References
Legg (2008). Machine Superintelligence. PhD thesis, Department of Informatics, University of Lugano.
Rathmanner & Hutter (2011). A philosophical treatise of universal induction.
- An Intuitive Explanation of Solomonoff Induction by 11 Jul 2012 8:05 UTC; 159 points) (
- 20 Jan 2012 6:56 UTC; 4 points) 's comment on Human consciousness as a tractable scientific problem by (
- 17 Jun 2012 5:30 UTC; 1 point) 's comment on Open Problems Related to Solomonoff Induction by (
- 6 Jun 2012 16:59 UTC; 1 point) 's comment on Open Problems Related to Solomonoff Induction by (
- 8 May 2012 2:47 UTC; 0 points) 's comment on [SEQ RERUN] When Science Can’t Help by (
At our current state of knowledge, we cannot have any mathematical theorem about Solomonoff induction finding “truth” in our world, because we don’t know the mathematical underpinnings of our world. We can only have theorems about SI outperforming certain other methods of inference on all possible inputs.
The simplest of such theorems says SI outperforms all computable algorithms at the log-score game. This is not a very difficult feat because SI is allowed to be uncomputable. (SI can be constructed as a “mixture” of all computable algorithms, and arbitrarily changing the weights in the “mixture” will give you different but equally powerful versions of SI.)
A more interesting theorem says that SI is also optimal in its own class: possibly uncomputable distributions that can be specified as “lower-semicomputable semimeasures”. That’s genuinely surprising because most classes of distributions don’t have an optimal member within the class itself (see p.11 of Hutter’s slides, with only one “Yes” on the main diagonal). But it doesn’t seem very relevant to practice and somewhat undermines the special status of SI, because you can make universal mixtures of an even wider and more uncomputable class that will outperform SI the same way SI outperforms computable algorithms.
Thinking further along these lines, finding truth is likely to be an instrumental value rather than a terminal one, so we probably need to look at games other than the log-score game and see if SI can still make money when the rules for rewards and punishments are changed. This road leads to interesting unsolved questions, I wrote about some of them :-)
Right now I’m not sure there’s any “deep truth” about applying SI to our world. In particular, there’s the troubling possibility that SI’s answers to your questions will depend on how you intend to use these answers!
When you describe a problem as conclusively solved when it’s not, you risk making your readers lose their sense of wonder. Maybe you could take a page from Feynman? He wasn’t afraid of giving the reader a glimpse of unsolved problems now and then. Eliezer also wrote about the perils of presenting a topic to students as if it were set in stone.
Solomonoff induction isn’t particularly optimal at real tasks. Its performance on real-world tasks will depend somewhat on the choice of reference machine.
I don’t think so. I put my criticism of that idea there.
I’m confused. You don’t need SI to find the exact truth if you have all information, but without all information, SI finds only the probable. But even then it’s based on Kolmogorov complexity which is not computable, so SI is not computable, so “exactness” isn’t ideal. Isn’t it just that other approaches do worse on average?
Remember that crazy comic book/cartoon spin off Bayes-Factor? Where Razor, Hyperparameter, Conjugate, and the gang fight Null, Two-Tails, Propensity, and all the other epic villains? The Bayes-Factor team could handle this problem!
Yes of course. We are always engaged in a probabilistic search for truth.
The ideal, if we had infinite computing power, would be to discover what is most probably true. But the ideal is uncomputable, so we seek out approximations of that ideal that are computable, especially ones that are computable within a certain domain of problems, like the Bayesian information criterion.
And a good thing, too, if I understand Tarski’s undefinability theorem correctly.
While you don’t have to mention Tarski, you do need to remember that some smart people will object to the claim that (in principle) there exists a recipe for truth.
I assume some of this, the parts I didn’t write, are not meant to be blockquoted?
It was meant as a draft of an alternative, based on my limited understanding. I see two thresholds where I think you see one. The recipe is uncomputable, so it would take longer than “long after the heat death of the universe” or any other finite amount of time to finish. Also, the computable functions most similar to it would take more steps than there is time to do.
Yes, approximations of Solomonoff Induction need to be not just computable but also tractable.
It’s a good sign our understandings match, but consider that after simply reading the explanation I thought you meant something other than what you did.
I’m going to write an (introductory) class paper on Kolmogorov complexity and Solomonoff induction soon-ish (within the next 2-3 months), and I already intended to write a less serious article or two about it while I’m at it (though not necessarily for LW). I’m not committing to anything, but I might well take you up on this, if it’s still open by then.
I just wrote up my understanding of Solomonoff induction. Unfortunately it’s a PDF, since I wanted to try out GNU TeXmacs. This might work for the “Binary Sequence Prediction” section:
http://www.ocf.berkeley.edu/~ehy/solomonoff_lightsaber.pdf
It’s my first time doing this kind of thing, so please tell me how I can improve! Also, I tried to be careful, but I’m only a freshman, so I may have seriously misunderstood some things...
Updated.
Is the .pdf link still working? I am unable to access it.
Hmm, this doesn’t show up in the recent comments section. Is this because the post is too old or because I am too new?
It was said by others (like lessdazed) but I insist on the difference between “very long to compute” and “not computable”. Finding the winning moves in chess may require longer than the universe lifetime, but is theoricaly possible given enough computer power. Kolmogorov complexity is not computable, due to the impossibility of having an halting oracle. A full paragraph about those issues and how to get around them would be worth it IMHO. Reading the text without knowing that makes it feel that SI is just the same kind of “too long to be directly used” problem like winning at chess, while in fact it’s a “level” harder (not even computable).
In my view this is not the main practical problem with SI. In practice it doesn’t matter much that you can’t find the Kolmogorov complexity; you can still get closer and closer to it and as you do so you will get closer and closer to the truth.
The main practical problem with SI as a principle of statistics is that the Kolmogorov complexity is not an absolute quantity in the limited-data regime where most statistical analysis takes place. Imagine a psychologist who hands out N=400 surveys with 10 questions apiece, and each question has 8 possible answers. Then a naive encoding of the data set is 12,000 bits; this quantity is small compared to the Turing machine translation constants and so your estimate of the KC is entirely dependent on your choice of Turing machine.
The main practical problem with Solomonoff Induction surely is our lack of good quality computable approximations to it. If you don’t have much data, you can usually just gather some more data—there’s a whole world of it out there. It’s a trivial solution, but it resolves an awful lot of problems.
I have a question about Solomonoff Induction:
Solomonoff Induction does not deal directly with sense data that is known to be uncertain. Uncertainty in sense data can affect predictions. So, for example, if I know apriori that my bit detector has a 40% chance of producing a random bit—and so far I have seen: 11101101… - then my estimate of the next symbol being a “0” would be higher as a result of my apriori knowledge.
Is there a “recognised” way of converting sensory biases into “plausible prefixes”—to deal with this issue? Or is there some other way of handling it?
Is it really necessary to explain the concept of algorithm in an “intuitive” explanation? This looks like it will turn into a decent explanation for the (IQ > 115 && curious) crowd. I think it’s going to have to sacrifice a lot more detail if you want to get down to the (IQ ~ 100 && not particularly curious) crowd. Maybe I will make an attempt.
Luke, can you fix the link to the “Rathmanner & Hutter” article in the references section? It’s missing an “http://” in front of it.
I made an introduction to Solomonoff induction a while ago.
I have an introduction to sequence prediction and an introduction to machine forecasting also.
These are in the public domain—so feel free to reuse.
In addition to the issues raised by cousin_it I’d like to raise another related issue: If one uses a strict Solomonoff prior, then one can’t ever have a non-zero probability for a sequence to be uncomputable. So say for example we look at the first few thousand digits of the fine-structure constant in binary and find that they correspond exactly to some easily specifiable uncomputable sequence (say involving the Halting problem or the Busy Beaver function). An entity using Solomonoff induction cannot, no matter how many digits it sees, assign the obvious hypothesis any likelyhood.
I think this issue is a red herring. SI can be viewed as a purely predictive device that doesn’t assign probabilities to hypotheses, only to observable events. You will find that even if you’re endowed with the privileged knowledge that the fine structure constant is a halting oracle, that knowledge provably can’t help you win a prediction game against SI, because your computable brain is not very good at predicting a halting oracle. When you’re losing to a machine, saying it can’t “really entertain a hypothesis” is like saying a submarine can’t “really swim” or Rybka can’t “really play chess”.
We can frequently compute the first several terms in a non-computable sequence, so this statement seems false.
When people talk about the impossibility of “winning” against SI, they usually mean it’s impossible to win by more than a constant in the long run.
In that case, I’d say that your response involves special pleading. SI priors are uncomputable. If the fine structure constant is uncomputable, then any uncomputable prior that assigns probability 1 to the constant having its actual value will beat SI in the long run. What is illicit about the latter sort of uncomputable prior that doesn’t apply to SI priors? Or am I simply confused somehow? (I’m certainly no expert on this subject.)
SI belongs to a class of priors that could be described as “almost computable” in a certain technical sense. The term is lower-semicomputable semimeasure. An interesting thing about SI is that it’s also optimal (up to a constant) within its own class, not just better than all puny computable priors. The uncomputable prior you mention does not belong to that class, in some sense it’s “more uncomputable” than SI.
Uncomputable sequences have computable approxiomations. Solomonoff induction could do very well at predicting such sequences. However, it wouldn’t assume that they are uncomputable. Why should it? That is a bizarre hypothesis—not an “obvious” one. It is surely more likely that the observed finite prefix was generated by some computable method.
Being a less likely hypothesis is not the same thing as having probability zero.
Solomonoff induction only deals with finite sequences. It doesn’t assign p=0 to any sequence consistent with its observations so far. Uncomputable sequences are necessarily inifinite—and though Solomonoff induction can’t handle them, neither can the observable universe. I think that the case that they matter remains to be made.