What is the advantage of the Kolmogorov complexity prior?
As I understand it, Solomonoff induction works by a procedure loosely stated as follows: we start with a Kolmogorov complexity universal prior (a formalized Occam’s razor), then update using Bayesian inference any time we see new data.
More precisely, suppose the universe is a computable sequence X of 0s and 1s. We want to predict the nth term before it happens, using the first n-1 terms.
To do this, we use the following ‘algorithm’:
1. List all computable sequences in increasing order of the complexity of the algorithm used to generate them, to get a list A_1, A_2, A_3, …
2. We start with the prior P(A_k) = 1/2^k.
3. With this prior, we expect to see a string b of observations with probability P(b) = ∑_k P(A_k)·[b < A_k], where [b < A_k] = 1 if A_k starts with the string b, and [b < A_k] = 0 otherwise.
4. After we see a string b of observations, we update our beliefs using Bayes’ theorem: P(A_k|b) = P(b|A_k)·P(A_k)/P(b) = [b < A_k]·P(A_k)/P(b), and similarly P(bc|b) = P(b|bc)·P(bc)/P(b) = P(bc)/P(b), using P(b|bc) = 1 since bc extends b.
I’m putting the word ‘algorithm’ in quotes because there is no list of all computable sequences for which [b < A_k] is a computable function of k and b, let alone one in which the sequences are listed in order of increasing Kolmogorov complexity. You can think of it as an algorithm relative to an oracle that computes such a function.
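To make the ‘algorithm’ concrete, here is a toy sketch in Python (my own illustration, not part of the original construction): it replaces the uncomputable enumeration with a tiny hand-picked list of computable sequences and runs the same prior-plus-Bayes-update scheme on them.

```python
# Toy stand-in for the 'algorithm': Bayesian updating over a small,
# hand-picked list of computable sequences (not the full enumeration).

def all_zeros(n):       return 0
def alternating(n):     return n % 2
def ones_at_squares(n): return 1 if int(n ** 0.5) ** 2 == n else 0

hypotheses = [all_zeros, alternating, ones_at_squares]
prior = [2.0 ** -(k + 1) for k in range(len(hypotheses))]
prior = [p / sum(prior) for p in prior]   # renormalize the truncated 1/2^k prior

def posterior(b):
    """P(A_k | b): zero out hypotheses that don't start with b, renormalize."""
    w = [p if all(h(i) == bit for i, bit in enumerate(b)) else 0.0
         for h, p in zip(hypotheses, prior)]
    total = sum(w)
    return [x / total for x in w] if total > 0 else w

def prob_next_is_one(b):
    """P(next bit = 1 | b): total posterior weight of hypotheses predicting 1."""
    return sum(w for h, w in zip(hypotheses, posterior(b)) if h(len(b)) == 1)

b = [alternating(i) for i in range(6)]    # observe 0,1,0,1,0,1 from the 'true' X
print(posterior(b))                        # all mass lands on `alternating`
print(prob_next_is_one(b))                 # 0.0: confidently predicts the next bit is 0
```

With the true universe set to the alternating sequence, the posterior mass collapses onto that hypothesis after a few observations, which is the finite shadow of the theorem below.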
If we follow this ‘algorithm’, we get a nice theorem:
For all ε > 0 there is some N such that if b is an initial segment of X of length at least N, then P(bc|b) > 1 − ε if bc is an initial segment of X, and P(bc|b) < ε otherwise.
So in the limit, we can correctly extrapolate from what we’ve seen with high confidence.
Proof: Let K = {k : X = A_k}. Let p_0 = ∑_{k in K} 1/2^k. Let M ≥ -log_2(ε·p_0), so that 2^{-M} ≤ ε·p_0. Let b be an initial segment of X long enough that b is not an initial segment of any A_m with m ≤ M and m not in K. Then if bc is an initial segment of X,
P(bc) = ∑_k (1/2^k)·[bc < A_k] ≥ p_0,
but on the other hand
P(b) = ∑_k (1/2^k)·[b < A_k] ≤ p_0 + ∑_{k > M} 1/2^k = p_0 + 2^{-M} ≤ (1 + ε)·p_0.
Therefore,
P(bc|b) = P(bc)/P(b) ≥ p_0/((1 + ε)·p_0) = 1/(1 + ε) > 1 − ε.
If instead bc is not an initial segment of X, let c′ be the true continuation of the same length as c; then bc and bc′ are incompatible extensions of b, so P(bc|b) ≤ 1 − P(bc′|b) < ε.
What if the universe isn’t actually computable? Maybe there is some true randomness in the universe, as quantum mechanics suggests? If you assume the universe is plucked at random from some computable probability distribution on the set of all binary sequences, you get almost the same result. If you repeat the same procedure as above, but instead of listing all computable sequences you list all computable probability distributions on the set of binary sequences and do a similar sort of Bayesian inference, you again get a theorem telling you that as you make observations, your distribution (conditioned on what you’ve seen so far) converges to the true distribution (conditioned on what you’ve seen so far).
This is all well and good. But in the end, what does it have to do with Kolmogorov complexity? The exact prior distribution we started with and the order of the list of all computable sequences were not used in any essential way in the proof. What’s the advantage of Occam’s razor? After all, we could skip steps 1 and 2 completely, start with an arbitrary list of all computable sequences and an arbitrary prior supported on the set of computable sequences, and we would end up with the exact same theorem. (For the random-universe version, we need to start with a prior that is a convex combination of every computable probability distribution, but again, we don’t need Kolmogorov complexity at all.)
Is there a stronger theorem that we get when we use the Kolmogorov complexity prior that we don’t get otherwise? Why are people making a big deal about the relationship between Solomonoff induction and Occam’s razor, if the part of Solomonoff induction that corresponds to Occam’s razor is not necessary for Solomonoff induction to work? Occam’s razor seems to work well in practice, but I don’t see how it’s in any way necessary (or even useful!) for the math.
So, why use a prior generated like this? There are a number of arguments:
1) It seems that “Occam’s razor” is a fundamental part of inductive inference, and that the Kolmogorov approach formulates this successfully. “Simple” strings are ones that can be generated using short programs; “complex” strings are ones that can only be generated using long programs (or short programs that are then post-padded with a large string of input data).
This is less than convincing, because it reduces to Occam worship, which is the point you raise. Also, there are other ways of formulating Occam’s razor to measure simplicity of hypotheses. Most of these do reduce to special cases of Kolmogorov complexity though.
2) Approximate computability, combined with open-mindedness. You need your prior to be in some sense computable, at least by a series of approximations, or you can’t actually use it.
The Solomonoff prior is computable in this weak sense… you can compute it by a series of increasing approximations. You run the computer in parallel over all its possible inputs, and every time it stops with output A after reading in n input bits, increment the approximation to Ps(A) by 1/(2^n).
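A rough sketch of that approximation scheme (my own illustration): the `toy_machine` below is a stand-in and is not universal; it only exists so the enumeration loop actually runs. A real construction would dovetail a universal prefix machine over all inputs and ever-larger step budgets.

```python
# Approximate Ps(A) from below: try every candidate program up to some length,
# and whenever one halts with output A after reading n bits, add 2^-n.
from itertools import product

def toy_machine(program, max_steps):
    """Stand-in prefix machine: '1' + x outputs x and halts; anything else never halts.
    Returns (halted, output, bits_read)."""
    if program.startswith('1'):
        return True, program[1:], len(program)
    return False, None, 0   # treated as 'still running after max_steps'

def approx_prior(target, max_len, max_steps=1000):
    ps = 0.0
    for n in range(1, max_len + 1):
        for bits in product('01', repeat=n):
            halted, out, bits_read = toy_machine(''.join(bits), max_steps)
            if halted and out == target and bits_read == n:
                ps += 2.0 ** -n          # this halting program contributes 2^-n
    return ps                            # a lower bound that grows with both limits

print(approx_prior('0101', max_len=6))   # only the program '10101' contributes: 2^-5
```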
On the other hand, you also want your prior to give some weight to every hypothesis whose consequences you can compute. What this means is that for any probabilistic hypothesis H, where you can compute H(A), you want your prior to assign some probability h to H being correct (i.e. a true probabilistic law). This means there is some constant factor h such that in your prior you want P(A) >= h.H(A) for every possible output A.
It turns out that these two conditions require you to use a prior which is the Solomonoff prior for some universal computer. The only choice you get is which universal computer to pick. But since any universal computer can simulate any other (with a suitable program) your choice doesn’t affect the values Ps(A) by more than a constant multiplicative factor.
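For reference, the dominance property being appealed to here can be written as follows; this is the standard statement (see e.g. Li and Vitanyi), with the machine-dependent constants left loose:

```latex
% Solomonoff prior for a universal prefix machine U, and the dominance property:
\[
  P_s(A) \;=\; \sum_{p \,:\, U(p) = A} 2^{-|p|},
  \qquad
  P_s(A) \;\ge\; c_H \cdot H(A) \quad \text{for every computable semimeasure } H,
\]
% where the constant c_H > 0 depends on H (roughly 2^{-K(H)}, the weight of a
% shortest program computing H on U) but not on A.
```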
This seems to reflect what we want out of Bayesian reasoning. The choice of prior is to some extent subjective, but it is not that subjective. You can’t arbitrarily make up a prior and still be rational.
3) Decomposition of a hypothesis into independent binary postulates. Suppose we want to explain our evidence A (expressed as a binary string) by some sort of underlying hypothesis Ha. We would like to think of Ha as making a number of independent propositional claims about the world, each of which could just as easily be true as false. Basically each of its propositional claims is a brute irreducible fact or “bit”. So if Ha consists of brute propositional claims p1, …, pn, it seems we should assign it prior probability 1/2^n.
Now, suppose we want Ha to unambiguously explain A, so that there is no doubt at all that it does indeed explain A. Well in that case, we want the propositional claims included in Ha to logically imply that the evidence is A and not anything else. Interestingly, this works for probabilistic explanations as well. We just have to supplement the probabilistic hypothesis H with some additional binary propositions identifying where on the distribution function of H the actual A occurs. (Is it in the top half? Yes. Well is it in the top quartile then? No. Well is it between the 5th and 6th octile then? Yes… Continue until A is fixed uniquely).
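A small sketch of that halving procedure (my own toy illustration): locate A’s slice of H’s cumulative distribution by repeated ‘which half?’ questions; the number of questions comes out to roughly -log2 H(A).

```python
import math

def questions_to_pin_down(dist, target):
    """Count 'which half?' questions until the current dyadic interval sits
    entirely inside target's slice of the cumulative distribution of dist."""
    lo = 0.0
    for outcome, p in dist:              # build target's interval [lo, hi)
        if outcome == target:
            hi = lo + p
            break
        lo += p
    cur_lo, cur_hi, questions = 0.0, 1.0, 0
    point = (lo + hi) / 2                # a point safely inside target's slice
    while not (lo <= cur_lo and cur_hi <= hi):
        mid = (cur_lo + cur_hi) / 2
        if point >= mid:                 # "is it in the top half?" -- yes
            cur_lo = mid
        else:                            # -- no
            cur_hi = mid
        questions += 1
    return questions

H = [('a', 0.5), ('b', 0.25), ('c', 0.125), ('d', 0.125)]
for outcome, p in H:
    print(outcome, questions_to_pin_down(H, outcome), 'vs', -math.log2(p))
```

For dyadic probabilities the counts match -log2 H(A) exactly; in general they differ by a couple of questions, which is why appending these answers to the hypothesis costs about -log2 H(A) extra bits.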
This logical entailment of A by Ha also means that there has to be some formal proof which allows us to deduce A from Ha. Now a nice fact about proofs is that if they exist, there is some short algorithm which can find them and verify them. So there is some computer which can take a description of Ha and use it to find the proof that the evidence is A, verify it, and then output A. Further, this computer turns out to be another sort of universal computer, and such that the minimal number of bits for expressing a hypothesis corresponds to a minimal program for computing the consequences of that hypothesis. So we get back to the Kolmogorov prior again, though with a particular choice of universal computer (one which is a universal theorem prover).
There are other lines of argument as well. One of them is a pragmatic one, that the Solomonoff prior is in a strong sense optimal, in that it will converge on a true underlying distribution at least as fast as any other computable (or approximately computable) prior, after some initial finite overhead of evidence. So even if you happen to be remarkably lucky with a different prior, and it converges on true facts and physical laws really quickly, the Solomonoff prior will still catch up with you fast.
Thanks for the summaries. In “P(A) >= h.H(A)”, does the period stand for multiplication, and does H(A) mean the probability of A given H?
Argument 2 is extremely powerful. Where can I find the argument spelled out?
Yes, and yes.
A classic text is Li and Vitanyi, “An Introduction to Kolmogorov Complexity and Its Applications”; see: http://homepages.cwi.nl/~paulv/kolmogorov.html
If you want an online (condensed) version but without the formal proofs, then have a look at: http://www.scholarpedia.org/article/Algorithmic_probability
Priors do tend to get swamped by the data—in a lot of cases.
Choosing a sensible prior does help when you don’t have very much data in some domain, though.
Depending on the complexity of the prior it can take much more data than you’d expect.
It seems to me that the correct reasoning behind Occam’s razor is that the more assumptions that a hypothesis must make the lower the prior probability must be. Likewise, the more specific a hypothesis is, the lower the prior probability. For example, the prior probability that “a red F150 will pass by my house within the next five minutes” is lower than the prior probability that “a motor vehicle of some sort will pass by my house within the next five minutes” for reasons that I think are fairly self-explanatory.
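One way to make that self-explanatory reason explicit is the plain conjunction rule: every added detail multiplies in a factor that is at most 1, e.g.

```latex
\[
  P(\text{red} \wedge \text{F150} \wedge \text{passes})
  \;=\; P(\text{passes}) \cdot P(\text{F150} \mid \text{passes}) \cdot P(\text{red} \mid \text{F150}, \text{passes})
  \;\le\; P(\text{passes}).
\]
```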
Just a pointer, your description of the prior here is not correct when using Kolmogorov complexity (or Solomonoff induction).
For each sequence of bits A, you need to take as prior P(A) = 1 / (2 ^ K(A)) where K(A) is the length of the shortest binary program whose output is the sequence A.
On a technical note: the programs need to form a “prefix-free” set, or this prior will not give you a converging sum of probabilities. One way to do that is to encode a special “END” symbol at the end of each program, so that no program is a prefix of any other. Another approach is to declare the total length of the program at the start of the program, so that the compiler (or interpreter) knows where to stop.
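A quick sketch of why the prefix-free condition matters (the code and the toy program sets are my own illustration): for a prefix-free set, the Kraft sum ∑ 2^-|p| is at most 1, so the weights behave like (sub)probabilities; without the condition the sum can exceed 1, and over all lengths it diverges.

```python
# Kraft sums for toy program sets: prefix-free sets keep sum(2^-len) <= 1.

def is_prefix_free(programs):
    return not any(p != q and q.startswith(p) for p in programs for q in programs)

def kraft_sum(programs):
    return sum(2.0 ** -len(p) for p in programs)

prefix_free     = ['0', '10', '110', '111']            # no program extends another
not_prefix_free = ['0', '1', '00', '01', '10', '11']   # every string of length <= 2

print(is_prefix_free(prefix_free), kraft_sum(prefix_free))          # True 1.0
print(is_prefix_free(not_prefix_free), kraft_sum(not_prefix_free))  # False 2.0
# Allowing every string of every length adds 1 per length, so the sum diverges,
# and the 'prior' can no longer be normalized.
```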
Solomonoff induction is not quite the same, but is very similar. In this case, we imagine giving a computer a completely random program and data, with each bit of input generated independently and with equal probability of being 0 or 1, and consider the probability that it will halt with output string “A”; then this is the prior Ps(A). These priors can be shown to be equivalent up to a small multiplicative factor. It’s fairly easy to see why the Solomonoff prior Ps(A) is always a bit bigger than the Kolmogorov complexity prior P(A) (since the computer might have been given the very shortest program for generating A as random input). It takes more effort to show that it is not very much bigger, and that the difference doesn’t matter in practice.
Following that technical note, I’ll address your question below about why use it.
The limiting argument is mostly irrelevant. Frequentist procedures also have nice properties in the limit, but so what? The justification for Occam’s razor is philosophical, not mathematical, and AFAIK it doesn’t justify any particular mathematical form of the razor.
I use the term “justification” loosely above because it’s sort of circular. See Where Recursive Justification Hits Bottom
Not a problem. Suppose you flip a quantum coin ten times. If you record the output, the K-complexity is ten bits. As such, there’s a 1/1024 prior probability of getting that exact output. This is exactly what you’d get if you assumed it was random.
Basically, K-complexity is treating the very laws of physics as random. Any randomness on top of that works the same way as it would as part of that.
The problem is that it seems that things that are definable but not computable should be above random chance. For example, the K-complexity of a halting oracle is infinite, but it can be defined in finite space. Would the probability of the fine structure constant being a halting oracle be infinitesimal?
One reason to use K-complexity is that so far, it’s worked far better than anything else. As far as we know, we can fit the laws of physics on a note card, yet the universe contains well over 10^80 particles, and don’t get me started on the amount of computing power necessary to run it.
Assuming a lawful universe, doing the distribution per algorithm is better than per universe. But there are many possibilities for how to do the distribution per algorithm. Even selecting a different (Turing-complete) programming language provides a different K-complexity.
As I understood the question in the article, if we already do the distribution per algorithm, the only necessary thing is to give a nonzero starting probability to any algorithm. This is enough; the Bayesian updates will then make the following predictions more reliable.
So the question is: assuming that we give nonzero prior probability to any algorithm:
a) is there any advantage in using Kolmogorov complexity? or
b) are in some sense all other prior distributions just variants of K-complexity? or
c) something else...?
But you can’t fit a description of the current state of those particles on a note card, which you would need in order to actually make predictions.
The laws of physics, combined with the initial conditions of the universe, are sufficient to describe the state of all the particles for all eternity.
We don’t really have much of an idea of what the initial conditions are, but there’s no reason to believe that they’re complicated.
Are there any reasons to believe they’re not complicated that don’t rely on assuming a K-complexity prior?
No, nor is there a reason to assume that anything else we don’t know about physics isn’t complicated.
That being said, the probability that even just the parts we know are so simple by coincidence is vanishingly small.
Not when you realize that the parts we know are the parts that were simple enough for us to figure out.
No, they’re the parts that are obvious enough to figure out. If there was something complicated that made a big difference, we’d have noticed it, and just not figured it out. The parts that we don’t understand are parts that make such a tiny difference that they can’t properly be experimented with.
This kind of depends on what you mean by “big difference” and “complicated enough”.
For example, I can see humanity not yet detecting that spontaneous fission isn’t random but decided by a really good random number generator.
One might also imagine a ridiculously complicated program that makes elementary particles appear to behave randomly, except that when a huge number of such particles are “assembled” in a strange loop of sufficient complexity to be considered “a human-level self-aware entity” the pseudo-random micro-behavior leads to powerful effects at the macro-scale. Pretty much anything about what brains do that we don’t yet understand in detail could be hypothesised to arise from this. (I.e., the gods jiggle the dice to make humans, e.g., risk-averse. Or send hurricanes towards groups of people who aren’t capitalist enough, whatever.)
Of course, such hypotheses are very complex and we’d usually say that it’s unreasonable to contemplate them. But using Occam’s razor would be circular in this particular thread.
There are infinitely many possible realities that look simple to program but aren’t, but most complicated realities don’t look nearly this simple. The Copenhagen interpretation looks the same as Many Worlds, but that’s because it was discovered while trying to figure out this reality, not because most possibilities do.
But no alternative has been put forward where this is more likely, let alone one that gives nearly as high a probability to a reality that looks like this as Occam’s razor does.
Well, yeah (I mean, there might have been, I just don’t know of any). Don’t get me wrong, I’m not arguing against Occam’s razor.
I just meant that you seem to have given the assertion “the probability that even just the parts we know [about physics] are so simple by coincidence is vanishingly small” as a not-quite-but-kind-of justification for Occam’s razor (even though in the previous paragraph you said there’s no reason to think not-yet-known laws of physics aren’t complicated; the “that being said” seems to kind of contradict it). But to judge that the probability of a coincidence is small, or to say “most complicated realities don’t look this simple”, you would need either Occam’s razor (simpler is likelier), or some kind of uniform prior (all possible realities are equally likely, and there are more of the complicated-looking ones), or another alternative that hasn’t been put forward yet, as you say. Thus, the circularity.
If for some reason we thought that extremely complicated realities that simulate very simple rules at some scales are extremely more likely to observe than all others, then “the probability that even just the parts we know are so simple” would be large. I could see some anthropic argument for this kind of thing. Suppose we discover that conscious observers cannot exist in environments with too complicated behavior; then we’d expect to only see environments with simple behavior; but (with a “uniform prior”) there are vastly more complex systems that simulate simple behaviors within than simple ones. (For any simple behavior, there is an infinity of much more complicated systems that simulate it in a part of themselves.) Thus, under the “limit complexity for observers” hypothesis we should expect a few layers of simple rules and then an incomprehensibly complicated system they arise out of “by accident”.
I know it’s not a problem. I explained exactly how to modify Solomonoff induction to handle universes that are generated randomly according to some computable law, as opposed to being generated deterministically according to an algorithm.
Maybe it is, maybe it isn’t. Maybe your definition of Kolmogorov complexity is such that the Kolmogorov complexity of every string is at least 3^^^3, because you defined Kolmogorov complexity using a really dumb universal machine. Maybe it’s 2, because you programmed a universal machine that was really good at compressing the string 0101110010, because you like it, and it so happens that was what you flipped. If you flip a coin a very large number of times, it’s very likely that the Kolmogorov complexity of the output is close to the number of flips, but it could be much smaller due to chance.
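For what it’s worth, the counting argument behind “very likely close to the number of flips” is standard (a sketch, ignoring machine-dependent constants):

```latex
% At most 2^{n-c} - 1 programs are shorter than n - c bits, so at most that many
% strings can have complexity below n - c.  For a uniformly random n-bit x:
\[
  \Pr\big[\, K(x) < n - c \,\big] \;\le\; \frac{2^{\,n-c} - 1}{2^{\,n}} \;<\; 2^{-c}.
\]
```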
False, because you don’t actually know the complexity was 10.
No, it’s not doing that at all. Using a complexity-weighted prior is assuming that the universe follows some random computable law, which is more likely to be simple than complex. I suppose this is true to the extent that any probability distribution on a countable set vanishes in the limit (for any enumeration of the countable set!), but I see no reason to bring Kolmogorov complexity into it.
I have no idea how to make sense of this question. Are infinitely many bits of fine structure constant even physically meaningful? If beyond a certain point of precision, the fine structure constant can’t be measured even in principle because it has no detectable physical effects on accessible time/space scales, it makes no sense to even have an answer to the question “is the fine structure constant a halting oracle?” let alone the probability of such an event.
I’ll grant you that Occam’s razor works well in science. That wasn’t the question. The question is what is the advantage, if any, of starting with a Kolmogorov prior for Solomonoff induction, as opposed to any other arbitrary prior. You haven’t answered my question.
The page ReaysLemma gives some useful links to why simpler theories are more likely to be correct, other things being equal.
Hmm, if you start with a ‘uniform’ prior, akin to feeding random strings to a universal Turing machine, or running random Turing machines on random data, it seems to me that if you run the machine for a while and then look at it, you’ll see either a halted machine or a machine doing something really dull. Of the interesting behaviours, it seems to me it would predominantly be very simple algorithms munching the data, in a way not unlike a simulation of a universe with a very complicated state and very simple laws of physics. A sequence of transitions is exponentially unlikely over its length not to halt or degenerate.
The easy-to-visualize example of how simple algorithms arise from randomness is Conway’s game of life with random start state. The complicated structures will be very rare. You’ll end up with predominantly very simple ‘machines’ doing very simple things; even a glider gun will be very rare.
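If you want to poke at this yourself, here is a minimal sketch (my own toy code, using numpy): run Conway’s game of life from a uniformly random grid and watch the activity collapse into sparse, simple debris.

```python
import numpy as np

def life_step(grid):
    """One Game of Life step on a toroidal grid."""
    neighbours = sum(np.roll(np.roll(grid, di, 0), dj, 1)
                     for di in (-1, 0, 1) for dj in (-1, 0, 1)
                     if (di, dj) != (0, 0))
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(np.uint8)

rng = np.random.default_rng(0)
grid = rng.integers(0, 2, size=(128, 128), dtype=np.uint8)   # random start state

for step in range(1, 201):
    grid = life_step(grid)
    if step % 50 == 0:
        print(f"step {step}: {grid.sum()} live cells")   # the soup burns down
# Typically the random soup settles into scattered still lifes, blinkers and the
# odd glider within a few hundred steps; big complicated structures essentially
# never form on their own.
```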
This will most probably not answer your question but I hope you will find this interesting.
The Kolmogorov complexity prior is, as you have stated, merely a formalization of Occam’s razor. Another possibility for formalizing Occam’s razor is Schmidhuber’s prior, in which, roughly speaking, the probability of an algorithm is determined by its speed. It has the advantage of being computable in the limit, as opposed to the other option.
The interesting thing is now that we can formalize various inductive hypotheses as priors, such as “everything goes” as a uniform distribution. There was a discussion on this a few weeks ago. The point is that, to my knowledge, beyond practicability there is no theoretical justification of Occam’s razor. For an atheist, though, Occam’s razor has the nice property of cutting off the probability of any algorithm that incorporates the concept of a god.
A uniform distribution on what? If you start with a uniform distribution on binary sequences, you don’t get to perform inductive reasoning at all, as the observables X(1), X(2), etc. are all independent in that distribution. If you wanted to start with a uniform distribution on computable universes, you can’t, because there is no uniform distribution with countable support.
A uniform distribution on all algorithms, that is, a uniform distribution on all binary strings. Intuitively we can compute probability ratios for any two algorithms given evidence, since the identical prior probability cancels in that ratio.
But, as you say, the problem is that there is actually no uniform distribution with countable support. At best, we can circumvent the problem by computing the probability ratios which is almost as good.
Did you find the rest of my post useful?
God is the least of all problems with a uniform (improper) prior. The winning hypothesis will always, provably, be a “just so” hypothesis, where everything that happened was absolutely necessary with probability 1. However, if the truth is simpler, then these sorts of hypotheses “overfit the curve” and give predictions different from what humans make. Every time a scientist makes a correct prediction, the “just so” hypothesis would imply that they were just lucky.
That depends on the language you use for your Kolmogorov prior.
Er… isn’t this like saying “To my knowledge, there is no theoretical justification of biblical inerrancy. For a Christian though, biblical inerrancy has the nice property of cutting off the probability of any algorithm that incorporates atheism”?
Occam’s razor better have an independent reason to be sensible, other than reaffirming a previous (held on which basis?) belief in atheism.
Not exactly. There is no way to discount biblical evidence a priori. We can, however, massively discount biblical evidence by deriving empirical predictions and finding that they do not match reality. For example, the earth is massively older than the computed 6000 years.
The point of a prior distribution is to incorporate any principle or knowledge we have at hand. If we have no prior knowledge (that is, no evidence, no observation, a literal tabula rasa), the only guide we have in designing a prior is principles. That is the point I tried to make: before we look at the evidence, we can think about which hypotheses we would prefer. The problem is that we can now make another prior distribution for the principles used in designing a prior, and another one, ad infinitum. That is the fundamental problem of Bayesian reasoning: there is no canonical way to choose a prior. This is not as bad as it sounds, since given enough evidence the posterior will approach the true distribution.
Occam’s razor is an axiom. We can justify it by practicability considerations such as “simpler hypotheses are easier to compute”, but ultimately there is no “fundamental” reason for it. My last remark about atheism was just a side note to illustrate that one motivation for Occam’s razor could be belief in atheism, but as you have already noted, atheism itself has to be founded on some basis. That is the position of agnosticism: that which has no evidence speaking for or against it we cannot decide. Occam’s razor merely disregards hypotheses that assert the existence of entities with no empirical effects.