So, why use a prior generated like this? There are a number of arguments:
1) It seems that “Occam’s razor” is a fundamental part of inductive inference, and that the Kolmogorov approach formulates this successfully. “Simple” strings are ones that can be generated using short programs; “complex” strings are ones that can only be generated using long programs (or short programs that are then post-padded with a large string of input data).
This is less than convincing, because it reduces to Occam worship, which is the point you raise. Also, there are other ways of formulating Occam’s razor to measure simplicity of hypotheses. Most of these do reduce to special cases of Kolmogorov complexity though.
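(To make the "short program" notion in argument 1 concrete, here is a toy sketch in Python, not part of the original argument. The bit string below is hand-typed and just stands in for genuinely patternless data.)

```python
# A "simple" million-character string: the generating program is tiny
# compared with the string it produces (about 20 characters vs 1,000,000).
simple_program = 'print("01" * 500000)'

# A "complex" string: if the bits below came from coin flips, with no pattern
# for a program to exploit, the best we can generally do is embed the data
# verbatim, i.e. a trivial program "post-padded" with the string itself.
coin_flips = "011010011100101101110001"          # stand-in for patternless data
complex_program = 'print("' + coin_flips + '")'  # only a few characters longer than the data

print(len(simple_program), len("01" * 500000))
print(len(complex_program), len(coin_flips))
```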
2) Approximate computability, combined with open-mindedness. You need your prior to be in some sense computable, at least by a series of approximations, or you can’t actually use it.
The Solomonoff prior is computable in this weak sense: you can compute it from below by a series of increasing approximations. You run the computer in parallel over all its possible inputs, and every time it stops with output A after reading in n input bits, you increment the approximation to Ps(A) by 1/(2^n).
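Here is a sketch of that approximation procedure in Python, using a tiny made-up prefix machine as a stand-in for a real universal computer. (A real universal machine would also need step budgets and dovetailing, since some of its programs never halt; this toy one always does.)

```python
from itertools import product
from collections import defaultdict

def toy_machine(program):
    """Toy stand-in for a universal prefix machine (it is NOT universal):
    read a unary length code (k ones then a zero), then read k more bits
    and output them.  Returns (output, bits_consumed), or None if the
    machine runs out of input bits."""
    i = k = 0
    while i < len(program) and program[i] == 1:   # unary length code
        k += 1
        i += 1
    if i >= len(program):
        return None                               # no terminating 0
    i += 1                                        # consume the 0
    if i + k > len(program):
        return None                               # payload missing
    return tuple(program[i:i + k]), i + k

def approx_prior(max_len):
    """Increasing lower approximation to Ps(A): for every program of length
    n <= max_len that halts having read exactly its n bits, add 1/2^n to
    the weight of whatever it output."""
    P = defaultdict(float)
    for n in range(1, max_len + 1):
        for program in product((0, 1), repeat=n):
            result = toy_machine(program)
            if result is not None and result[1] == n:   # used exactly n bits
                output, _ = result
                P[output] += 2.0 ** -n
    return dict(P)

print(approx_prior(7))   # shorter outputs get exponentially more weight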
On the other hand, you also want your prior to give some weight to every hypothesis whose consequences you can compute. What this means is that for any probabilistic hypothesis H, where you can compute H(A), you want your prior to assign some probability h to H being correct (i.e. a true probabilistic law). This means there is some constant factor h such that in your prior you want P(A) >= h.H(A) for every possible output A.
It turns out that these two conditions require you to use a prior which is the Solomonoff prior for some universal computer. The only choice you get is which universal computer to pick. But since any universal computer can simulate any other (with a suitable program) your choice doesn’t affect the values Ps(A) by more than a constant multiplicative factor.
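Stated a bit more formally (this is just a restatement of the two paragraphs above, with the period written as explicit multiplication):

```latex
% Open-mindedness: the prior dominates every computable probabilistic law H.
\[ \forall H \ \text{computable}\ \ \exists h > 0\ \ \forall A:\quad P_S(A) \ \ge\ h \cdot H(A) \]

% Machine invariance: switching universal computer changes P_S by at most a
% constant factor (the cost of the program that simulates one machine on the other).
\[ \forall U, U' \ \text{universal}\ \ \exists c_{U,U'} > 0\ \ \forall A:\quad P_{S,U}(A) \ \ge\ c_{U,U'} \cdot P_{S,U'}(A) \]
```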
This seems to reflect what we want out of Bayesian reasoning. The choice of prior is to some extent subjective, but it is not that subjective. You can’t arbitrarily make up a prior and still be rational.
3) Decomposition of a hypothesis into independent binary postulates. Suppose we want to explain our evidence A (expressed as a binary string) by some sort of underlying hypothesis Ha. We would like to think of Ha as making a number of independent propositional claims about the world, each of which could just as easily be true as false. Basically each of its propositional claims is a brute irreducible fact or “bit”. So if Ha consists of brute propositional claims p1, …, pn, it seems we should assign it prior probability 1/2^n.
Now, suppose we want Ha to unambiguously explain A, so that there is no doubt at all that it does indeed explain A. Well in that case, we want the propositional claims included in Ha to logically imply that the evidence is A and not anything else. Interestingly, this works for probabilistic explanations as well. We just have to supplement the probabilistic hypothesis H with some additional binary propositions identifying where on the distribution function of H the actual A occurs. (Is it in the top half? Yes. Well is it in the top quartile then? No. Well is it between the 5th and 6th octile then? Yes… Continue until A is fixed uniquely).
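A sketch of that question-asking scheme in Python (the hypothesis and outcomes below are made up for illustration): it repeatedly halves the distribution function of H until the answers pin down the evidence, and the number of answers needed is roughly -log2 H(A).

```python
import math

def locate_by_halving(dist, target):
    """dist: list of (outcome, probability) pairs in a fixed order.
    Ask yes/no questions of the form 'is the target in the upper half of the
    remaining probability interval?' until the answers identify the target
    uniquely.  Returns the list of answers; these are the extra brute-fact
    bits that supplement the probabilistic hypothesis H."""
    # Each outcome owns a segment of H's cumulative distribution function.
    intervals, cum = {}, 0.0
    for outcome, p in dist:
        intervals[outcome] = (cum, cum + p)
        cum += p
    t_lo, t_hi = intervals[target]
    t_mid = (t_lo + t_hi) / 2.0

    answers, lo, hi = [], 0.0, 1.0
    # Halve until the remaining interval lies inside the target's segment,
    # i.e. the recorded answers logically entail that the outcome is `target`.
    while not (t_lo <= lo and hi <= t_hi):
        mid = (lo + hi) / 2.0
        in_upper_half = t_mid >= mid
        answers.append(in_upper_half)
        lo, hi = (mid, hi) if in_upper_half else (lo, mid)
    return answers

# A made-up probabilistic hypothesis H over four possible evidence strings.
H = [("00", 0.5), ("01", 0.25), ("10", 0.125), ("11", 0.125)]
for outcome, p in H:
    answers = locate_by_halving(H, outcome)
    print(outcome, len(answers), "answers;  -log2 H(outcome) =", -math.log2(p))
```

Note that since the hypothesis itself costs n brute bits (prior 1/2^n) and the location answers cost roughly -log2 H(A) more, the combined explanation gets prior weight on the order of (1/2^n).H(A), which is just the kind of dominance condition P(A) >= h.H(A) that appeared in argument 2.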
This logical entailment of A by Ha also means that there has to be some formal proof which allows us to deduce A from Ha. Now a nice fact about formal proofs is that there is a short algorithm which, whenever a proof exists, will eventually find it and verify it. So there is some computer which can take a description of Ha, use it to find and verify the proof that the evidence is A, and then output A. Further, this computer turns out to be another sort of universal computer, one for which the minimal number of bits needed to express a hypothesis corresponds to a minimal program for computing the consequences of that hypothesis. So we get back to the Kolmogorov prior again, though with a particular choice of universal computer (one which is a universal theorem prover).
There are other lines of argument as well. One of them is a pragmatic one, that the Solomonoff prior is in a strong sense optimal, in that it will converge on a true underlying distribution at least as fast as any other computable (or approximately computable) prior, after some initial finite overhead of evidence. So even if you happen to be remarkably lucky with a different prior, and it converges on true facts and physical laws really quickly, the Solomonoff prior will still catch up with you fast.
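For reference, the usual way that optimality claim is made quantitative (as in the Li and Vitanyi text cited below) is a bound roughly of the following shape, where M is the Solomonoff prior used as a sequence predictor, mu is the true computable distribution, and K(mu) is the length of the shortest program computing mu. The exact constant c depends on the formulation, so take this as a sketch of the form of the result rather than a precise citation:

```latex
% Cumulative expected squared prediction error is finite, bounded by a
% constant multiple of the complexity of the true distribution.
\[ \sum_{t=1}^{\infty} \mathbf{E}_{\mu}\!\left[ \big( M(x_t = 1 \mid x_{<t}) - \mu(x_t = 1 \mid x_{<t}) \big)^2 \right] \;\le\; c \cdot K(\mu) \]
```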
Thanks for the summaries. In “P(A) >= h.H(A)”, does the period stand for multiplication, and does H(A) mean the probability of A given H?
Argument 2 is extremely powerful. Where can I find the argument spelled out?
Yes, and yes.
A classic text is Li and Vitanyi, “An Introduction to Kolmogorov Complexity and Its Applications”; see: http://homepages.cwi.nl/~paulv/kolmogorov.html
If you want an online (condensed) version but without the formal proofs, then have a look at: http://www.scholarpedia.org/article/Algorithmic_probability