I was confused about Solomonoff induction a while ago. Since the code that produces the observed string could sit anywhere inside whatever program is actually running, why would shorter programs be more likely to have produced it? My understanding of the answer I received was that, since the Turing machine produces its output by running the program from the beginning, a program with extra code before the piece that produces the observed string would have produced a different string. This made sense at the time, but since then I’ve thought of a variant of the problem involving not knowing the full length of the string, and I don’t think that answer addresses it.
The code that produces the string can be arbitrarily long, and when we try to apply the principles of Solomonoff induction as a general means of induction outside of computer science, we often can’t observe the full string that the underlying program may have produced (for example, when looking for laws of physics, or for the source of some event that happened in an uncontained, low-surveillance environment). So why is a shorter program more likely? The true program could be a billion times the length of the shortest program that produces the observed string, and could be producing a ton of unobserved effects. I could wave my hands, say something about Occam’s razor, and move on, but I thought Solomonoff induction was supposed to explain Occam’s razor.
For the prior probabilities you assign to programs to be well-formed, they must go to 0 as the length of the program goes to infinity; otherwise the probabilities won’t add up to 1.
No matter what prior you choose, you end up with some variant of Occam’s razor: beyond a certain length, programs are all assigned very low probability.
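As a concrete illustration (a toy prior, not the actual Solomonoff prior, which is defined over prefix-free programs): give the set of length-L binary programs a total mass of 2^(-(L+1)), split evenly among the 2^L programs of that length. The per-length masses sum to 1, and the combined mass of every program longer than some cutoff shrinks geometrically.

```python
# Toy length-based prior over binary programs (illustration only, not the
# exact Solomonoff prior): all programs of length L share total mass
# 2^-(L+1), divided evenly among the 2^L programs of that length.

def mass_of_length(L):
    """Total prior mass shared by every program of length L."""
    return 2.0 ** -(L + 1)

def per_program_mass(L):
    """Prior probability of any single program of length L."""
    return mass_of_length(L) / (2 ** L)

total = sum(mass_of_length(L) for L in range(200))
tail = sum(mass_of_length(L) for L in range(31, 200))

print(f"total mass:            {total:.6f}")  # ~1.0, so the prior is well-formed
print(f"mass beyond length 30: {tail:.3e}")   # ~4.7e-10: long programs
                                              # collectively get almost nothing
```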
You don’t have to make the probabilities decrease proportionally to 2^(-length), the way the Solomonoff prior does; you could instead use, say, 3^(-length) or 1/Ackermann(length). However, since each additional bit lets you pick out a value from a space that’s twice as big (i.e. the precision grows like 2^length), 2^(-length) is a natural penalty to choose, as the quick check below spells out.
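A quick check of that point, assuming a uniform split among programs of each length: halving the per-program weight for each extra bit exactly offsets the doubling of the number of programs of that length.

```python
# Each extra bit doubles the number of distinct programs, so halving the
# per-program weight (the 2^-length penalty) keeps the total weight spent
# on each length constant.
for L in range(1, 6):
    n_programs = 2 ** L
    per_program = 2.0 ** -L
    print(L, n_programs, per_program, n_programs * per_program)  # last column stays 1.0
```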
(Personally, I prefer a prior that also lightly penalizes running time. That removes the computability issues and makes the induction approximable by dovetailing.)
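To sketch what that dovetailing pattern looks like (a minimal toy, with Python callables standing in for programs on a universal machine, and an illustrative 2^-(length + steps) weight rather than any standard speed prior): admit one new program per round, give every admitted program one more step per round, and accumulate weight on whatever each program has output so far.

```python
from collections import defaultdict

def make_program(n):
    """Toy stand-in for program number n: it 'outputs' the binary expansion
    of n, one bit per step, repeating forever; its length is the bit count."""
    bits = bin(n)[2:]
    def step(t):
        return bits[t % len(bits)]
    return len(bits), step

def dovetail(n_rounds=40):
    """Dovetailed execution: in round r, admit program r and give every
    program admitted so far one more step. Each program's output-so-far is
    weighted by 2^-(length + steps used); the steps term is the (here quite
    heavy-handed) running-time penalty."""
    programs = []                    # entries: [length, step_fn, output, steps]
    weights = defaultdict(float)
    for r in range(1, n_rounds + 1):
        length, step = make_program(r)
        programs.append([length, step, "", 0])
        for entry in programs:       # one more step for everyone admitted so far
            entry[2] += entry[1](entry[3])
            entry[3] += 1
            weights[entry[2]] += 2.0 ** -(entry[0] + entry[3])
    return weights

# The totals keep growing as rounds go on, but because long programs and long
# runs get geometrically small weight, later rounds change them less and less.
top = sorted(dovetail().items(), key=lambda kv: -kv[1])[:5]
print(top)
```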
Solomonoff induction is a formalization of Occam’s razor.
You may find this article useful.