This is a completely different explanation, and I am not sure about its correctness. But it may help your intuition.
Imagine that you randomly generate programs in a standard computer language. And—contrary to the S.I. assumption—let’s say that all those programs have the same length N.
Let’s try a few examples. For simplicity, let’s suppose that the input is in variable x, and the programming language has only two commands: “x++” (increases x by one) and “return x” (returns the current value of x as a result). There are four possible programs:
x++; x++;
x++; return x;
return x; x++;
return x; return x;
The first program does not return a value. Let’s say that it is not a valid program, and remove it from the pool. The second program returns the value “x+1”. The third and fourth programs both return the value “x”, because both are equivalent to one-line program “return x”.
It seems that the shorter programs have an advantage, because a longer program can give the same results as the shorter program. More precisely, for every short program, there are infinitely many equivalent longer programs; as you increase the maximum allowed length N, you increase their numbers exponentially (though not necessarily 2^x exponentially; this is just a result of our example language having 2 instructions). This is why in this example the one-line program won the competition, even if it wasn’t allowed to participate.
OK, this is not a proof, just an intuition. Seems to me that even if you use something else instead of S.I., the S.I. may appear spontaneously anyway; so it is a “natural” distribution. (Similarly, the Gauss curve is a natural distribution for adding independent random variables. If you don’t like it, just choose a different curve F(x), and then calculate what happens if you add together thousand or million random numbers selected by F(x)… the result will be more and more similar to the Gauss curve.)
The algorithmic probability of a string is the probability that the binary sequence will be obtained when you feed infinite string of random digits on input tape to universal prefix Turing machine (look up prefix Turing machine for more info).
The sum weighted by powers of two is simply the alternative way of writing it down as the probability of obtaining specific prefix of length l via coin flips is 2^-l , in the model whereby you split the string into some ‘prefix’ that is executed as program, and the ‘data’. This division is pretty arbitrary as the prefix may be an interpreter.
AFAIK, Solomonoff induction can be specified in such way: probability of seeing 1 after [string] is algorithmic probability ( [ string ] 1 ) / algorithmic_probability ( [ string ] )
Further reading: papers by Marcus Hutter.
edit: i’ve always said that markdown is for people who use italics more than algebra.
edit2: and if its not clear from the post, the programs are self delimiting; this is what specifies the end of program. Prefix turing machine does not have special ‘stop’ marker. The end of program is largely in the eye of the beholder; you can make a prefix that would proceed to read the input tape and output the digits from input tape; you can view it as a short program, or you can view it as a set of of infinitely long programs.
This is a completely different explanation, and I am not sure about its correctness. But it may help your intuition.
Imagine that you randomly generate programs in a standard computer language. And—contrary to the S.I. assumption—let’s say that all those programs have the same length N.
Let’s try a few examples. For simplicity, let’s suppose that the input is in variable x, and the programming language has only two commands: “x++” (increases x by one) and “return x” (returns the current value of x as a result). There are four possible programs:
x++;
x++;
x++;
return x;
return x;
x++;
return x;
return x;
The first program does not return a value. Let’s say that it is not a valid program, and remove it from the pool. The second program returns the value “x+1”. The third and fourth programs both return the value “x”, because both are equivalent to one-line program “return x”.
It seems that the shorter programs have an advantage, because a longer program can give the same results as the shorter program. More precisely, for every short program, there are infinitely many equivalent longer programs; as you increase the maximum allowed length N, you increase their numbers exponentially (though not necessarily 2^x exponentially; this is just a result of our example language having 2 instructions). This is why in this example the one-line program won the competition, even if it wasn’t allowed to participate.
OK, this is not a proof, just an intuition. Seems to me that even if you use something else instead of S.I., the S.I. may appear spontaneously anyway; so it is a “natural” distribution. (Similarly, the Gauss curve is a natural distribution for adding independent random variables. If you don’t like it, just choose a different curve F(x), and then calculate what happens if you add together thousand or million random numbers selected by F(x)… the result will be more and more similar to the Gauss curve.)
The algorithmic probability of a string is the probability that the binary sequence will be obtained when you feed infinite string of random digits on input tape to universal prefix Turing machine (look up prefix Turing machine for more info).
The sum weighted by powers of two is simply the alternative way of writing it down as the probability of obtaining specific prefix of length l via coin flips is 2^-l , in the model whereby you split the string into some ‘prefix’ that is executed as program, and the ‘data’. This division is pretty arbitrary as the prefix may be an interpreter.
AFAIK, Solomonoff induction can be specified in such way: probability of seeing 1 after [string] is algorithmic probability ( [ string ] 1 ) / algorithmic_probability ( [ string ] )
Further reading: papers by Marcus Hutter.
edit: i’ve always said that markdown is for people who use italics more than algebra.
edit2: and if its not clear from the post, the programs are self delimiting; this is what specifies the end of program. Prefix turing machine does not have special ‘stop’ marker. The end of program is largely in the eye of the beholder; you can make a prefix that would proceed to read the input tape and output the digits from input tape; you can view it as a short program, or you can view it as a set of of infinitely long programs.