But a universal prior is doing something stronger than that: it gives (in a particular way) higher probabilities to simpler programs as well.
The only programs allowed in the Solomonoff distribution are ones that don’t have any extended versions that produce the same output observed so far. So it’s not that the longer programs are given lower probability—it’s that they are given undefined probability, and are entirely “represented” by the most truncated version.
I learned this by reading Anatoly Vorobey’s forthcoming article!
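A minimal sketch of that reading, in Python: only a program with no shorter prefix accounting for the data survives the filter, and its padded extensions are dropped rather than down-weighted. (The identity machine run(p) = p and the prefix-based minimality test are illustrative assumptions, not anything from Vorobey's article.)

```python
from itertools import product

def minimal_programs(programs, observed, run):
    # Keep programs whose output is consistent with the observed data
    # and that are not extensions of a shorter such program: each class
    # is represented only by its most truncated member.
    producing = {p for p in programs if run(p).startswith(observed)}
    return {p for p in producing
            if not any(p[:k] in producing for k in range(len(p)))}

# Toy demo on a hypothetical identity machine, run(p) = p:
programs = {''.join(bits) for n in range(1, 6)
            for bits in product('01', repeat=n)}
print(sorted(minimal_programs(programs, '01', lambda p: p)))
# -> ['01']; the padded versions '010', '011', '0100', ... are excluded
```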
How do you define an extended version of a program?
I agree with what you’re saying only as long as an “extended version” of a program is the same program, just padded with meaningless bits at the end.
I don’t agree that it’s true regarding longer programs in general. A “longer version” of a program is any program that produces the same output, including programs that use different algorithms, programs that are padded at the beginning, padded in the middle, etc.
To get the mathematical expression of the universal prior, we count all programs, and here the extended versions come into play. Intuitively, a program padded with a single bit is counted twice (as it may be padded with 0 or 1), a program padded with 2 bits is counted 4 times, etc. So a program of length L+k is 2^k times less likely than a program of length L. That’s where the 2^{-length(p)} probability comes from.
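That counting argument checks out numerically. Under the naive weighting that gives each length-n program weight 2^{-n}, the 2^k paddings of any length-L program jointly carry exactly the single weight 2^{-L}:

```python
# 2^k padded variants of a length-L program, each naively weighted
# 2^-(L+k), always sum to the weight 2^-L of the unpadded program.
L = 7
for k in range(6):
    assert (2 ** k) * 2.0 ** -(L + k) == 2.0 ** -L
print(2.0 ** -L)  # 0.0078125
```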
The only programs allowed in the Solomonoff distribution are ones that don’t have any extended versions that produce the same output observed so far.
Did not know that! It seems like that would leave some probability mass unassigned; how do you rebalance? Even if you succeed, it seems likely that (for large enough outputs) there’ll be lots of programs that differ only by epsilon, programs that are basically the same for all practical purposes.
The only programs allowed in the Solomonoff distribution are ones that don’t have any extended versions that produce the same output observed so far. So it’s not that the longer programs are given lower probability—it’s that they are given undefined probability, and are entirely “represented” by the most truncated version.
That depends on how one defines “Solomonoff induction”.
Did not know that! It seems like that would leave some probability mass unassigned; how do you rebalance? Even if you succeed, it seems likely that (for large enough outputs) there’ll be lots of programs that differ only by epsilon, programs that are basically the same for all practical purposes.
Normalize!
Solomonoff induction is defined only over binary data. Differences are a minimum of 1 bit, which is enough.
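To make “Normalize!” concrete: one standard rebalancing (due to Solomonoff) rescales the semimeasure bit by bit so that the two continuations of every prefix sum to the prefix’s mass. A sketch, with M as a hypothetical stand-in (the real universal semimeasure is only lower-semicomputable, so this is illustrative, not an implementation):

```python
def normalize(M, x: str) -> float:
    # Turn a semimeasure M on binary strings into a proper measure:
    # at each bit, redistribute the leaked mass so that the two one-bit
    # continuations of every prefix sum to the prefix's probability.
    p = 1.0
    for i, bit in enumerate(x):
        prefix = x[:i]
        z = M(prefix + '0') + M(prefix + '1')  # mass M actually assigns
        p *= M(prefix + bit) / z               # rebalanced step
    return p

# Toy deficient semimeasure: M(x0) + M(x1) = M(x)/2, so half the mass
# leaks at every bit; normalization recovers the uniform measure.
toy_M = lambda x: 2.0 ** (-2 * len(x))
print(normalize(toy_M, '0110'))  # 0.0625 == (1/2)**4
```

Since everything is over binary strings, the rebalancing happens one bit at a time, which fits the minimum-1-bit-difference point above.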