I have been thinking that the universal prior is tautological. Given a program, there are an infinite number of longer programs which perform the same computation (or an indistinguishable variation) but only a finite number of shorter programs having this characteristic. If you count every possible program equally, you’ll find that each short program represents a host of longer programs. However, now that I write this down, I’m no longer sure about it. Can someone say why/if it’s wrong?
[EDIT note: This is completely different from what I originally wrote in response to lavalamp’s question, because originally I completely misunderstood it. Sorry.]
You can’t “count every possible program equally”. (What probability will you give each possible program? If it’s positive then your total probability will be infinite. If it’s zero then your total probability will be zero. You can do a lot of probability-like things on a space with infinite total measure, in which case you could give every program equal weight, but that’s not generally what one does.)
Leaving that aside: Your argument suggests that whatever probabilities we give to programs, the resulting probabilities on outcomes will end up favouring outcomes that can be produced by simpler programs. That’s true. But a universal prior is doing something stronger than that: it gives (in a particular way) higher probabilities to simpler programs as well. So outcomes generated by simpler programs will (so to speak) be doubly favoured: once because those simple programs have higher probabilities, and once because there are “more” programs that generate those outcomes.
In fact, any probability assignment with finite total probability (in particular, any with total probability 1, which is of course what we usually require) must “in the limit” give small probabilities to long programs. But a universal prior is much more specific, and says how program length corresponds to probability.
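To make the “doubly favoured” point concrete, here is a minimal Python sketch. The toy_machine below is a made-up stand-in for a real universal machine, and it ignores the prefix-free bookkeeping a real construction needs; the point is only that weighting each program p by 2^{-length(p)} and summing makes outputs reachable by short programs accumulate the most mass.

```python
from collections import defaultdict
from itertools import product

def toy_machine(program: str) -> str:
    """Made-up stand-in for a universal machine: every bit-string is a
    'program', and its 'output' is the string with trailing zeros
    stripped, so many long programs produce the same short output."""
    return program.rstrip("0")

# Weight each program p by 2^-length(p) and accumulate mass per output.
mass = defaultdict(float)
for n in range(1, 16):                      # all programs up to 15 bits
    for bits in product("01", repeat=n):
        p = "".join(bits)
        mass[toy_machine(p)] += 2.0 ** -len(p)

# Outputs whose shortest program is short collect the most mass.
for output, m in sorted(mass.items(), key=lambda kv: -kv[1])[:5]:
    print(repr(output), round(m, 4))
```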
But a universal prior is doing something stronger than that: it gives (in a particular way) higher probabilities to simpler programs as well.
The only programs allowed in the Solomonoff distribution are ones that don’t have any extended versions that produce the same output observed so far. So it’s not that the longer programs are given lower probability—it’s that they are given undefined probability, and are entirely “represented” by the most truncated version.
I learned this by reading Anatoly Vorobey’s forthcoming article!
How do you define an extended version of a program?
I agree with what you’re saying only as long as an “extended version” of a program is the same program, just padded with meaningless bits at the end.
I don’t agree that it’s true of longer programs in general. A “longer version” of a program is any program that produces the same output, including programs that use different algorithms, programs that are padded at the beginning, padded in the middle, etc.
To get the mathematical expression of the universal prior, we count all programs, and here the extended versions come into play. Intuitively, a program padded with a single bit is counted twice (as it may be padded with 0 or 1), a program padded with 2 bits is counted 4 times, etc. So a program of length L+k is 2^k times less likely than a program of length L. That’s where the 2^{-length(p)} probability comes from.
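A quick sanity check on that counting (a sketch; L and the range of k are chosen arbitrarily): each extra padding bit doubles the number of padded versions but halves the weight of each, so for every k the 2^k versions padded with exactly k bits together carry the same weight 2^{-L} as the unpadded program.

```python
# Padding check: a program of length L padded with k "don't care" bits
# has 2**k padded versions, each weighted 2**-(L + k); for each k those
# versions together contribute exactly 2**-L.
L = 7
for k in range(5):
    total = (2 ** k) * 2.0 ** -(L + k)
    print(k, total)  # prints 2**-L == 0.0078125 for every k
```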
The only programs allowed in the Solomonoff distribution are ones that don’t have any extended versions that produce the same output observed so far.
Did not know that! It seems like that would leave some probability mass unassigned; how do you rebalance? Even if you succeed, it seems likely that (for large enough outputs) there’ll be lots of programs that differ only by epsilon, i.e. programs that are basically the same for all practical purposes.
The only programs allowed in the Solomonoff distribution are ones that don’t have any extended versions that produce the same output observed so far. So it’s not that the longer programs are given lower probability—it’s that they are given undefined probability, and are entirely “represented” by the most truncated version.
That depends on how one defines “Solomonoff induction”.
According to the Scholarpedia article, it’s not required that the probabilities add up to 1:
$\sum_x m(x) < 1$
It’s simpler to define the universal prior in a way in which non-halting programs are “not counted”. So the sum is not over all programs, just the halting ones.
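Spelled out (my paraphrase of the standard construction, with U a universal prefix machine and the sum running only over programs that halt with output x), the definition and the reason the total falls short of 1 look like this:

```latex
% m(x): total weight of the halting programs that output x.
% The halting programs of a prefix machine form a prefix-free set, so
% by the Kraft inequality the grand total is at most 1; it is strictly
% less than 1 because some programs never halt.
\[
  m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|},
  \qquad
  \sum_x m(x) \;=\; \sum_{p \,:\, U(p)\ \text{halts}} 2^{-|p|} \;<\; 1 .
\]
```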
You can’t “count every possible program equally”. (What probability will you give each possible program? If it’s positive then your total probability will be infinite. If it’s zero then your total probability will be zero. You can do a lot of probability-like things on a space with infinite total measure, in which case you could give every program equal weight, but that’s not generally what one does.)
You can also do probability-like things on an infinite space where each finite subset has measure zero (for example, the uniform measure on infinite bit sequences, under which every individual sequence has probability zero).
It seems like that would leave some probability mass unassigned; how do you rebalance?
Normalize!
Solomonoff induction is just defined for binary data. Differences are a minimum of 1 bit, which is enough.
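For concreteness, the simplest reading of “normalize” (a sketch; other normalization conventions exist in the literature, and the constant Z below, while well defined, is not computable):

```latex
% Rescale by the total mass so the rebalanced distribution sums to 1.
\[
  Z \;=\; \sum_y m(y) \;<\; 1,
  \qquad
  m'(x) \;=\; \frac{m(x)}{Z},
  \qquad
  \sum_x m'(x) \;=\; 1 .
\]
```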
According to the Scholarpedia article, it’s not required that the probabilities add up to 1.
I see, thanks!
I did know this and should have phrased my sentence hypothetically. :)