It is important here to distinguish between two models of SI: there is one which regards the universal prior as a probability distribution over programs that generate a definite output, and there is another one that considers the universal prior over a set of computable distributions (that must contain the correct one). The first SI, given a random string (that is, incompressible), will generate hypothesis with the same length of the string, since it’s constantly pruning those hypothesis that doesn’t match exactly with the given input. The second SI, given a random string (that is, drawn from a uniform distribution), will with probability 1 assign a very high probability to the uniform distribution.
I’m not sure why we need to make that distinction. The Solomonoff and Levin constructions are equivalent. The prior built from all deterministic programs that output bit strings, and the prior built from all computable probability distributions, turn out to be the same prior. See e.g. here for proofs and references.
They’re equivalent from the point of view of a consumer of the prediction, they’re not equivalent from the point of view of an implementation. And since this is a discussion about how does it work, the distinction is useful.
Then I’m confused, because the two would seem to produce two very different answers on the same string. Since a string with very high Kolmogorov complexity can be clearly produced by a uniform distribution, the Solomonoff prior would converge to a very high complexity hypothesis, while the Levin mixture would just assign 0.5 to 0 and 0.5 to 1. What am I missing here?
The Solomonoff prior would have many surviving hypotheses at each step, and the total weight of those that predict a 0 for the next bit would be about equal to the total weight of those that predict a 1. If the input distribution is biased, e.g. 0 with probability 5⁄6 and 1 with probability 1⁄6, then the Solomonoff prior will converge on that as well. That works for any computable input distribution, with probability 1 according to the input distribution.
It is important here to distinguish between two models of SI: there is one which regards the universal prior as a probability distribution over programs that generate a definite output, and there is another one that considers the universal prior over a set of computable distributions (that must contain the correct one).
The first SI, given a random string (that is, incompressible), will generate hypothesis with the same length of the string, since it’s constantly pruning those hypothesis that doesn’t match exactly with the given input.
The second SI, given a random string (that is, drawn from a uniform distribution), will with probability 1 assign a very high probability to the uniform distribution.
For posterity, the convention is to call the two models Universal/Solomonoff prior M and Universal/Levin mixture ξ, respectively.
I’m not sure why we need to make that distinction. The Solomonoff and Levin constructions are equivalent. The prior built from all deterministic programs that output bit strings, and the prior built from all computable probability distributions, turn out to be the same prior. See e.g. here for proofs and references.
They’re equivalent from the point of view of a consumer of the prediction, they’re not equivalent from the point of view of an implementation. And since this is a discussion about how does it work, the distinction is useful.
Then I’m confused, because the two would seem to produce two very different answers on the same string.
Since a string with very high Kolmogorov complexity can be clearly produced by a uniform distribution, the Solomonoff prior would converge to a very high complexity hypothesis, while the Levin mixture would just assign 0.5 to 0 and 0.5 to 1.
What am I missing here?
The Solomonoff prior would have many surviving hypotheses at each step, and the total weight of those that predict a 0 for the next bit would be about equal to the total weight of those that predict a 1. If the input distribution is biased, e.g. 0 with probability 5⁄6 and 1 with probability 1⁄6, then the Solomonoff prior will converge on that as well. That works for any computable input distribution, with probability 1 according to the input distribution.
nitpick: the prior does not converge, the prior is what you have before you start observing data, then it is a posterior.
Many thanks, I get it now.
What matters is the probability that they assign to the next bit being equal to one.