Please correct me if I am wrong, but I feel that there is a … well, not a mistake … an assumption that is not necessarily true. What I mean is the following. Let us consider the space of all possible inputs and the space of all possible outputs of the Turing machine (yeah, both are infinite-dimensional, who cares). The data (our Universe) lives in the space of outputs; the theory to test lives in the space of inputs. Now, before making any assumptions about the data and the theory, what is the probability that an arbitrarily chosen input of length n leads to an output of length N (since the output is all the observed data from our Universe, N is pretty large)? That is the prior probability, correct?
No? Perhaps you were trying to do something else, but the above is not a description of Solomonoff induction.
Where exactly is the faulty assumption here?
In Solomonoff induction, the observations of the universe (the evidence) are the inputs. We also enumerate all possible algorithms (the hypotheses modeling the universe) and for each algorithm run it to see if it produces the same evidence observed. As we gain new bits of evidence, we discard any hypothesis that contradicts the evidence observed so far, because it is incorrect.
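To make that discard-on-contradiction loop concrete, here is a toy sketch in Python that I am adding for illustration; it is not part of the original exchange. The “hypotheses” are ordinary generator functions standing in for enumerated programs, and the observed bits are made up.

```python
# Toy illustration only: generators stand in for enumerated programs,
# and the observed bits are invented for the example.

def all_zeros():
    while True:
        yield 0

def all_ones():
    while True:
        yield 1

def alternating():
    bit = 0
    while True:
        yield bit
        bit ^= 1

def consistent(hypothesis, evidence):
    """Run the hypothesis and check that it reproduces every observed bit so far."""
    run = hypothesis()
    return all(next(run) == bit for bit in evidence)

observed = [0, 1, 0, 1]                      # the evidence gathered so far
hypotheses = [all_zeros, all_ones, alternating]

# Discard every hypothesis that contradicts the evidence observed so far.
surviving = [h for h in hypotheses if consistent(h, observed)]
print([h.__name__ for h in surviving])       # -> ['alternating']
```

A real Solomonoff inductor would enumerate all programs for a universal machine (and would have to cope with programs that never halt); the sketch only shows the filtering step described above.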
The fact that we are always looking for something simpler is an assumption of simplicity. Our Universe apparently happens to be governed by a set of simple laws, so it works. However, this is an assumption, or an axiom. It is not a corollary of some math; going by the math alone, the prior should favor an awfully complex hypothesis.
What probability should you assign to the proposition that the next observed bit will be a 1? How should we choose between the infinitely many remaining models that have not yet contradicted observations? That’s the question of priors. We have to weight them with some probability distribution, and, when normalized, the weights must sum to a probability of 100%, by definition of “probability”. We obviously can’t give them all equal weight, or the sum would be infinite. Giving them increasing weights would also blow up. Therefore, in the limit, the probabilities must decrease as we enumerate the hypotheses.
To be more precise, for every ϵ > 0, there is some length l such that the total probability of all programs longer than l is at most ϵ.
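As a numerical illustration of that tail condition (my own toy numbers, assuming a simple 2^−n weighting over lengths rather than the actual Solomonoff measure over programs): the total weight beyond length l is 2^−l, so for any ϵ you can pick l ≥ log2(1/ϵ).

```python
import math

# Toy weighting: assign total weight 2**-n to "length n" (a stand-in for the
# real prefix-free program measure). The weight of everything strictly longer
# than l is then sum_{n > l} 2**-n = 2**-l.

def tail_weight(l):
    """Total weight carried by all lengths strictly greater than l."""
    return 2.0 ** -l

def length_for_tail(epsilon):
    """Smallest l such that the tail weight beyond l is at most epsilon."""
    return math.ceil(math.log2(1.0 / epsilon))

for eps in (0.1, 0.01, 1e-6):
    l = length_for_tail(eps)
    print(eps, l, tail_weight(l))   # the tail weight is always <= eps
```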
Otherwise, I can say that the probability decays logarithmically with complexity, and you will say that it decays exponentially, and we will get totally different prior probabilities and totally different results.
Can you? It’s not enough that it decays; it must decay fast enough that the sum doesn’t diverge to infinity. Faster than the terms of the harmonic series (whose partial sums diverge logarithmically), for example.
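A quick numerical check of that point (again, just a sketch I am adding, not part of the original exchange): harmonic-style weights 1/n never settle to a finite total, while exponentially decaying weights 2^−n do, which is exactly what lets you normalize them into a probability distribution.

```python
# Partial sums of 1/n (decays too slowly) versus 2**-n (decays fast enough).

def partial_sum(weight, n_terms):
    return sum(weight(n) for n in range(1, n_terms + 1))

for n_terms in (10, 1_000, 100_000):
    harmonic = partial_sum(lambda n: 1.0 / n, n_terms)
    geometric = partial_sum(lambda n: 2.0 ** -n, n_terms)
    print(f"{n_terms:>7}  harmonic={harmonic:8.3f}  geometric={geometric:.6f}")

# The harmonic column keeps growing (roughly like ln n); the geometric column
# converges to 1, so it can serve as a normalized prior over hypotheses.
```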
Solomonoff’s prior is optimal in some sense, but it is not uniquely valid. Other decaying distributions could converge on the correct model, but more slowly. The exact choice of probability distribution is not relevant to our discussion here, as long as we use a valid one.