How Bayes’ theorem is consistent with Solomonoff induction
You’ve read the introduction to Bayes’ theorem. You’ve read the introduction to Solomonoff induction. Both describe fundamental theories of epistemic rationality. But how do they fit together?
It turns out that it’s pretty simple. Let’s take a look at Bayes’ theorem.
$$P(H|E) = \frac{P(E|H)\,P(H)}{\sum_i P(E|H_i)\,P(H_i)}$$
For a review (a small numeric sketch follows this list):
H is the particular hypothesis in question.
E is the evidence, or data that we have observed.
P(H) is the “prior” probability of the hypothesis alone.
P(E|H) is the probability of seeing the evidence given that the hypothesis is true.
P(H|E) is the probability that the hypothesis is true, given that you’ve seen the evidence.
H_i is an arbitrary element in the set of all hypotheses.
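To make the review concrete, here is a minimal sketch in Python. The two hypotheses, their priors, and their likelihoods are invented purely for illustration; they are not from the text above.

```python
# A minimal sketch of Bayes' theorem over a finite hypothesis set.
# The hypotheses and numbers here are made up for illustration only.

def posterior(priors, likelihoods):
    """Return P(H_i|E) for each hypothesis, given P(H_i) and P(E|H_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(E|H_i) * P(H_i)
    total = sum(joint)                                     # the denominator
    return [j / total for j in joint]

# Two made-up hypotheses with priors 0.7 and 0.3, and likelihoods
# 0.1 and 0.8 for some observed evidence E.
print(posterior([0.7, 0.3], [0.1, 0.8]))  # -> [0.2258..., 0.7741...]
```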
In terms of Solomonoff induction (a sketch of these pieces follows this list):
H is a particular binary sequence (a program) which we input to the universal Turing machine; the set of all hypotheses is the set of all such binary sequences.
E is the binary sequence of data that we are given to match.
P(H) is 2^(-l(H)), where l(H) is the length of the binary sequence H. This is a basic premise of Solomonoff induction.
P(E|H) is the probability that we will see data sequence E, given that we run program H on the universal Turing machine. Because the machine is deterministic, this is either 1, if H outputs E, or 0, if it does not.
P(H|E) is still what we’re looking for. Of course, if H does not output E, then P(E|H) = 0, which means P(H|E) = 0. This makes sense: a program that does not output the data you have cannot be the true program that produced that data.
H_i is an arbitrary binary sequence.
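Here is a sketch of the prior and the 0-or-1 likelihood in Solomonoff terms. The `run_on_utm` argument and the toy "machine" below are hypothetical stand-ins for running a program on the universal Turing machine; they are assumptions of this sketch, not part of the text.

```python
# A sketch of the Solomonoff prior and likelihood, under toy assumptions:
# the "machine" here simply treats a program's bits as its own output.

def prior(program: str) -> float:
    """P(H) = 2^(-l(H)), where l(H) is the program's length in bits."""
    return 2.0 ** -len(program)

def likelihood(evidence: str, program: str, run_on_utm) -> float:
    """P(E|H): 1 if running H yields E, else 0, since the machine is deterministic."""
    return 1.0 if run_on_utm(program) == evidence else 0.0

identity_machine = lambda program: program         # toy stand-in, not a real UTM
print(prior("101"))                                # 0.125
print(likelihood("101", "101", identity_machine))  # 1.0
print(likelihood("101", "110", identity_machine))  # 0.0
```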
The denominator has the same meaning as the numerator, except summed over every possible hypothesis. This sum normalizes the probability in the numerator. Any hypothesis that does not match the data E exactly gives P(E|H_i) = 0, so that term contributes nothing to the sum. Any hypothesis that does output E exactly gives P(E|H_i) = 1, and contributes its weight to the normalizing sum in the denominator.
Let’s see the formula with these things substituted. Here, the H_i in the denominator range over just the hypotheses that match, since every other term is zero.
$$P(H|E) = \frac{2^{-l(H)}}{\sum_i 2^{-l(H_i)}}$$
In summary: Bayes’ theorem says that once we find all matching hypotheses, we can find each one’s probability by dividing its individual weight of 2^(-l(H)) by the sum of the weights of all the matching hypotheses.
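A sketch of that substituted formula, again under toy assumptions: the `outputs` function below is a hypothetical stand-in for the universal Turing machine (it just returns a program's last two bits), and the program strings are invented for illustration.

```python
# Among the programs that output E, each program H gets weight 2^(-l(H)),
# normalized by the sum of the weights of all matching programs.

def solomonoff_posteriors(programs, evidence, outputs):
    matching = [h for h in programs if outputs(h) == evidence]
    denom = sum(2.0 ** -len(h) for h in matching)           # normalizing sum
    return {h: (2.0 ** -len(h)) / denom for h in matching}

programs = ["01", "101", "001", "110"]
print(solomonoff_posteriors(programs, "01", outputs=lambda h: h[-2:]))
# Shorter matching programs get more weight:
# {'01': 0.5, '101': 0.25, '001': 0.25}
```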
This is intuitive, and matches Bayes’ theorem both mathematically and philosophically. Updating occurs when you get more bits of evidence E. The new bits eliminate some of the hypotheses H_i, which makes the normalizing sum in the denominator smaller, so the probability of each surviving hypothesis goes up.
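Here is a sketch of that updating process. As an assumption of this toy example, a hypothesis survives while its output is still consistent with the bits seen so far, and programs again serve as their own outputs; each new bit eliminates some programs, the normalizing sum shrinks, and the survivors' posteriors grow.

```python
# Updating as evidence bits arrive, under the toy assumptions above.

def surviving(programs, evidence_so_far):
    """Programs whose output is still consistent with the observed bits."""
    return [h for h in programs if h.startswith(evidence_so_far)]

programs = ["0", "00", "01", "010", "011"]
for n_bits in range(1, 4):
    e = "010"[:n_bits]                          # evidence observed so far
    alive = surviving(programs, e)
    denom = sum(2.0 ** -len(h) for h in alive)  # normalizing sum shrinks
    print(e, alive, denom)
# "0"   -> 5 programs survive, denom = 1.25
# "01"  -> 3 survive,          denom = 0.5
# "010" -> 1 survives,         denom = 0.125
```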