Use a prefix-free encoding for the hypotheses. There’s not 2^n hypotheses of length n: Some of the length-n bitstrings are incomplete and you’d need to add more bits in order to get a hypothesis; others are actually a length <n hypothesis plus some gibberish on the end.
Then the sum of the probabilities of all programs of all lengths combined is 1.0. After excluding the programs that don’t halt, the normalization constant is Chaitin’s Omega.
Unfortunately Chaitin’s Omega’s incomputable, but even if it wasn’t I don’t see how it would work as a normalizing constant. Chaitin’s Omega is a real number, there is an infinite number of hypotheses, and (IIRC) there is no real number r such that r multiplied by infinite equals one, so I don’t see how Chaitin’s Omega could possible work as a normalizing constant.
If we assign a value of 2^-n to each hypothesis (that is, each halting program) of length n, then the total value summed over all hypotheses is Chaitin’s Omega.
Thus, if we assign a value of 2^-n/Omega to each hypothesis, the total is 1, and this gives us a valid probability distribution.
I still don’t see how that would make all the hypotheses sum to 1. Wouldn’t that only make the probabilities of all the hypotheses of length n sum to 1, and thus make the sum of all hypotheses sum to > 1? For example, consider all the hypotheses of length 1. Assuming Omega = 1 for simplicity, there are 2 hypotheses, each with a probability of 2^-1/1 = 0.5, making all them sum to 1. There are 4 hypotheses of length 2, each with a probability of 2^-2/1 = 0.25, making them sum to 1. Thus, the sum of the probabilities of all hypotheses of length ⇐ 2 = 2 > 1.
Is Omega doing something I don’t understand? Would all hypotheses be required to be some set length?
First: the hypotheses/programs (recall: we are representing hypotheses by computer programs in some suitable language) need to be written in a “prefix-free” representation, meaning one in which e.g. if 100111 is a valid program then nothing else beginning 100111 can be. So you don’t get two programs of length 1, four of length 2, etc. If you do this, and if you also arrange your coding so that every infinite string of bits starts with some legal program, then the sum over all legal programs of 2^-length is exactly 1. (So Pr(program) = 2^-length(program) defines a probability distribution over programs. It’s the same probability distribution as you get from the following procedure: keep generating random bits until the bits you’ve generated form a legal program.)
Second: you can’t just take Ω=1 “for simplicity”; Ω is defined to be what you get by adding up 2^-length for all programs that halt. (Remember that hypotheses are being represented as computer programs here.) Equivalently, if you imagine picking a random well-formed program by choosing random bits, Ω is the probability that the resulting program halts. So it’s certainly <1.
[EDITED to fix an ugly but hopefully not confusing typo.]
The hypotheses are encoded in a prefix-free way: no hypothesis is a prefix of another hypothesis.
In particular, this eliminates your example: if both strings of length 1 (“0” and “1“) are valid hypotheses, then no string of longer length can be a valid hypothesis (then we can take Omega=1 and assign a probability of 1⁄2 to each). Or we could have “0”, “10”, and “110” be valid hypotheses; then we can take Omega = 7⁄8 and assign probabilities of 4⁄7, 2⁄7 and 1⁄7 to these.
In general, the prefix-free condition ensures that the sum over all hypotheses converges, by Kraft’s inequality, which is the real heavy lifter here; the constant is merely there to make the sum over all hypotheses converge to 1 instead.
You might imagine that hypotheses are required to be grammatically correct English sentences (I guess encoded in ASCII or something). In that case, no hypothesis is a prefix of another, because each hypothesis ends with a period.
Right; I forgot that it used a prefix-free encoding. Apologies if the answer to this is painfully obvious, but does having a prefix-free encoding entail that there is a finite number of possible hypotheses?
It doesn’t: for example, the set of strings 0, 10, 110, 1110, 11110, 111110, … is infinite and prefix-free. So is the set of C-style (null-terminated) strings.
It does. The infinite sum will always converge, so by normalizing we can make it converge to 1.
Take gjm’s approach to explaining the normalization, in which the initial weight of 2^-length assigned to a hypothesis is the probability that you obtain that hypothesis by choosing bits at random. Then the normalization is just a conditional probability: you divide by the probability that, when choosing bits at random, you do eventually end up hitting a hypothesis.
Remark: these are special cases of the schema “all strings that contain exactly one instance of pattern X in possible positions Y”, where in the first case X is “0“ and Y is “anywhere” and in the second X is “00000000” and Y is “any position that’s a multiple of 8 bits”.
(Of course there are plenty of other prefix-free sets of bitstrings. For instance: interpret blocks of 8 bits as characters as in Kindly’s second example, and say that a valid program is anything that begins with a “(”, has properly matched parentheses, and ends with the ”)” matching the first “(”. This doesn’t have the convenient property that every bitstring begins with a valid program; for examples that do, take the exponential Golomb coding or the Elias omega coding of natural numbers.
Use a prefix-free encoding for the hypotheses. There’s not 2^n hypotheses of length n: Some of the length-n bitstrings are incomplete and you’d need to add more bits in order to get a hypothesis; others are actually a length <n hypothesis plus some gibberish on the end.
Then the sum of the probabilities of all programs of all lengths combined is 1.0. After excluding the programs that don’t halt, the normalization constant is Chaitin’s Omega.
Unfortunately Chaitin’s Omega’s incomputable, but even if it wasn’t I don’t see how it would work as a normalizing constant. Chaitin’s Omega is a real number, there is an infinite number of hypotheses, and (IIRC) there is no real number r such that r multiplied by infinite equals one, so I don’t see how Chaitin’s Omega could possible work as a normalizing constant.
If we assign a value of 2^-n to each hypothesis (that is, each halting program) of length n, then the total value summed over all hypotheses is Chaitin’s Omega.
Thus, if we assign a value of 2^-n/Omega to each hypothesis, the total is 1, and this gives us a valid probability distribution.
That would assign a zero probability of hypotheses that take more than n bits to specify, would it not? That sounds far from optimal.
Sorry, I mean that we do this for each value of n. So, given a hypothesis H, it is assigned a probability of 2^-length(H)/Omega.
I still don’t see how that would make all the hypotheses sum to 1. Wouldn’t that only make the probabilities of all the hypotheses of length n sum to 1, and thus make the sum of all hypotheses sum to > 1? For example, consider all the hypotheses of length 1. Assuming Omega = 1 for simplicity, there are 2 hypotheses, each with a probability of 2^-1/1 = 0.5, making all them sum to 1. There are 4 hypotheses of length 2, each with a probability of 2^-2/1 = 0.25, making them sum to 1. Thus, the sum of the probabilities of all hypotheses of length ⇐ 2 = 2 > 1.
Is Omega doing something I don’t understand? Would all hypotheses be required to be some set length?
I think you may be missing two things.
First: the hypotheses/programs (recall: we are representing hypotheses by computer programs in some suitable language) need to be written in a “prefix-free” representation, meaning one in which e.g. if 100111 is a valid program then nothing else beginning 100111 can be. So you don’t get two programs of length 1, four of length 2, etc. If you do this, and if you also arrange your coding so that every infinite string of bits starts with some legal program, then the sum over all legal programs of 2^-length is exactly 1. (So Pr(program) = 2^-length(program) defines a probability distribution over programs. It’s the same probability distribution as you get from the following procedure: keep generating random bits until the bits you’ve generated form a legal program.)
Second: you can’t just take Ω=1 “for simplicity”; Ω is defined to be what you get by adding up 2^-length for all programs that halt. (Remember that hypotheses are being represented as computer programs here.) Equivalently, if you imagine picking a random well-formed program by choosing random bits, Ω is the probability that the resulting program halts. So it’s certainly <1.
[EDITED to fix an ugly but hopefully not confusing typo.]
The hypotheses are encoded in a prefix-free way: no hypothesis is a prefix of another hypothesis.
In particular, this eliminates your example: if both strings of length 1 (“0” and “1“) are valid hypotheses, then no string of longer length can be a valid hypothesis (then we can take Omega=1 and assign a probability of 1⁄2 to each). Or we could have “0”, “10”, and “110” be valid hypotheses; then we can take Omega = 7⁄8 and assign probabilities of 4⁄7, 2⁄7 and 1⁄7 to these.
In general, the prefix-free condition ensures that the sum over all hypotheses converges, by Kraft’s inequality, which is the real heavy lifter here; the constant is merely there to make the sum over all hypotheses converge to 1 instead.
You might imagine that hypotheses are required to be grammatically correct English sentences (I guess encoded in ASCII or something). In that case, no hypothesis is a prefix of another, because each hypothesis ends with a period.
Right; I forgot that it used a prefix-free encoding. Apologies if the answer to this is painfully obvious, but does having a prefix-free encoding entail that there is a finite number of possible hypotheses?
It doesn’t: for example, the set of strings 0, 10, 110, 1110, 11110, 111110, … is infinite and prefix-free. So is the set of C-style (null-terminated) strings.
I see. Does the method of normalization you gave work even when there is an infinite number of hypotheses?
It does. The infinite sum will always converge, so by normalizing we can make it converge to 1.
Take gjm’s approach to explaining the normalization, in which the initial weight of 2^-length assigned to a hypothesis is the probability that you obtain that hypothesis by choosing bits at random. Then the normalization is just a conditional probability: you divide by the probability that, when choosing bits at random, you do eventually end up hitting a hypothesis.
Remark: these are special cases of the schema “all strings that contain exactly one instance of pattern X in possible positions Y”, where in the first case X is “0“ and Y is “anywhere” and in the second X is “00000000” and Y is “any position that’s a multiple of 8 bits”.
(Of course there are plenty of other prefix-free sets of bitstrings. For instance: interpret blocks of 8 bits as characters as in Kindly’s second example, and say that a valid program is anything that begins with a “(”, has properly matched parentheses, and ends with the ”)” matching the first “(”. This doesn’t have the convenient property that every bitstring begins with a valid program; for examples that do, take the exponential Golomb coding or the Elias omega coding of natural numbers.