I’m saying that SI’s probability of next bit being 1 doesn’t converge to 5⁄6, for some encodings and (very importantly) using single outputs instead of probability distributions.
For example, suppose we take a program encoding where it does converge to 5⁄6 then expand it so every “0” becomes a “00“ and every “1” becomes a “01”. Then we add rules to the encoding such that, for every “10” or “11″ starting at an even position (so the original re-coding would have none), an extra zero is appended to the program’s output just before halting.
Suppose that, after 100 values in the sequence, our new encoding had somehow converged on 5⁄6 chance of “1”. What happens when we extend the shortest satisfying programs by two more characters? Well, for each shortest program, putting a “01“ or “00” at the end will give the 5⁄6 proportionality we want. But putting them elsewhere will likely break the intermediate sequence, so those programs will be eliminated and not count. But most importantly, all those places we can’t put a “01” or “00” will take a “10“ or “11” without breaking the sequence so far. So there’s 1 place we can put a new character to get 5⁄6 as desired, and (about) sixty five places to put the dumb “then append 0” characters.
So starting from the 5⁄6 we’re supposed to converge to, we diverge to ~1/78. And this is a problem with the encoding that making the programs longer doesn’t fix. In fact, it makes it worse as there are proportionally more and more places to put a “10” and “11″. Even though all of those programs with dumb characters are getting eliminated anytime a “1” is output, the space of programs is so infested with them that the probability still converges to 0 instead of 5⁄6.
Maybe we’re using different definitions of SI? The version that I’m thinking of (which I believe is the original version) quantifies over all possible deterministic programs that output a sequence of bits, using an arbitrary encoding of programs. Then it sums up the weights of all programs that are consistent with the inputs seen so far, and outputs a probability distribution. That turns out to be equivalent to a weighted mixture of all possible computable distributions, though the proof isn’t obvious. If we need to adapt this version of SI to output a single guess at each step, we just take the guess with the highest probability.
Does that sound right? If yes, then this page from Li and Vitanyi’s book basically says that SI’s probability of the next bit being 1 converges to 5⁄6 with probability 1, unless I’m missing something.
I said the programs have a single output, instead of a probability distribution. It matches the sequence so far or it doesn’t, maybe programs are 100% eliminated or 100% not eliminated. The probabilistic nature comes entirely from the summing-over-all-programs part.
If programs can output a probability distribution then clearly the program “return {0:1/6, 1:5/6}” will do very well and cause the results to converge appropriately.
I said the programs have a single output, instead of a probability distribution. It matches the sequence so far or it doesn’t, maybe programs are 100% eliminated or 100% not eliminated. The probabilistic nature comes entirely from the summing-over-all-programs part.
Yes, I’m using the same definition. The “deterministic programs” mentioned in my comment are programs that output a sequence of bits, not programs that output a probability distribution.
That definition is equivalent to a mixture of all possible computable distributions. I suppose that equivalence is an important and surprising fact about SI: how come it makes no difference whether you quantify over programs that output a sequence of bits, or programs that output a probability distribution? But yes, it is true.
He has repeatedly said that he’s talking about an SI that outputs a specific prediction instead of a probability distribution of them, and you even quoted him saying so.
It seems to me that the arithmetic decoding programs you mention in your first comment churn ad nauseam on their infinite compressed stream. So they don’t halt and the instructions “10” and “11″ won’t matter. SI picks from a space of infinite programs, so the instructions can’t wait until the end of the stream either.
What can happen, closest to the skew you mention I can think of, is that a program can contain code to stop arithmetic decoding after the first 100 values and output zeros from then on. This code carries a penalty which increases with the number of values it needs to count to. Which should make the weight of the program no greater than 1/n where n is the number of observed values.
Please, correct me if I’m wrong, I’m just learning.
I was thinking of each program as emitting a finite sequence and that was the prediction. As the target sequence got longer, you’d be using larger programs which halted after a longer time. It’s not too hard to change the rules so to make non-halting variants also fail.
For example, suppose I create a program encoding that unfairly favors direct output. If the first bit is “1” then the output is just the remaining bits. If the first bit is “0″ then it’s a normal encoding… except only every tenth bit matters. The other 90% of bits are simply ignored and inaccessible. This penalizes the 5⁄6 arithmetic encoder so much that it is beaten by using the raw encoding solution, and you’ll find the prediction staying near 50⁄50 instead of 5⁄6.
I do think some variants of SI work even for maliciously chosen program encodings. It helps to output probability distributions, and it helps to react to input instead of having unconditional output. But clearly not all variants are secure against bad encodings.
In principle, SI is choosing fairly from a space of infinite programs. It’s only practical to see some programs as finite with weight proportional to the weight of all the infinite programs this finite program can be extended into. But no program knows its length, unless it explicitly counts to when to stop.
The wasteful encoding you propose does not make a difference to SI. What the wasteful encoding does is that the arithmetic encoding programs will be 10 times longer and thus 2^10 times more penalized, but there will be 2^10 times more of them. So in the sum-over-all-programs, arithmetic coding programs will take over the direct output programs just the same as before.
Programs that are 10 bits longer are penalized by 2^10. Programs that are 10 times longer are penalized by 2^(10n), where n is the size of the original program. The penalty isn’t washed out over time… it gets significantly worse.
Yes, you’re right, I was sloppy. Still, the programs are exactly that much more numerous, so their weight ends up being the same in your wasteful encoding as in a sane encoding.
Hmmm, right. It’s not enough to ignore the intermediate bits. Have to make them break the program unless they are all zero. Like if any of them are 1 then the program had no output except “syntax error” (but the raw output still allows them).
I’m saying that SI’s probability of next bit being 1 doesn’t converge to 5⁄6, for some encodings and (very importantly) using single outputs instead of probability distributions.
For example, suppose we take a program encoding where it does converge to 5⁄6 then expand it so every “0” becomes a “00“ and every “1” becomes a “01”. Then we add rules to the encoding such that, for every “10” or “11″ starting at an even position (so the original re-coding would have none), an extra zero is appended to the program’s output just before halting.
Suppose that, after 100 values in the sequence, our new encoding had somehow converged on 5⁄6 chance of “1”. What happens when we extend the shortest satisfying programs by two more characters? Well, for each shortest program, putting a “01“ or “00” at the end will give the 5⁄6 proportionality we want. But putting them elsewhere will likely break the intermediate sequence, so those programs will be eliminated and not count. But most importantly, all those places we can’t put a “01” or “00” will take a “10“ or “11” without breaking the sequence so far. So there’s 1 place we can put a new character to get 5⁄6 as desired, and (about) sixty five places to put the dumb “then append 0” characters.
So starting from the 5⁄6 we’re supposed to converge to, we diverge to ~1/78. And this is a problem with the encoding that making the programs longer doesn’t fix. In fact, it makes it worse as there are proportionally more and more places to put a “10” and “11″. Even though all of those programs with dumb characters are getting eliminated anytime a “1” is output, the space of programs is so infested with them that the probability still converges to 0 instead of 5⁄6.
Maybe we’re using different definitions of SI? The version that I’m thinking of (which I believe is the original version) quantifies over all possible deterministic programs that output a sequence of bits, using an arbitrary encoding of programs. Then it sums up the weights of all programs that are consistent with the inputs seen so far, and outputs a probability distribution. That turns out to be equivalent to a weighted mixture of all possible computable distributions, though the proof isn’t obvious. If we need to adapt this version of SI to output a single guess at each step, we just take the guess with the highest probability.
Does that sound right? If yes, then this page from Li and Vitanyi’s book basically says that SI’s probability of the next bit being 1 converges to 5⁄6 with probability 1, unless I’m missing something.
That’s not the same definition I was using.
I said the programs have a single output, instead of a probability distribution. It matches the sequence so far or it doesn’t, maybe programs are 100% eliminated or 100% not eliminated. The probabilistic nature comes entirely from the summing-over-all-programs part.
If programs can output a probability distribution then clearly the program “return {0:1/6, 1:5/6}” will do very well and cause the results to converge appropriately.
Yes, I’m using the same definition. The “deterministic programs” mentioned in my comment are programs that output a sequence of bits, not programs that output a probability distribution.
That definition is equivalent to a mixture of all possible computable distributions. I suppose that equivalence is an important and surprising fact about SI: how come it makes no difference whether you quantify over programs that output a sequence of bits, or programs that output a probability distribution? But yes, it is true.
He has repeatedly said that he’s talking about an SI that outputs a specific prediction instead of a probability distribution of them, and you even quoted him saying so.
It seems to me that the arithmetic decoding programs you mention in your first comment churn ad nauseam on their infinite compressed stream. So they don’t halt and the instructions “10” and “11″ won’t matter. SI picks from a space of infinite programs, so the instructions can’t wait until the end of the stream either.
What can happen, closest to the skew you mention I can think of, is that a program can contain code to stop arithmetic decoding after the first 100 values and output zeros from then on. This code carries a penalty which increases with the number of values it needs to count to. Which should make the weight of the program no greater than 1/n where n is the number of observed values.
Please, correct me if I’m wrong, I’m just learning.
I was thinking of each program as emitting a finite sequence and that was the prediction. As the target sequence got longer, you’d be using larger programs which halted after a longer time. It’s not too hard to change the rules so to make non-halting variants also fail.
For example, suppose I create a program encoding that unfairly favors direct output. If the first bit is “1” then the output is just the remaining bits. If the first bit is “0″ then it’s a normal encoding… except only every tenth bit matters. The other 90% of bits are simply ignored and inaccessible. This penalizes the 5⁄6 arithmetic encoder so much that it is beaten by using the raw encoding solution, and you’ll find the prediction staying near 50⁄50 instead of 5⁄6.
I do think some variants of SI work even for maliciously chosen program encodings. It helps to output probability distributions, and it helps to react to input instead of having unconditional output. But clearly not all variants are secure against bad encodings.
In principle, SI is choosing fairly from a space of infinite programs. It’s only practical to see some programs as finite with weight proportional to the weight of all the infinite programs this finite program can be extended into. But no program knows its length, unless it explicitly counts to when to stop.
The wasteful encoding you propose does not make a difference to SI. What the wasteful encoding does is that the arithmetic encoding programs will be 10 times longer and thus 2^10 times more penalized, but there will be 2^10 times more of them. So in the sum-over-all-programs, arithmetic coding programs will take over the direct output programs just the same as before.
Programs that are 10 bits longer are penalized by 2^10. Programs that are 10 times longer are penalized by 2^(10n), where n is the size of the original program. The penalty isn’t washed out over time… it gets significantly worse.
Yes, you’re right, I was sloppy. Still, the programs are exactly that much more numerous, so their weight ends up being the same in your wasteful encoding as in a sane encoding.
Hmmm, right. It’s not enough to ignore the intermediate bits. Have to make them break the program unless they are all zero. Like if any of them are 1 then the program had no output except “syntax error” (but the raw output still allows them).
I see. And don’t know the answer. I’m curious how SI fends off this one.