Strilanc comments on Solomonoff induction on a random string

Strilanc 9 Apr 2014 2:02 UTC
7 points
A truly perfectly random string isn’t really interesting because nothing predicts it. So I’m going to talk about biased randomness.

I’m also going to assume we’re working with the “outputs a single prediction” variant of Solomonoff Inductors, instead of the ones that output probability distributions (where the answer is kind of clearly “Yes they deal with it”). Oh, and I should note that Solomonoff Inductors are hilariously intractable, and unfortunately Cartesian in their workings.

With that aside...

Suppose you generate a random string by rolling a die and recording “Off” when it comes up 1, and “On” when it comes up 2,3,4,5, or 6. How well will a Solomonoff Inductor predict this sequence?

The thing to notice is that although the input is not algorithmic, it is compressible. Because “On” is so much more likely than “Off”, we can save space by using shorter encodings for strings like “On On On” than for “Off Off Off”. In fact, by using Arithmetic Coding, we can re-encode our sequences to use only 65% as many On/Off characters (on average).
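
As a quick check on the 65% figure, here is a minimal sketch that computes the Shannon entropy of the die-roll source, which is the best average compression any coder can reach. The function name `entropy_bits` is made up for illustration and is not from the original comment.

```python
import math

def entropy_bits(p_on: float) -> float:
    """Shannon entropy, in bits per symbol, of an On/Off source with P(On) = p_on."""
    p_off = 1.0 - p_on
    return -(p_on * math.log2(p_on) + p_off * math.log2(p_off))

# Die-roll source from the comment: "On" for 2-6, "Off" for 1.
H = entropy_bits(5 / 6)
print(f"{H:.3f} bits per On/Off symbol")  # prints "0.650 bits per On/Off symbol"
```

A fair coin gives `entropy_bits(0.5) == 1.0`, i.e. no compression at all, which is why the unbiased case is the uninteresting one.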

When you do Solomonoff Induction, the next prediction is dominated by the shortest programs that haven’t been eliminated. What will these programs look like in the case of our biased sequence? They will be a compressed encoding of the sequence, alongside a decoder. The encoding schemes that best match the input will dominate the predictions, and these will be the ones that model it as a biased random sequence and assume the right probability (like a well tuned arithmetic coder).

What do you get when you look at an arithmetic coder’s next output, giving it a perfectly random input if it needs more data before producing another output? You get a biased random value. In our case it will be “On” about ⁵⁄₆ of the time, and “Off” ¹⁄₆ of the time.
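
To make that concrete, here is a small sketch of a two-symbol arithmetic decoder fed with fair coin flips. It is an illustration under stated assumptions, not a production coder: the names `P_ON` and `decode_random_bits` are invented here, and exact `Fraction` arithmetic is used instead of the fixed-width integer tricks real coders rely on.

```python
import random
from fractions import Fraction

P_ON = Fraction(5, 6)  # probability the decoder assigns to "On" (assumed model)

def decode_random_bits(n_symbols, rng):
    """Decode n_symbols from a stream of fair random bits.

    [lo, hi) is what is known so far about the code value, rescaled after
    each emitted symbol so that "On" always owns [0, P_ON).
    Returns (symbols, number of input bits consumed).
    """
    lo, hi = Fraction(0), Fraction(1)
    symbols, bits_used = [], 0
    while len(symbols) < n_symbols:
        if hi <= P_ON:            # interval lies entirely inside "On"
            symbols.append("On")
            lo, hi = lo / P_ON, hi / P_ON
        elif lo >= P_ON:          # interval lies entirely inside "Off"
            symbols.append("Off")
            lo, hi = (lo - P_ON) / (1 - P_ON), (hi - P_ON) / (1 - P_ON)
        else:                     # undecided: read one more random bit
            mid = (lo + hi) / 2
            if rng.getrandbits(1):
                lo = mid
            else:
                hi = mid
            bits_used += 1
    return symbols, bits_used

rng = random.Random(0)
symbols, bits_used = decode_random_bits(1000, rng)
on_fraction = symbols.count("On") / len(symbols)
bits_per_symbol = bits_used / len(symbols)
print(f"On fraction: {on_fraction:.3f}, input bits per symbol: {bits_per_symbol:.3f}")
```

The "On" fraction should land near ⁵⁄₆, and the decoder should consume roughly 0.65 random bits per output symbol, matching the compression ratio mentioned earlier.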

So, when the Solomonoff Inductor looks at the next output when making a prediction… a whole bunch of the shortest programs are outputting “On” ⁵⁄₆ of the time and “Off” ¹⁄₆ of the time. Which will push the predictions towards the right probabilities! Instead of a single program determining the prediction, we get a huge group of programs working together to push the prediction in the right direction.

There will still be skew in the results. It’s not hard to come up with program encodings that strongly favor “Off”, for example. Nevertheless, there is at least some built-in functionality for dealing with randomness.
- cousin_it 9 Apr 2014 10:21 UTC
  2 points
  Parent
  
  There will still be skew in the results. It’s not hard to come up with program encodings that strongly favor “Off”, for example.
  
  What do you mean? If you feed SI a stream of i.i.d. random bits which are 1 with probability ⁵⁄₆ and 0 with probability ¹⁄₆, then SI’s beliefs about the next bit will converge to the true distribution pretty fast, no matter what encoding of programs is used. The easiest way to see that is by noticing that the true distribution is computable (which just means the probability of any bit string is computable from that bit string), therefore it can’t get a higher log score than SI in the long run (there’s a general result saying that).
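  
  For reference, the general result is usually stated as a dominance bound. A sketch in standard notation, where the symbols are labels introduced here rather than taken from the thread: M is SI's universal mixture, μ is the true computable distribution, and K(μ) is the length of the shortest program computing μ.
  
  ```latex
  M(x_{1:n}) \;\ge\; 2^{-K(\mu) - O(1)}\,\mu(x_{1:n})
  \quad\Longrightarrow\quad
  -\log_2 M(x_{1:n}) \;\le\; -\log_2 \mu(x_{1:n}) + K(\mu) + O(1)
  ```
  
  So SI's cumulative log loss exceeds that of the true distribution by at most a constant that does not grow with n.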
  - Strilanc 9 Apr 2014 12:19 UTC
    1 point
    Parent
    Keep in mind that I’m talking about SI where the programs output single values as predictions, instead of probability distributions. So each is eliminated all at once or not at all, instead of being proportionally penalized.
    
    An example where SI will do worse is if your program encoding simply makes it easier to output a zero than a one. For example, if the program encoding is ternary and twos must mean “OutputAnotherZeroOnHalt”. This makes twos totally useless except for appending zeroes to your output. In that case extending the best-performing programs (as must be done to predict more partially-compressible values) is going to tend to introduce a lot more zeroes than the thing being modeled would. This will skew the predictions towards “next value zero” by a fixed amount, no matter how long you run the inductor.
    - cousin_it 9 Apr 2014 13:24 UTC
      2 points
      Parent
      
      Keep in mind that I’m talking about SI where the programs output single values as predictions, instead of probability distributions.
      
      Let’s reformulate it in terms of games. For example, asking SI to output probabilities is equivalent to making it play a game where rewards are proportional to log score. As far as I can tell, you’re talking about a different game where SI outputs a single guess at each step, and wins a dollar for each correct guess. In that game I do know several ways to construct input sequences that allow a human to beat SI by an unbounded amount of money, but I’d be surprised if a sequence of i.i.d. random bits was also such a sequence. And I’d be even more surprised if it depended on the encoding of programs used by SI. Not saying that you’re wrong, but can you try to sketch a more detailed proof? In particular, can you explain how it gets around the fact that SI’s probability of the next bit being 1 converges to 5/6?
      - Strilanc 9 Apr 2014 13:50 UTC
        1 point
        Parent
        I’m saying that SI’s probability of the next bit being 1 doesn’t converge to ⁵⁄₆ for some encodings, provided (very importantly) that programs produce single outputs instead of probability distributions.
        
        For example, suppose we take a program encoding where it does converge to ⁵⁄₆, then expand it so every “0” becomes a “00” and every “1” becomes a “01”. Then we add rules to the encoding such that, for every “10” or “11” starting at an even position (so the original re-coding would have none), an extra zero is appended to the program’s output just before halting.
        
        Suppose that, after 100 values in the sequence, our new encoding had somehow converged on a ⁵⁄₆ chance of “1”. What happens when we extend the shortest satisfying programs by two more characters? Well, for each shortest program, putting a “01” or “00” at the end will give the ⁵⁄₆ proportionality we want. But putting them elsewhere will likely break the intermediate sequence, so those programs will be eliminated and not count. Most importantly, all those places where we can’t put a “01” or “00” will take a “10” or “11” without breaking the sequence so far. So there’s 1 place we can put a new character to get ⁵⁄₆ as desired, and (about) sixty-five places to put the dumb “then append 0” characters.
        
        So starting from the ⁵⁄₆ we’re supposed to converge to, we diverge to ~1/78. And this is a problem with the encoding that making the programs longer doesn’t fix. In fact, it makes it worse as there are proportionally more and more places to put a “10” or “11”. Even though all of those programs with dumb characters are getting eliminated anytime a “1” is output, the space of programs is so infested with them that the probability still converges to 0 instead of ⁵⁄₆.
        cousin_it 9 Apr 2014 14:40 UTC
        3 points
        Parent
        Maybe we’re using different definitions of SI? The version that I’m thinking of (which I believe is the original version) quantifies over all possible deterministic programs that output a sequence of bits, using an arbitrary encoding of programs. Then it sums up the weights of all programs that are consistent with the inputs seen so far, and outputs a probability distribution. That turns out to be equivalent to a weighted mixture of all possible computable distributions, though the proof isn’t obvious. If we need to adapt this version of SI to output a single guess at each step, we just take the guess with the highest probability.
        
        Does that sound right? If yes, then this page from Li and Vitanyi’s book basically says that SI’s probability of the next bit being 1 converges to ⁵⁄₆ with probability 1, unless I’m missing something.
        Strilanc 10 Apr 2014 4:03 UTC
        1 point
        Parent
        That’s not the same definition I was using.
        
        I said the programs have a single output, instead of a probability distribution. It matches the sequence so far or it doesn’t, so programs are either 100% eliminated or 100% not eliminated. The probabilistic nature comes entirely from the summing-over-all-programs part.
        
        If programs can output a probability distribution then clearly the program “return {0:1/6, 1:5/6}” will do very well and cause the results to converge appropriately.
        cousin_it 10 Apr 2014 8:50 UTC
        1 point
        Parent
        
        I said the programs have a single output, instead of a probability distribution. It matches the sequence so far or it doesn’t, so programs are either 100% eliminated or 100% not eliminated. The probabilistic nature comes entirely from the summing-over-all-programs part.
        
        Yes, I’m using the same definition. The “deterministic programs” mentioned in my comment are programs that output a sequence of bits, not programs that output a probability distribution.
        
        That definition is equivalent to a mixture of all possible computable distributions. I suppose that equivalence is an important and surprising fact about SI: how come it makes no difference whether you quantify over programs that output a sequence of bits, or programs that output a probability distribution? But yes, it is true.
        christopherj 9 Apr 2014 15:44 UTC
        −2 points
        Parent
        He has repeatedly said that he’s talking about an SI that outputs a specific prediction instead of a probability distribution of them, and you even quoted him saying so.
        pivo 9 Apr 2014 21:01 UTC
        0 points
        Parent
        It seems to me that the arithmetic decoding programs you mention in your first comment churn ad nauseam on their infinite compressed stream. So they don’t halt and the instructions “10” and “11” won’t matter. SI picks from a space of infinite programs, so the instructions can’t wait until the end of the stream either.
        
        The closest thing to the skew you mention that I can think of is a program that contains code to stop arithmetic decoding after the first 100 values and output zeros from then on. This code carries a penalty which increases with the number of values it needs to count to, which should make the weight of the program no greater than 1/n, where n is the number of observed values.
        
        Please, correct me if I’m wrong, I’m just learning.
        Strilanc 10 Apr 2014 3:56 UTC
        0 points
        Parent
        I was thinking of each program as emitting a finite sequence, and that sequence was the prediction. As the target sequence got longer, you’d be using larger programs which halted after a longer time. It’s not too hard to change the rules to make non-halting variants also fail.
        
        For example, suppose I create a program encoding that unfairly favors direct output. If the first bit is “1” then the output is just the remaining bits. If the first bit is “0” then it’s a normal encoding… except only every tenth bit matters. The other 90% of bits are simply ignored and inaccessible. This penalizes the ⁵⁄₆ arithmetic encoder so much that it is beaten by using the raw encoding solution, and you’ll find the prediction staying near ⁵⁰⁄₅₀ instead of ⁵⁄₆.
        
        I do think some variants of SI work even for maliciously chosen program encodings. It helps to output probability distributions, and it helps to react to input instead of having unconditional output. But clearly not all variants are secure against bad encodings.
        pivo 10 Apr 2014 7:50 UTC
        0 points
        Parent
        In principle, SI is choosing fairly from a space of infinite programs. It’s only practical to see some programs as finite with weight proportional to the weight of all the infinite programs this finite program can be extended into. But no program knows its length, unless it explicitly counts to when to stop.
        
        The wasteful encoding you propose does not make a difference to SI. What the wasteful encoding does is that the arithmetic encoding programs will be 10 times longer and thus 2^10 times more penalized, but there will be 2^10 times more of them. So in the sum-over-all-programs, arithmetic coding programs will take over the direct output programs just the same as before.
        Strilanc 11 Apr 2014 3:23 UTC
        0 points
        Parent
        Programs that are 10 bits longer are penalized by 2^10. Programs that are 10 times longer carry a penalty of 2^(10n) rather than 2^n, where n is the size of the original program. The penalty isn’t washed out over time… it gets significantly worse.
        pivo 11 Apr 2014 8:10 UTC
        1 point
        Parent
        Yes, you’re right, I was sloppy. Still, the programs are exactly that much more numerous, so their weight ends up being the same in your wasteful encoding as in a sane encoding.
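        
        To spell out the counting, here is a sketch under the assumption that a sane program of n bits maps to a wasteful program of 10n bits, of which 9n are ignored (this framing and the notation are introduced here for illustration):
        
        ```latex
        \underbrace{2^{9n}}_{\text{settings of the ignored bits}}
        \times
        \underbrace{2^{-10n}}_{\text{weight of each padded program}}
        \;=\; 2^{-n}
        ```
        
        The right-hand side is the weight the original n-bit program had, so the padded copies jointly recover it.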
        Strilanc 11 Apr 2014 14:07 UTC
        0 points
        Parent
        Hmmm, right. It’s not enough to ignore the intermediate bits. You’d have to make them break the program unless they are all zero. For example, if any of them is 1 then the program has no output except “syntax error” (but the raw output mode still allows them).
        pivo 11 Apr 2014 15:53 UTC
        0 points
        Parent
        I see, and I don’t know the answer. I’m curious how SI fends off this one.
    - V_V 9 Apr 2014 16:11 UTC
      0 points
      Parent
      The point of SI is that the bias inherent in the choice of the universal Turing machine is washed out as more data are observed.
      - Strilanc 10 Apr 2014 3:59 UTC
        0 points
        Parent
        Not in the variant I described… which probably means it violates some precondition of the optimality proof and that I shouldn’t call it a Solomonoff Inductor. It still makes predictions by weighting and eliminating programs, but in too simplistic a way.
        V_V 10 Apr 2014 17:06 UTC
        0 points
        Parent
        I don’t think so.
        
        Let X be the sequence of bits so far and y be the next bit to predict. SI (in the specific version we are discussing) looks for short programs which output Xy and then halt.
        
        If X is empty then the output is just the initial bias of the inductor. In your example it will presumably output a 0 because for any program length it has more programs which output 0 and halt than programs which output 1 and halt (assuming that if you removed the “2” opcode it would get a roughly half split).
        If X is not empty, there are programs made of a prefix P which just computes X followed by some suffix S which computes y. If we restrict the inductor to these programs it will just keep outputting whatever its initial bias was, and not learn anything.
        But the Solomonoff inductor is not restricted to such programs.
        Any finite X of length n is the prefix of some (actually, infinitely many) computable infinite sequences. Let Z_X be a non-halting program which generates one of these infinite sequences. We can therefore generate X by the program:
        “Run Z_X in an interpreter until it has emitted n bits, then halt”
        And we can extend this to also generate the next bit y as:
        “Run Z_X in an interpreter until it has emitted n+1 bits, then halt”
        Note that the length difference between these two programs is less than one bit on average, because encoding n takes about log(n)+log(log(n)) bits. In the previous case, on the other hand, the length difference between P and PS is always at least one bit.
        Therefore, as n increases, the Solomonoff inductor increasingly gives preference to the second type of programs, which notably don’t contain your dreaded “2” opcode.
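        
        A rough check of the “less than one bit on average” claim, writing ℓ(n) for the cost of encoding n. The symbol ℓ and the choice of base-2 logs are assumptions made here for illustration. Averaging the step-by-step length increases from n to 2n telescopes:
        
        ```latex
        \ell(n) = \log_2 n + \log_2\log_2 n
        \qquad
        \frac{1}{n}\sum_{k=n}^{2n-1}\bigl[\ell(k+1)-\ell(k)\bigr]
        = \frac{\ell(2n)-\ell(n)}{n}
        = \frac{1 + \log_2\!\bigl(1 + \tfrac{1}{\log_2 n}\bigr)}{n}
        \;\le\; \frac{2}{n} \quad (n \ge 2)
        ```
        
        So the counter-based programs grow by a vanishing fraction of a bit per step, whereas extending P to PS costs at least one full bit every step.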