Solomonoff induction on a random string
So, I’ve been hearing a lot about the awesomeness of Solomonoff induction, at least as a theoretical framework. However, my admittedly limited understanding of Solomonoff induction suggests that it would form an epicly bad hypothesis if given a random string. So my question is, if I misunderstood, how does it deal with randomness? And if I understood correctly, isn’t this a rather large gap?
Edit: Thanks for all the comments! My new understanding is that Solomonoff induction is able to understand that it is dealing with a random string (because it finds itself giving equal weight to programs that output a 1 or a 0 for the next bit), but despite that is designed to keep looking for a pattern forever. While this is a horrible use of resources, the SI is a theoretical framework that has infinite resources, so that’s a meaningless criticism. Overall this seems acceptable, though if you want to actually implement a SI you’ll need to instruct it on giving up. Furthermore, the SI will not include randomness as a distinct function in its hypothesis, which could lead to improper weighting of priors, but will still have good predictive power -- and considering that Solomonoff induction was only meant for computable functions, this is a pretty good result.
A truly perfectly random string isn’t really interesting because nothing predicts it. So I’m going to talk about biased randomness.
I’m also going to assume we’re working with the “outputs a single prediction” variant of Solomonoff Inductors, instead of the ones that output probability distributions (where the answer is kind of clearly “Yes they deal with it”). Oh, and I should note that Solomonoff Inductors are hilariously intractable, and unfortunately Cartesian in their workings.
With that aside...
Suppose you generate a random string by rolling a die and recording “Off” when it comes up 1, and “On” when it comes up 2,3,4,5, or 6. How well will a Solomonoff Inductor predict this sequence?
The thing to notice is that although the input is not algorithmic, it is compressible. Because “On” is so much more likely than “Off”, we can save space by using shorter encodings for strings like “On On On” than for “Off Off Off”. In fact, by using Arithmetic Coding, we can re-encode our sequences to use only 65% as many On/Off characters (on average).
When you do Solomonoff Induction, the next prediction is dominated by the shortest programs that haven’t been eliminated. What will these programs look like in the case of our biased sequence? They will be a compressed encoding of the sequence, alongside a decoder. The encoding schemes that best match the input will dominate the predictions, and these will be the ones that model it as a biased random sequence and assume the right probability (like a well tuned arithmetic coder).
What do you get when you look at an arithmetic coder’s next output, giving it a perfectly random input if it needs more data before producing another output? You get a biased random value. In our case it will be “On” about 5⁄6 of the time, and “Off” 1⁄6 of the time.
So, when the Solomonoff Inductor looks at the next output when making a prediction… a whole bunch of the shortest programs are outputting “On” 5⁄6 of the time and “Off” 1⁄6 of the time. Which will push the predictions towards the right probabilities! Instead of a single program determining the prediction, we get a huge group of programs working together to push the prediction in the right direction.
There will still be skew in the results. It’s not hard to come up with program encodings that strongly favor “Off”, for example. Nevertheless, there is at least some built-in functionality for dealing with randomness.
What do you mean? If you feed SI a stream of i.i.d. random bits which are 1 with probability 5⁄6 and 0 with probability 1⁄6, then SI’s beliefs about the next bit will converge to the true distribution pretty fast, no matter what encoding of programs is used. The easiest way to see that is by noticing that the true distribution is computable (which just means the probability of any bit string is computable from that bit string), therefore it can’t get a higher log score than SI in the long run (there’s a general result saying that).
Keep in mind that I’m talking about SI where the programs output single values as predictions, instead of probability distributions. So each is eliminated all at once or not at all, instead of being proportionally penalized.
An example where SI will do worse is if your program encoding simply makes it easier to output a zero than a one. For example, if the program encoding is ternary and twos must mean “OutputAnotherZeroOnHalt”. This makes twos totally useless except for appending zeroes to your output. In that case extending the best-performing programs (as must be done to predict more partially-compressible values) is going to tend to introduce a lot more zeroes than the thing being modeled would. This will skew the predictions towards “next value zero” by a fixed amount, no matter how long you run the inductor.
Let’s reformulate it in terms of games. For example, asking SI to output probabilities is equivalent to making it play a game where rewards are proportional to log score. As far as I can tell, you’re talking about a different game where SI outputs a single guess at each step, and wins a dollar for each correct guess. In that game I do know several ways to construct input sequences that allow a human to beat SI by an unbounded amount of money, but I’d be surprised if a sequence of i.i.d. random bits was also such a sequence. And I’d be even more surprised if it depended on the encoding of programs used by SI. Not saying that you’re wrong, but can you try to sketch a more detailed proof? In particular, can you explain how it gets around the fact that SI’s probability of the next bit being 1 converges to 5/6?
I’m saying that SI’s probability of next bit being 1 doesn’t converge to 5⁄6, for some encodings and (very importantly) using single outputs instead of probability distributions.
For example, suppose we take a program encoding where it does converge to 5⁄6 then expand it so every “0” becomes a “00“ and every “1” becomes a “01”. Then we add rules to the encoding such that, for every “10” or “11″ starting at an even position (so the original re-coding would have none), an extra zero is appended to the program’s output just before halting.
Suppose that, after 100 values in the sequence, our new encoding had somehow converged on 5⁄6 chance of “1”. What happens when we extend the shortest satisfying programs by two more characters? Well, for each shortest program, putting a “01“ or “00” at the end will give the 5⁄6 proportionality we want. But putting them elsewhere will likely break the intermediate sequence, so those programs will be eliminated and not count. But most importantly, all those places we can’t put a “01” or “00” will take a “10“ or “11” without breaking the sequence so far. So there’s 1 place we can put a new character to get 5⁄6 as desired, and (about) sixty five places to put the dumb “then append 0” characters.
So starting from the 5⁄6 we’re supposed to converge to, we diverge to ~1/78. And this is a problem with the encoding that making the programs longer doesn’t fix. In fact, it makes it worse as there are proportionally more and more places to put a “10” and “11″. Even though all of those programs with dumb characters are getting eliminated anytime a “1” is output, the space of programs is so infested with them that the probability still converges to 0 instead of 5⁄6.
Maybe we’re using different definitions of SI? The version that I’m thinking of (which I believe is the original version) quantifies over all possible deterministic programs that output a sequence of bits, using an arbitrary encoding of programs. Then it sums up the weights of all programs that are consistent with the inputs seen so far, and outputs a probability distribution. That turns out to be equivalent to a weighted mixture of all possible computable distributions, though the proof isn’t obvious. If we need to adapt this version of SI to output a single guess at each step, we just take the guess with the highest probability.
Does that sound right? If yes, then this page from Li and Vitanyi’s book basically says that SI’s probability of the next bit being 1 converges to 5⁄6 with probability 1, unless I’m missing something.
That’s not the same definition I was using.
I said the programs have a single output, instead of a probability distribution. It matches the sequence so far or it doesn’t, maybe programs are 100% eliminated or 100% not eliminated. The probabilistic nature comes entirely from the summing-over-all-programs part.
If programs can output a probability distribution then clearly the program “return {0:1/6, 1:5/6}” will do very well and cause the results to converge appropriately.
Yes, I’m using the same definition. The “deterministic programs” mentioned in my comment are programs that output a sequence of bits, not programs that output a probability distribution.
That definition is equivalent to a mixture of all possible computable distributions. I suppose that equivalence is an important and surprising fact about SI: how come it makes no difference whether you quantify over programs that output a sequence of bits, or programs that output a probability distribution? But yes, it is true.
He has repeatedly said that he’s talking about an SI that outputs a specific prediction instead of a probability distribution of them, and you even quoted him saying so.
It seems to me that the arithmetic decoding programs you mention in your first comment churn ad nauseam on their infinite compressed stream. So they don’t halt and the instructions “10” and “11″ won’t matter. SI picks from a space of infinite programs, so the instructions can’t wait until the end of the stream either.
What can happen, closest to the skew you mention I can think of, is that a program can contain code to stop arithmetic decoding after the first 100 values and output zeros from then on. This code carries a penalty which increases with the number of values it needs to count to. Which should make the weight of the program no greater than 1/n where n is the number of observed values.
Please, correct me if I’m wrong, I’m just learning.
I was thinking of each program as emitting a finite sequence and that was the prediction. As the target sequence got longer, you’d be using larger programs which halted after a longer time. It’s not too hard to change the rules so to make non-halting variants also fail.
For example, suppose I create a program encoding that unfairly favors direct output. If the first bit is “1” then the output is just the remaining bits. If the first bit is “0″ then it’s a normal encoding… except only every tenth bit matters. The other 90% of bits are simply ignored and inaccessible. This penalizes the 5⁄6 arithmetic encoder so much that it is beaten by using the raw encoding solution, and you’ll find the prediction staying near 50⁄50 instead of 5⁄6.
I do think some variants of SI work even for maliciously chosen program encodings. It helps to output probability distributions, and it helps to react to input instead of having unconditional output. But clearly not all variants are secure against bad encodings.
In principle, SI is choosing fairly from a space of infinite programs. It’s only practical to see some programs as finite with weight proportional to the weight of all the infinite programs this finite program can be extended into. But no program knows its length, unless it explicitly counts to when to stop.
The wasteful encoding you propose does not make a difference to SI. What the wasteful encoding does is that the arithmetic encoding programs will be 10 times longer and thus 2^10 times more penalized, but there will be 2^10 times more of them. So in the sum-over-all-programs, arithmetic coding programs will take over the direct output programs just the same as before.
Programs that are 10 bits longer are penalized by 2^10. Programs that are 10 times longer are penalized by 2^(10n), where n is the size of the original program. The penalty isn’t washed out over time… it gets significantly worse.
Yes, you’re right, I was sloppy. Still, the programs are exactly that much more numerous, so their weight ends up being the same in your wasteful encoding as in a sane encoding.
Hmmm, right. It’s not enough to ignore the intermediate bits. Have to make them break the program unless they are all zero. Like if any of them are 1 then the program had no output except “syntax error” (but the raw output still allows them).
I see. And don’t know the answer. I’m curious how SI fends off this one.
The point of SI is that the bias inherent with the choice of the universal Turing machine is washed out as more data are observed.
Not in the variant I described… which probably means it violates some precondition of the optimality proof and that I shouldn’t call it a Solomonoff Inductor. It still makes predictions by weighting and eliminating programs, but in too simplistic a way.
I don’t think so.
Let X be the sequence of bits so far an y be the next bit to predict. SI (in the specific version we are discussing) looks for short programs which output Xy and then halt.
If X is empty then the output is just the initial bias of the inductor. In your example it will presumably output a 0 because for any program length it has more programs which output 0 and halt than programs which output 1 and halt (assuming that if you removed the “2″ opcode it would get a roughly half split).
If X is not empty, there are programs made of a prefix P which just computes X followed by some suffix S which computes y. If we restrict the inductor to these programs it will just keep outputting whatever its initial bias was, and not learn anything.
But the Solomonoff inductor is not restricted to such programs.
Any finite X of length n is the prefix of some (actually, infinitely many) computable infinite sequences. Let Z_X be a non-halting program which generates one of these infinite sequences. We can therefore generate X by the program:
”Run Z_X in an interpreter until it has emitted n bits, then halt”
And we can extend this to also generate the next bit y as:
”Run Z_X in an interpreter until it has emitted n+1 bits, then halt”
note that the length difference between these two program is less than one bit on average, because encoding n takes log(n)+log(log(n)) bits. In the previous case, on the other hand, the length difference between P and PS is always at least one bit.
Therefore, as n increases, the Solomonoff inductor increasingly gives preference to the second type of programs, which notably don’t contain your dreaded “2” opcode.
Basically what Oscar_Cunningham said. If SI receives a random string sampled from the uniform prior (i.e. a string of independent fair coinflips), its posterior beliefs about the next bits will also agree with the uniform prior pretty well, because the uniform prior is computable and has a nonzero weight in the mixture. And the probabilities assigned to the next bit by SI will approach 50% pretty fast, because SI’s accumulated log score is not worse than any computable prior, including the uniform prior. I was also confused by this before, see this thread, especially the theorem that Wei Dai linked to.
It is important here to distinguish between two models of SI: there is one which regards the universal prior as a probability distribution over programs that generate a definite output, and there is another one that considers the universal prior over a set of computable distributions (that must contain the correct one).
The first SI, given a random string (that is, incompressible), will generate hypothesis with the same length of the string, since it’s constantly pruning those hypothesis that doesn’t match exactly with the given input.
The second SI, given a random string (that is, drawn from a uniform distribution), will with probability 1 assign a very high probability to the uniform distribution.
For posterity, the convention is to call the two models Universal/Solomonoff prior M and Universal/Levin mixture ξ, respectively.
I’m not sure why we need to make that distinction. The Solomonoff and Levin constructions are equivalent. The prior built from all deterministic programs that output bit strings, and the prior built from all computable probability distributions, turn out to be the same prior. See e.g. here for proofs and references.
They’re equivalent from the point of view of a consumer of the prediction, they’re not equivalent from the point of view of an implementation. And since this is a discussion about how does it work, the distinction is useful.
Then I’m confused, because the two would seem to produce two very different answers on the same string.
Since a string with very high Kolmogorov complexity can be clearly produced by a uniform distribution, the Solomonoff prior would converge to a very high complexity hypothesis, while the Levin mixture would just assign 0.5 to 0 and 0.5 to 1.
What am I missing here?
The Solomonoff prior would have many surviving hypotheses at each step, and the total weight of those that predict a 0 for the next bit would be about equal to the total weight of those that predict a 1. If the input distribution is biased, e.g. 0 with probability 5⁄6 and 1 with probability 1⁄6, then the Solomonoff prior will converge on that as well. That works for any computable input distribution, with probability 1 according to the input distribution.
nitpick: the prior does not converge, the prior is what you have before you start observing data, then it is a posterior.
Many thanks, I get it now.
What matters is the probability that they assign to the next bit being equal to one.
Truly random data is incompressible in the average case by the pigeonhole principle
Solomonoff induction still tries though. It assumes there is always more signal in the noise. I’m not sure how you would justify stopping that search, how can you ever be certain there’s not some complex signal we just haven’t found yet?
But you should end up with a bunch of theories with similar kolmogorov complexity.
I think you can justify stopping the search when you are hitting your resource limits and have long since ceased to find additional signal. You could be wrong, but it seems justified.
Solomonoff induction has no resource limit. It’s a theoretical framework for understanding machine learning when resources are not an issue, not an engineering proposal.
But it seems to me rather different to assume you can do any finite amount of calculation, vs relying on things that can only be done with infinite calculation. Can we ever have a hope of having infinite resources?
I think epsilon.
Just to clarify, though: in using universal induction, every hypothesis is finite in size, thereby at no point some process needs to run an infinite program to discover its output. The infinite part is of course the size of memory space, used to hold an infinite prior, and the infinite speed of its update.
Well when it tries to guess the next bit it gets 50% of its guesses right, which is as good as anything else.
Would it be legitimate to ask the SI to estimate the probability that its guess is correct? I suppose that if it sums up its programs’ estimates as to the next bit and finds itself predicting a 50% chance either way, it at least understands that it is dealing with random data but is merely being very persistent in looking for a pattern just in case it merely seemed random? That’s not as bad as I thought at first.
But, given 1 terabyte of data, will it not generate a ~1 terabyte program as it’s hypothesis? Even if it is as accurate as the best answer, this seems like a flaw.
Remember that SI works by accounting for all the infinite multitude of hypothesis that can generate the given string. Given an algorithmically random TB of data, SI will take into consideration surely a TB hypothesis with high probability but also all the bigger hypothesis with exponentially lower probabilities.
OK, so it will predict one of multiple different ~ 1 terabyte programs as having different likelihoods. I’d still rather it predict random{0,1} for less than 10 bytes, as the most probable. Inability to recognize noise as noise seems like a fundamental problem.
random{0,1} is not an algorithm, so...
Explained here.
There is an SI that works over programs (which I was referring to originally), and there is an SI that works over computable distributions (that will produce random{0,1} with high probability).
No, it will always make a prediction according to the the infinitely many programs which are consistent with the observed string. In the observed string is 1 terabyte of uniform random noise, the shortest of these programs will be most likely ~ 1 terabyte long, but Solomonoff induction also considers the longer ones.
I think it’s instructive to compare with the red card—blue card experiment. Like the human subjects, Solomonoff induction will entertain very complicated hypotheses, but this is fine because it doesn’t select its next card based on just one of them. A decision rule like AIXI will settle down on ~70% probability of blue, and then pick blue every time.