Mathematical Inconsistency in Solomonoff Induction?
What counts as a hypothesis for Solomonoff induction? The general impression I’ve gotten in various places is “a hypothesis can be anything (that you could write down)”. But I don’t think that’s quite it. E.g. evidence can be written down but is treated separately. I think a hypothesis is more like a computer program that outputs predictions about what evidence will or will not be observed.
If X and Y are hypotheses, then is “X and Y” a hypothesis? “not X”? “X or Y?” If not, why not, and where can I read a clear explanation of the rules and exclusions for Solomonoff hypotheses?
If using logic operators with hypotheses does yield other hypotheses, then I’m curious about a potential problem. When hypotheses are related, we can consider what their probabilities should be in more than one way. The results should always be consistent.
For example, suppose you have no evidence yet. And suppose X and Y are independent. Then you can calculate the probability of P(X or Y) in terms of the probability of P(X) and P(Y). You can also calculate the probability of all three based on their length (that’s the Solomonoff prior). These should always match but I don’t think they do.
The non-normalized probability of X is 1/2^len(X).
So you get:
P(X or Y) = 1/2^len(X) + 1/2^len(Y) − 1/2^(len(X)+len(Y))
and we also know:
P(X or Y) = 1/2^len(X or Y)
since the left hand sides are the same, that means the right hand sides should be equal, by simple substitution:
1/2^len(X or Y) = 1/2^len(X) + 1/2^len(Y) − 1/2^(len(X)+len(Y))
Which has to hold for any X and Y.
We can select X and Y to be the same length and to minimize compression gains when they’re both present, so len(X or Y) should be approximately 2len(X). I’m assuming a basis, or choice of X and Y, such that “or” is very cheap relative to X and Y, hence I approximated it to zero. Then we have:
1/2^2len(X) = 1/2^len(X) + 1/2^len(X) − 1/2^2len(X)
which simplifies to:
1/2^2len(X) = 1/2^len(X)
Which is false (since len(X) isn’t 0). And using a different approximation of len(X or Y) like 1.5len(X), 2.5len(X) or even len(X) wouldn’t make the math work.
So Solomonoff induction is inconsistent. So I assume there’s something I don’t know. What? (My best guess so far, mentioned above, is limits on what is a hypothesis.)
Also here’s a quick intuitive explanation to help explain what’s going on with the math: P(X) is both shorter and less probable than P(X or Y). Think about what you’re doing when you craft a hypotheses. You can add bits (length) to a hypothesis to exclude stuff. In that case, more bits (more length) means lower prior probability, and that makes sense, because the hypothesis is compatible with fewer things from the set of all logically possible things. But you can also add bits (length) to a hypothesis to add alternatives. It could be this or that or a third thing. That makes hypotheses longer but more likely rather than less likely. Also, speaking more generally, the Solomonoff prior probabilities are assigned according to length with no regard for consistency amongst themselves, so its unsurprising that they’re inconsistent unless the hypotheses are limited in such a way that they have no significant relationships with each other that would have to be consistent, which sounds hard to achieve and I haven’t seen any rules specified for achieving that (note that there are other ways to find relationships between hypotheses besides the one I used above, e.g. looking for subsets).
My understanding is that a hypothesis is a program which generates a complete prediction of all observations. So there is no specific hypothesis (X OR Y), for the same reason that there is no sequence of numbers which is (list of all primes OR list of all squares).
Note that by “complete prediction of all observations” I don’t mean things like “tomorrow you’ll see a blackbird”, but rather the sense that you get an observation in a MDP or POMDP. If you imagine watching the world through a screen with a given frame rate, every hypothesis has to predict every single pixel of that screen, for each frame.
I don’t know where this is explained properly though. In fact I think a proper explanation, which explains how these idealised “hypotheses” relate to hypotheses in the human sense, would basically need to explain what thinking is and also solve the entire embedded agency agenda. For that reason, I place very little weight on claims linking Solomonoff induction to bounded human or AI reasoning.
Thanks. So “There are no black swans.” is not a valid Solomonoff hypothesis? A hypothesis can’t exclude things, only make positive predictions?
Is a hypothesis allowed to make partial predictions? E.g. predict some pixels or frames and leave others unspecified. If so, then you could “and” together two partial hypotheses and run into a similar math consistency problem, right? But the way you said it sounds like a valid hypothesis may be required to predict absolutely everything, which would prevent conjoining two hypotheses since they’re already both complete and nothing more could be added.
Internally, the algorithm could work by ruling things out (“There are no black swans, so the world can’t be X”), but it must still completely specify everything. This may be clearer once you have the answer your question, “What counts as a hypothesis for Solomonoff induction?”: a halting program for some universal Turing machine. And the possible worlds are (in correspondence with) the elements of the space of possible outputs of that machine. So every “hypothesis” pins down everything exactly.
You may have also read some stuff about the Minimum Message Length formalization of Occam’s razor, and it may be affecting your intuitions. In this formalization, it’s more natural to use logical operations for part of your message. That is, you could say something like “It’s the list of all primes OR the list of all squares. Compressed data: first number is zero”. Here, we’ve used a logical operation on the statement of the model, but it’s made our lossless compression of the data longer. This is a meaningful thing to do in this formalization (whereas it’s not really in Solomonoff induction), but the thing we ended up with is definitely not the message with the shortest length. That means it doesn’t affect the prior because that’s all about the minimum message length.
“That is, you could say something like “It’s the list of all primes OR the list of all squares. Compressed data: first number is zero”″
Just to clarify here (because it took me a couple of seconds): you only need the first number of the compressed data because that is sufficient to distinguish whether you have a list of primes or a list of squares. But as Pongo said, you could describe that same list in a much more compressed way by skipping the irrelevant half of the OR statement.
I went through the maths in OP and it seems to check out. I think the core inconsistency is that Solomonoff Induction implies l(X∪Y)=l(X) which is obviously wrong. I’m going to redo the maths below (breaking it down step-by-step more). curi has 2l(X)=l(X) which is the same inconsistency given his substitution. I’m not sure we can make that substitution but I also don’t think we need to.
Let X and Y be independent hypotheses for Solomonoff induction.
According to the prior, the non-normalized probability of X (and similarly for Y) is: (1)
P(X)=12l(X)
what is the probability of X∪Y? (2)
P(X∪Y)=P(X)+P(Y)−P(X∩Y)=12l(X)+12l(Y)−12l(X)⋅12l(Y)=12l(X)+12l(Y)−12l(X)⋅2l(Y)=12l(X)+12l(Y)−12l(X)+l(Y)
However, by Equation (1) we have: (3)
P(X∪Y)=12l(X∪Y)
thus (4)
12l(X∪Y)=12l(X)+12l(Y)−12l(X)+l(Y)
This must hold for any and all X and Y.
curi considers the case where X and Y are the same length, starting with Equation (4), we get (5):
12l(X∪Y)=12l(X)+12l(Y)−12l(X)+l(Y)=12l(X)+12l(X)−12l(X)+l(X)=22l(X)−122l(X)=12l(X)−1−122l(X)
but (6)
12l(X)−1≫122l(X)
and (7)
0≈122l(X)
so: (8)
12l(X∪Y)≃12l(X)−1∴l(X∪Y)≃l(X)−1□
curi has slightly different logic and argues l(X∪Y)≃2l(X) which I think is reasonable. His argument means we get l(X)≃2l(X). I don’t think those steps are necessary but they are worth mentioning as a difference. I think Equation (8) is enough.
I was curious about what happens when l(X)≠l(Y). Let’s assume the following: (9)
l(X)<l(Y)∴12l(X)≫12l(Y)
so, from Equation (2): (10)
P(X∪Y)=12l(X)+12l(Y)−12l(X)+l(Y)liml(Y)→∞P(X∪Y)=12l(X)+12l(Y)0−12l(X)+l(Y)0∴P(X∪Y)≃12l(X)
by Equation (3) and Equation (10): (11)
12l(X∪Y)≃12l(X)∴l(X∪Y)≃l(X)⇒l(Y)≃0
but Equation (9) says l(X)<l(Y) --- this contradicts Equation (11).
So there’s an inconsistency regardless of whether l(X)=l(Y) or not.
Details can be found in the excellent textbook by Li and Vitanyi.
In this context, “hypothesis” means a computer program that predicts your past experience and then goes on to make a specific prediction about the future.
“X or Y” is not such a computer program—it’s a logical abstraction about computer programs.
Now, one might take two programs that have the same output, and then construct another program that is sorta like “X or Y” that runs both X and Y and then reports only one of their outputs by some pseudo-random process. In which case it might be important to you to know about how you can construct Solomonoff induction using only the shortest program that produces each unique prediction.
Li and Vitanyi write:
There’s no citation. There’s one Popper book in the references section, LScD, but it doesn’t contain the string “occam” (case insensitive search).
I also searched a whole folder of many Popper books and found nothing mentioning Occam (except it’s mentioned by other people, not Popper, in the Schlipp volumes).
If Popper actually said something about Occam’s razor, I’d like to read it. Any idea what’s going on? This seems like a scholarship problem from Li and Vitanyi. They also dismiss Popper’s solution to the problem of induction as unsatisfactory, with no explanation, argument, cite, etc.
Chapter 7 of LScD is about simplicity, but he does not express there the views that Li and Vitanyi attribute to him. Perhaps he said such things elsewhere, but in LScD he presents his view of simplicity as degree of falsifiability. The main difference I see between Popper and Li-Vitanyi is that Popper did not have the background to look for a mathematical formulation of his ideas.
Try searching “parsimony” maybe? Another way to express Occam.
Which section of the 850 page book contains a clear explanation of this? On initial review they seem to talk about hypotheses, for hundreds of pages, without trying to define them or explain what sorts of things do and do not qualify or how Solomonoff hypotheses do and do not match the common sense meaning of a hypothesis.
I’d rather frame this as good news. The good news is that if you want to learn about Solomonoff induction, the entire first half-and-a-bit of the book is a really excellent resource. It’s like if someone directed you to a mountain of pennies. Yes, you aren’t going to be able to take this mountain of pennies home anytime soon, and that might feel awkward, but it’s not like you’d be materially better off if the mountain was smaller.
If you just want the one-sentence answer, it’s as above—“X or Y” is not a Turing machine. If you want to be able to look the whole edifice over on your own, though, it really will take 200+ pages of work (it took me about 3 months of reading on the train) - starting with prefix-free codes and Kolmogorov complexity, and moving on to sequence prediction and basic Solomonoff induction and the proofs of its nice properties. Then you can get more applied stuff like thinking about how to encode what you actually want to ask in terms of Solomonoff induction, minimum message length prediction and other bounds that hold even if you’re not a hypercomputer, and the universal prior and the proofs that it retains the nice properties of basic Solomonoff induction.
If you are guessing a binary sequence the job of a hypothesis is top say whether the next bit is going to be 1 or 0. A combination where both are fine or equally predicted fails to be a hypothesis. If you have partial predictions of X1XX0X and XX11XX you can “or” them into X1110X. But if you try to combine 01000 and 00010 the result will not be 01010 but something like 0X0X0.
Mathematically one option to handle partial predictions is to model them as groups of concrete “non-open” predictions which are compatible with them. 0X0X0 is short hand for {01010, 00010,01000,00000}. But 0X0X0 is not a hypothesis but a hypothesis schema or some other way of referring to the full set.
Statement that has room of interpretation what symbol it predicts will come next can’t serve as a hypothesis. If we have a rorcharcs ink blot and a procedure to unambigiusly turn them into symbols that might not leave any wiggle room. But if the procedure is unspecified it fails to mean or predict anything for alignment with data checking purposes. You can’t be correct or wrong if you don’t call anything.
Even if you have “inkplot1” and “inkoplot2“ as predictive hypotheses it is uncertain that “inkplo1 or inkplot2” has any procedure to turn into a prediction and even if it could have it might not be deducable for the rules for inkplots without connecting “ors”. if “inkplot1” would predict 0 and “inkplot2” would predict 1 a generic combination will yield a “anything might happen” result which fails to be a prediction at all (ie “I have no idea what the next symbol is going to be”). But having a or rule that combined 010000 and 000010 into 0100010 has to be encoding specific.
Why? If I have two independent actions—flipping a coin and rolling a 6-sided die (d6) - am I not able to combine “the coin lands heads 50% of the time” and “the die lands even (i.e. 2, 4, or 6) 50% of the time”?
This is (very close to) a binary “or”, I roughly agree with you.
This is sort of like a binary “and”. Have the rules changed? And what are they now?
In order for the predcitions to be combatible they must be silent about each other.
If the base cases are H1,H2,H3,H4,H5,H6,T1,T2,T3,T4,T5,T6
then it makes sense to say that HX (H1,H2,H3,H4,H5,H6) is 50% of XX
and that {X2,X4,X6} is 50% of XX
However if the base cases are H,T,1,2,3,4,5,6 then H would be 50% of {H,T} but not of X and {2,4,6} would be 50% of {1,2,3,4,5,6} but not X
The case where “everything works out” would like the OR to output a prediction scope of {H1,H2,H3,H4,H5,H6,T2,T4,T6}. But someone mean could argue that the OR outputs a prediction scope of {H,2,4,6}
If the claims are about separate universes then they are not predictions about the same thing. A heads claim doesn’t concern the dice world so it is not a prediction on it. Predictiontions should be alternative descriptions what happens in the world of concern. So the predictions should have the same amount of holes in them and at the root level all details shoudl be filled out ie 0 holes. An OR operation would need to produce an object that has more holes than the inputs if the inputs speak about the same universe. That is H1,H2,H3,H4,H5,H6,T1,T2,T3,T4,T5,T6 are the base level hypotheses and {H1,H2,H3,H4,H5,H6,T1,T2,T3,T4,T5,T6} and {} are prediction scopes neither which is found among the hypotheses. Applying a prediction OR would push towards the former object. But hypotheses are required to be firm in the details while scopes can be slippery. That is predction scopes {H1} and {H2} can be ored to {H1,H2} but the hypotheses H1 and H2 can’t be ored to produce a hypothesis.
I think you’re seeing the problem of the universal prior. It’s a free variable and depending on how you like to talk about it, you set it normatively, arbitrarily, on faith, or to serve a particular purpose.