Understanding and justifying Solomonoff induction
I’ve been trying to understand the uses and limitations of Solomonoff induction. Following the principle that in order to fully understand something you should explain it to others, here’s an attempt. I prefer to write such things in the form of a dialogue, as that better reflects the thought process.
This is not a very in-depth technical article—for example, cousin_it and Wei Dai have obviously spent more time pondering SI. (I’m not a long-time LW reader, but I have skimmed the existing LW and wiki articles on related topics before posting this.)
Alice. Hi, I’m interested in the question of why and when should I prefer simpler hypotheses. I’ve heard about Occam’s razor and I’ve read about Solomonoff induction and the Universal Prior. Now I’m looking for a philosophical justification of the math. I’d like to have something akin to de Finetti’s justification for probability theory—“if you don’t believe in the axioms, you’re going to be Dutch booked!”.
Bob. You’re welcome. Do you have any problems with the formulations?
A. I’m trying to understand how to connect the informal concept of Occam’s razor with the mathematical formula of the Universal Prior in a meaningful way. Informally, a hypothesis is something that explains the data. Occam’s razor tells us to prefer simpler hypotheses.
B. Well, yes.
A. But the Universal Prior seems to say something even stronger: that all shorter hypotheses are more likely! Clearly that’s not the case: if F and G are hypotheses, then “F OR G” is a hypothesis as well, and it is no less likely than either F or G individually! What’s more, there is a whole class of language and domain pairs where longer hypotheses have the same average probability. For one particular example, consider propositional logic with AND, OR, and NOT operators. Because of the symmetry between conjunction and disjunction, a well-formed 10-element formula is as likely to be a tautology (i.e. always true) as a 100-element formula!
B. You’re confused. In this formalism, you should interpret hypotheses as programs for a universal Turing machine. The machine reads a hypothesis from one of its tapes, executes it, and writes the result of the computation to another tape. The result is the data—the actual string whose prefix we’re observing, and whose remaining part we’re trying to predict. The hypothesis is the input string—the string we’re not observing. In order to predict future output, our best option is to guess which input program the machine is using. Some programs can be ruled out because they don’t correspond to the observed output, but an infinite number of possible programs always remains. The Universal Prior says that shorter programs are more likely. So a hypothesis is just a bunch of Turing machine instructions. It makes sense to speak about the “joint hypothesis of F and G”—a program that does both what F does and what G does. But it makes no sense to speak about “F OR G” in the way you did before.
A. I see, right! Actually I’m not a mathematician or AI designer, I just want to reason about the physical universe. Let’s fix it as the domain of our conversation. It seems to me that hypotheses in some sense correspond to the laws of physics. The output corresponds to the physical universe itself; everything that we can observe by using our senses. But what does the Turing Machine correspond to?
B. You’re looking at it from the wrong angle. The justification of Occam’s razor is an epistemic, not an ontological issue. It’s not essential whether there is or isn’t a specific Turing Machine “out there” computing our universe; it’s essential that our universe is behaving as if it was computable! Let’s quote Scholarpedia: “It is clear that, in a world with computable processes, patterns which result from simple processes are relatively likely, while patterns that can only be produced by very complex processes are relatively unlikely.”
A. But when we’re calculating the complexity of hypotheses, we’re doing it in respect to a particular model of computing!
B. The Church-Turing thesis tells us that the choice of a particular machine is not very important—any universal Turing machine can simulate any other model of computation.
A. But who says it’s a universal machine! It may be anything, really! For example, who says that the universe doesn’t work as a quasi-Turing machine that, every time it outputs a single “0”, outputs a huge number of “1”s with 99% probability? The Universal Prior relies on the relation between the inputs and outputs of the machine; if this relation is changed, the probabilities are going to be wrong.
On the other hand, the physical model may be strictly more powerful than a Universal Turing machine—it may be a hypercomputer! If it is, the universe is uncomputable.
B. Well, Occam’s razor says that the model should be the simplest possible… Ah, I see, appealing to the razor here is a kind of circular reasoning.
At the moment I have another line of thought: consider what happens when we throw in the Simulation Hypothesis. The probability of being in a simulation depends on the computational power available to the simulated universes. If that power is significantly smaller than in the parent universe, the recursive chain of simulations is shorter; if it is close to the parent’s power, the chain is longer. Therefore, the probability of being in a simulation is proportional to the length of the chain. This implies that if we’re living in a simulation, then our universe is almost certainly not significantly less powerful than our “parent” universe. Either both our universe and the alien universe are computable, or they are both hypercomputable (in identical meanings of this word). Since it seems that we cannot create hypercomputers in our universe, it’s reasonable that the aliens cannot do so either. So this is evidence that our universe is computable.
A. Still there is a small probability that uncomputable oracles are present in the physical world, and we have just failed to recognize them. Perhaps we could learn about them in the future and harness their power somehow.
B. Sure—but the point is that we have yet to see any evidence for them. And there’s evidence against them—the fact that our computer simulations match reality very well, as long as we have the computing power they require! We’ve been looking for strange messages in the sky and haven’t found them. We’ve been looking for messages in our cells and haven’t found them.
A. In the worst case we can still justify Occam’s razor for the physical universe by induction on empirical experience, right?
B. Hmm… I thought for a while and now I’ve got a better justification! See, even if the universe itself is uncomputable, there’s still a MYRIAD of processes in it that ARE COMPUTABLE. We know that gravity and electromagnetism do not behave in random ways; they are at least approximately computable. Molecular dynamics is approximately computable. The cell is approximately computable. The nervous system is approximately computable. Evolution should be approximately computable; we can compute some of its basic elements astonishingly well. When Mendel was determining how heredity works by doing his pea-hybridization experiments, he was investigating a discrete, beautifully computable process.
Nature fights against randomness. Pure uncontrolled randomness does not allow complex systems to emerge. Even if the space-time of physics itself is completely continuous and uncomputable (which is far from given), Nature at higher levels favors computable, discrete processes. In a sense, Nature has an infinite number of monkeys—but these monkeys are not sitting at typewriters writing the whole damn thing. Instead, they are sitting at computers, writing input programs. As Seth Lloyd says (source):
Quantum mechanics supplies the universe with “monkeys” in the form of random quantum fluctuations, such as those that seeded the locations of galaxies. The computer into which they type is the universe itself. From a simple initial state, obeying simple physical laws, the universe has systematically processed and amplified the bits of information embodied in those quantum fluctuations. The result of this information processing is the diverse, information-packed universe we see around us: programmed by quanta, physics gave rise first to chemistry and then to life; programmed by mutation and recombination, life gave rise to Shakespeare; programmed by experience and imagination, Shakespeare gave rise to Hamlet. You might say that the difference between a monkey at a typewriter and a monkey at a computer is all the difference in the world.
A. Fine. This really has convinced me that Occam’s razor is in some sense justified. I agree that the majority of the processes in the universe are computable or computationally approximable. I agree that for these processes, the Universal Prior follows. On the other hand, Occam’s razor isn’t justified for the ultimate laws of physics, because our universe might be uncomputable.
B. Well, there might be no such thing as “laws” in the ontological sense. Look, it’s probable that there were an infinite number of inflationary bubbles expanding in the primordial Multiverse. Random regularities generated in a particular subset of them lead to the appearance of laws. In this case the Universal Prior is exactly what is needed to sort out which sets of physical laws are probable, and which are improbable. I think that the existence of this simple physical model is evidence for a simple computational model, and therefore evidence against hypercomputation and the kind of probabilistic quasi-Turing machines you mentioned before.
A. Maybe. However, I still doubt that Occam’s razor is always the best option to use in practice, even if it is theoretically justified.
Solomonoff induction guarantees good results when we are able to observe the prefix of the output data completely and precisely. Real-world reasoning, by contrast, requires working with data that has measurement errors, random noise, missing samples, and so on. Furthermore, we often need not so much a single precise hypothesis as something broader—a type or a class of hypotheses—what machine learning people call “a model”. Indeed, it turns out that machine learning researchers are working precisely on these problems, and they are not terribly excited about Occam’s razor. Let me quote from [Domingos 1999]:
“First [version of Occam’s] razor: Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself.
[…]
Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error.
We believe that it is important to distinguish clearly between these two versions of Occam’s razor. The first one is largely uncontroversial, while the second one, taken literally, is false.
Several theoretical arguments and pieces of empirical evidence have been advanced to support it, but each of these is reviewed below and found wanting.”
B. So what do you suggest?
A. It seems to me that Occam’s razor surely is a good option when we need to explain data from controlled scientific experiments. It’s the best option if we know how to computationally simulate the process and are able to isolate it from the environment. If we don’t know the mechanics of the process, we cannot even approximately calculate the computational complexity of the data. If we cannot isolate the system from the environment (and the system doesn’t have good enough built-in error-correction facilities), then the results of the simulations won’t reproduce real-world data. (For example, we can simulate the heredity of peas. We can get the probability distribution of recessive and dominant traits across pea hybrids in silico, and then get nearly the same results in vivo. On the other hand, there’s little hope that we’ll ever be able to simulate the unfolding of large-scale evolution and replicate its past results in our computers. Even though we understand the mathematical basis of evolution, simulating the environmental impact is beyond our reach.)
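As an aside, the pea example can be checked in a few lines. A minimal sketch (the genotype labels and population size are illustrative choices, not from any real dataset):

```python
import random

def cross(parent1, parent2, n_offspring, rng):
    """Monohybrid cross: each offspring inherits one randomly
    chosen allele from each parent."""
    offspring = []
    for _ in range(n_offspring):
        genotype = rng.choice(parent1) + rng.choice(parent2)
        offspring.append("".join(sorted(genotype)))
    return offspring

rng = random.Random(0)
kids = cross("Aa", "Aa", 100_000, rng)       # Aa x Aa hybrids
dominant = sum(1 for g in kids if "A" in g)  # AA or Aa show the dominant trait
print(dominant / len(kids))                  # close to Mendel's 3:1 ratio, i.e. 0.75
```

The in-silico frequency lands near 0.75, matching the discrete, computable process Mendel observed in vivo.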
B. We also should not forget that there are other motivations behind the razor. When evaluating a theory, its universal a priori probability is not the only thing that counts: theories with lower cognitive complexity are preferable because they are easier to reason about; theories that lead to algorithms with lower time complexity are preferable because they save our processing power. The universal prior is uncomputable, after all; it can only be approximated. So it’s OK to trade off marginal gains in probability—the apparent gain may be an error anyway. It’s perfectly rational to examine and trade off Kolmogorov complexity against other kinds of complexity.
A. So the bottom line is that the set of situations in which Occam’s razor (in the informal sense) is justified is neither a superset nor a subset of the set of situations in which Solomonoff induction is justified. Besides, the question of the nature of the universe remains open—as always...
References:
[Domingos 1999] P. Domingos (1999), “The role of Occam’s razor in knowledge discovery”, Data Mining and Knowledge Discovery, 3(4):409–425. [PDF]
[Scholarpedia 2007] Marcus Hutter et al. (2007), “Algorithmic probability”, Scholarpedia, 2(8):2572. [link]
About the relationship to Occam’s razor: note that you could make a version of SI that gave some long programs more weight than some short programs, and it would still have the same nice properties. SI only needs some probability distribution that sums to 1 over the space of all programs. Since there’s a countable infinity of programs, the weights must converge to zero, but the convergence doesn’t have to be monotonic.
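A toy illustration of that point, assuming a hypothetical enumeration in which two favourite long programs get the first (heaviest) spots:

```python
from fractions import Fraction

# Hypothetical enumeration: the first (heaviest) spots go to two
# favourite long programs, then the short ones follow.
order = [1000, 2000, 0, 1, 2, 3]

def weight(position):
    # Geometric weights 1/2, 1/4, 1/8, ... over list positions;
    # the full infinite sum converges to exactly 1.
    return Fraction(1, 2 ** (position + 1))

weights = {prog: weight(i) for i, prog in enumerate(order)}
# Program 1000 (long) now outweighs program 0 (short), and that's fine:
assert weights[1000] > weights[0]
print(sum(weights.values()))  # 63/64 for these six spots; the tail fills the rest
```

The weights are not monotone in program length, yet they still form a valid (sub-)distribution.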
About the universe being computable: in some games, like the log score game, you don’t need to assume that. It’s enough to say that you are computable and therefore SI will beat you in the long run on any given sequence of observations. That’s true even when the sequence of observations is allowed to be uncomputable (the proof is a simple exercise).
A conjecture of mine, but I don’t know enough to settle it on my own: is it true that for every Solomonoff-like prior, shorter programs almost always have higher probabilities than longer ones? “Almost always” here means “all but a finite number”.
Well, SI itself is uncomputable, so...
I guess that one can prove this is not the case for computable priors.
No, that’s not true. You can take any infinite set of pairs of programs and swap the weights in each pair.
Yeah, all universal priors are uncomputable. For any computable prior, you can build a computable sequence whose probability under that prior is 0. For example, you can ask the prior “which choice of next bit would have probability below 0.6?” at each step.
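A sketch of that diagonalization against a deliberately simple, made-up computable prior (the real construction queries an arbitrary computable prior, not this toy one):

```python
# Toy computable prior: it always says "the next bit is 1 with
# probability 0.7", regardless of history.
def cond_prob_of_one(history):
    return 0.7

def adversarial_bits(n):
    """At each step pick a bit whose conditional probability under the
    prior is below 0.6 (one of the two bits always qualifies, since
    their probabilities sum to 1)."""
    seq, prob = [], 1.0
    for _ in range(n):
        p1 = cond_prob_of_one(seq)
        bit = 1 if p1 < 0.6 else 0
        prob *= p1 if bit == 1 else (1.0 - p1)
        seq.append(bit)
    return seq, prob

seq, prob = adversarial_bits(20)
print(prob)  # roughly 0.3**20: the prior's probability of the sequence shrinks to 0
```

The sequence is computable whenever the prior is, and its probability under that prior decays to zero with length.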
No, but something weaker is true. Take any probability P, e.g. 1/1000. No more than 1/P (e.g. 1000) programs can carry this or higher probability, and they’re all short in the trivial sense of being shorter than some fixed length L, which depends only on P and your formalism. So in a sense, the likely programs are all short, and all the really long programs are really unlikely.
Thanks, I guess my wording wasn’t precise enough.
For every program T, all but a finite number of programs longer than T have lower probabilities.
Let’s say that T has probability P. This means that at most 1/P programs carry this or higher probability. Let’s say that L is the maximum length of those programs, while T has length S. Then at most 2^(L−S) programs are both longer and more probable than T, while all the (infinitely many) others are longer and less probable.
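The counting bound can be checked numerically for geometric-style weights (the 5000-program cutoff is just an arbitrary finite window for the check):

```python
from fractions import Fraction

P = Fraction(1, 1000)
# Geometric universal-prior-style weights for the first 5000 programs.
weights = [Fraction(1, 2 ** (i + 1)) for i in range(5000)]
heavy = [i for i, w in enumerate(weights) if w >= P]
print(len(heavy))  # 9 programs carry weight >= 1/1000, well under the 1/P = 1000 bound
```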
Yeah, that’s correct. Note that it’s also correct if you remove “longer than T”: “For every program T, all but a finite number of other programs have lower probabilities”. Which is obviously true because you can’t have an infinite number of programs whose probability is above some threshold.
Sure, now that you’ve pointed it out I see that my conjecture was trivially true :)
I guess on the same line of thought you can informally deconstruct Occam’s razor:
every finite-complexity state of affairs can be equivalently explained by longer and longer hypotheses;
one of them must be true;
for any explanation, the list of longer explanations is infinite, but it only gets the finite remaining portion of probability mass;
so, apart from a finite number of instances, they must have lower and lower probabilities.
Might be worth a post to informally deduce Occam’s razor.
It’s a well known argument, I learned it from Shalizi’s note. There’s other work trying to justify Occam’s razor when the set of hypotheses is finite, e.g. Kevin Kelly’s work.
Thanks for the very interesting papers.
In Kelly’s page, this
I think it is possible to appeal to simpler principles, modifying some of the points I made above.
Indeed, I think it is possible to non-circularly explain why
Yep, that’s correct. Basically, short programs suffer from there only being a finite supply of them. The actual distribution of probabilities to programs doesn’t matter that much.
That’s why I think SI doesn’t work very well as justification/support for Occam’s Razor.
Because as an explanation it is too simple or because you feel there should be a deeper reason?
Because the prediction itself doesn’t use the shortness of programs in any essential way, and the shortness only matters because of the unavoidable nature of the infinite formalism.
A thought experiment: suppose we’ve determined that the Universe is finite (possible) and worked out an upper bound M on its informational capacity, and we know that programs longer than M will never be run in this universe. We might want to rework SI using a very large finite probability space of programs rather than an infinite one. Given a valid prior distribution over this finite space, SI itself, as a prediction procedure, will continue to work, and its promise of correctness will still be there in an updated fashion. But for SI to work, there will be no reason whatsoever to assign greater probabilities to shorter programs in this large finite space. A uniform distribution will work just as well, or even giving really large weight to the longest programs and systematically biasing against shorter ones. It will be all the same to SI. Clearly it doesn’t have anything to do with Occam’s Razor in this thought experiment. The connection, then, is forced upon us by the mere fact that we’re using an infinite set of finite-length algorithms, and nothing else—not anything about SI: about what we predict, how we do it, etc.
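A minimal sketch of this thought experiment: Bayesian prediction over a tiny, hypothetical finite “program space” with a deliberately anti-Occam prior. The posterior still converges to the true program:

```python
# Hypothetical finite "program space": each program maps a step index
# to the bit it outputs.
programs = {
    "all_zeros":   lambda t: 0,
    "all_ones":    lambda t: 1,
    "alternating": lambda t: t % 2,
}

# Deliberately anti-Occam prior: the true program gets the least weight.
prior = {"all_zeros": 0.1, "all_ones": 0.1, "alternating": 0.8}

def predict_one(posterior, t):
    """Probability that the next bit is 1, mixing over surviving programs."""
    return sum(w for name, w in posterior.items() if programs[name](t) == 1)

def update(posterior, t, observed_bit):
    """Drop programs contradicted by the observation, then renormalize."""
    survivors = {name: w for name, w in posterior.items()
                 if programs[name](t) == observed_bit}
    total = sum(survivors.values())
    return {name: w / total for name, w in survivors.items()}

posterior = dict(prior)
for t in range(3):                  # the true sequence is all zeros
    posterior = update(posterior, t, 0)
print(posterior)                    # the low-prior true program wins anyway
print(predict_one(posterior, 3))    # 0: it now predicts the next bit is 0
```

Nothing in the updating step cares that “all_zeros” is short; it only matters that the prior gave it nonzero weight.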
In this case I feel more inclined to bite the bullet. After all, Solomonoff-as-it-is is an application of the basic requirement of probability: “use all your available information, and nothing else”.
When the totality of what we know doesn’t constrain us to a finite universe, Occam emerges as a trivial fact about allowing infinite explanations.
But if the totality of what we know does indeed produce a finite model, and induction works from any prior, then I say that the razor simply doesn’t apply.
After all, I prefer to derive my epistemology from what works than the other way around.
Sure, if that’s the way you like it, but for me that just doesn’t work. Occam’s Razor is a principle that is supposed to help me think better here and now. To decide that its justification rests on whether, say, the Universe is discrete at the Planck scale or not—when this choice has to be compatible with QM and Newtonian mechanics at larger scales, and therefore doesn’t change anything in practical terms in my life here and now—seems absurd. To me, it’s clear evidence that this is no justification at all.
Ah, the joy of mass downvoting… Sigh.
Yeah, people get quite deeply confused by references to Occam’s razor, “length of programs”, and other such stuff.
If you only consider the programs of length N bits, approximately 2^(N-M) programs will have an output sequence forever identical to that of a program that can be minimally written in M<N bits.
Another confusion is with regard to how Solomonoff induction would learn a program of length ~N from getting N bits of input. No, it wouldn’t—you can write, in a small number of bits, a program that outputs BB(20) zeroes and then a string of ones, and until you read over BB(20) input bits you won’t learn a thing.
I’m confused. I thought SI assigns equal probabilities to every program in the infinite space, and then simpler predictions turn out to be more likely because there are many more programs that make simple predictions than there are programs that make complicated predictions.
As I understand it, SI works even better in the finite case than in the infinite case, because you don’t have to worry about infinite sums and whatnot. (Provided the domain of the finite case was something like “All the programs of length N or less” for some finite N.)
Yep, you’re confused. You can’t possibly assign equal probability to every program in an infinite space, because then the sum of all probabilities will diverge (go to infinity), and you need it to sum to 1. What you do is assign probabilities in such a way that the infinite sum converges to 1, for example by lining all programs up in an infinite list: M1, M2, M3… and assigning probabilities 1/2, 1/4, 1/8, 1/16, 1/32 and so on to infinity. That’s essentially what the universal prior does, using a length-sorted infinite list of all possible programs (I’m skipping a few small technical difficulties).
The simpler predictions are typically more likely because of the one shortest program that generates them, not the many longer programs that also do—they do help, but because of the exponential decay they usually don’t amount to much. The shortest program tends to dominate.
Except I just lied a little in what I said. SI prediction doesn’t care if your program is short or long. It cares how early it comes on the infinite list M1,M2,M3… of all programs. When you arrange this list by program length, “earliest in the list” and “shortest” are basically the same. And in the usual infinite case, in a sense you cannot NOT arrange the list by program length. You can give your favourite very long programs some of the first spots on the list, and so high probabilities, but there’s an infinity of long programs out there, and you can’t nudge the infinity to where you want to go. But in the finite thought experiment, you’re free to arrange your list any way you want (and you don’t even need to have probabilities go down exponentially either). So you can just piss on Occam’s Razor and strictly reverse it by having longer programs always be more likely than shorter ones, if you feel like it. And SI prediction will still work.
That’s what I used to think about how SI worked. Then I read this explanation which seems to make sense to me, and seems to justify my view of things.
Specifically, this explanation shows how Solomonoff Induction doesn’t need to assume a prior that some hypotheses are more likely than others; it can weight them all equally, and then prove that simpler hypotheses have more “copies” and thus simpler predictions are more likely.
In the infinite case there is still the problem of how to get an infinite number of equally-weighted hypotheses to sum to probability 1. But this is what Measure Theory does, I believe. (I’m not a mathematician, but this is what I’ve read and been told.) So it isn’t a problem, any more than math is a problem.
But if the space was finite, as I said, then Solomonoff Induction wouldn’t even have that problem!
So, I no longer think I’m confused. I think that your understanding of SI portrays it as more arbitrary than it actually is. SI isn’t just a weighting by simplicity! It is a proof that weighting by simplicity is justified, given certain premises! (namely, that the space of hypotheses is the space of computable programs in a certain language, and that they are all equally likely, modulo evidence.)
I am a mathematician, and yes measure theory can do this, but it will require you to make many (in fact infinitely many) arbitrary choices along the way.
Huh. Okay, thanks for the info. This is troubling, because I have long held out hope that SI would not turn out to be arbitrary. Could you direct me to where I can learn more about this arbitrariness in measure theory?
You can say that the input tape (from which the program gets loaded) is an infinite sequence of random bits. A ‘program of length l’ is a piece of code that sets the machine up so that after l bits, subsequent bits do not matter. Every short program is thus also a set of longer programs.
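A quick Monte Carlo check of that picture: under fair coinflips on the input tape, a fixed l-bit program is read with probability 2^-l (the 4-bit string here is an arbitrary stand-in for a program):

```python
import random

rng = random.Random(42)
program = "0110"            # an arbitrary stand-in for a 4-bit program
trials = 200_000
hits = 0
for _ in range(trials):
    # Draw just the first 4 bits of a fair-coinflip input tape; the
    # rest of the infinite tape never matters for this program.
    prefix = "".join(rng.choice("01") for _ in range(len(program)))
    hits += (prefix == program)
freq = hits / trials
print(freq)                 # close to 2**-4 = 0.0625
```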
You can. But is there any reason to think that this models well the inherent complexity of programs? Do we ever execute programs by choosing randomly bits that constitute them? We have algorithms that utilize randomness, to be sure, but UTMs are not, normally, such algorithms. I appreciate that “choosing program bits randomly” is a simple, short, succinct way of getting to 2^-length(p), but I don’t see a reason to think it’s natural in any interesting sense.
Well, indeed. There’s nothing natural about that whole silly notion. The laws of physics in our universe are highly symmetric (and the fundamental laws are even time-reversible, when mirrored and charge-flipped); what works for guessing the laws of physics is to assume symmetries, something that S.I. most dramatically does not do.
Furthermore, incorrect theories in S.I. can take a very long time to rule out, because it is very easy to set up—in very few bits—a long-running busy beaver that changes the output after many bits have been output. For an S.I.-driven agent, there are various arbitrary doomsdays looming with probabilities greater than 1/1024 (extra length less than 10 bits).
Furthermore, in the total absence of external input, an ideal inductor must assume it is not within a universe where the laws of physics do not support universal computation. This rules out the vast majority of the input tapes, which are given non-zero prior in S.I.
Do you think a better agent should assign low probabilities to such arbitrary doomsdays in the first place? What would be a good general rule for that?
Having thought about it some more, I feel that a good induction method would start with p_doomsday(time) being a smooth function, and it would have to acquire great many bits of information-theoretic data before that function would grow very sharp and specific peaks.
Meanwhile, S.I. starts with an enormous set of weird preconceptions due to the use of some one-dimensional Turing machine, and consequently produces really, really bad priors. The badness of said priors is somewhat masked by optimality proofs that are very easy for laymen to misinterpret.
I think that if you plot a sane agent’s probability of doomsday over time, it won’t be some really weirdly shaped curve with incredibly sharp (Planck-time-sharp) peaks at various points highly specific to the internal details of the agent. Really, it’s as if someone with a weird number-fear synesthesia looked at the list of years since the alleged birth of Christ, picked the scariest numbers, and then prophesied doomsday in those years. It is clearly completely insane.
Just checking—you do know the formulation of SI that uses a universal prefix Turing machine provided with fair coinflips on the input tape, right?
I don’t understand the question. I thought that’s what we were talking about. Am I missing something?
To be more explicit: setting up a UTM with random bits on the input tape is a natural-seeming way of getting the probability distribution over programs (2^-length(p)) that goes into the Solomonoff prior. But as I’m trying to say in the comment you replied to, I don’t think it’s really natural at all. And of course SI doesn’t need this particular distribution in order to be effective at its job.
Yeah, sorry. It’s me who was missing something =)
Only if you insist on assigning non-zero probability to each individual program.
Is that responding to the “You can’t possibly assign equal probability to every program in an infinite space” part of the parent comment? If you assign equal probability to every program in an infinite space, then (assuming “probability” refers to a real number, and not some other mathematical object such as an infinitesimal) either that equal probability is non-zero, in which case the sum is infinite, or the equal probability is zero, in which case the sum is zero. Remember, if all the probabilities are equal, then either all of them are equal to zero, or none are, and it’s generally agreed that an infinite number of zeros add up to zero.
Actually 0·∞ is considered an indeterminate form. For example the interval from 0 to 1 has length 1 even though it’s composed of infinitely many points each of which has length 0.
See my reply to asr for links to more technical explanations.
If you assign an equal probability to every program, and that probability is zero, then the sum—the total probability—is zero. If you assign an equal nonzero probability, the sum is infinite.
If you want to have a valid probability distribution, you can’t assign equal probabilities.
Well, if you’re willing to relax the third axiom of probability to finite additivity, you can. Alternatively, use an uncountable state space, e.g. the space of all ultrafilters on Turing machines. Also, while we’re on the subject, there are numerous types of quasi-probability distributions worth considering.
Ah, fair enough. I had been assuming only real probabilities.
See this comment for a justification of why shorter programs are more likely.
The argument is taken from this paper.
Err, no, I wouldn’t call it a justification. It’s more like a technical setup of why shorter programs turn out to be more likely (because the padding happens to include them in this way). This is not a profound thing. It’s just one possible way to get where you want, which is to assign 2^-length(p) to p. Note that it’s brittle in the sense that you need a support from your formalism for that. You need your notion of what is a program to allow adding arbitrary bits to the end without affecting its validity or what it does. That’s not a very interesting thing to do: why would you even consider all those programs with dummy padding as real programs, given that whatever’s running them can probably easily rule them out? E.g. modern compilers will usually reject your program if you add some junk at the end.
If you restrict yourself to a finite program length, you could still only count minimal programs, those w/o padding, and give them appropriate weights (the way they do in the paper immediately afterwards for the infinite case, the standard way to set it up) and end up with the same thing. There’s no profound reason I can think of to say “oh, for fixed bound L on program length we’re just only going to look at programs of length exactly L, including padded programs, but for unrestricted length we’re throwing out padded programs and only looking at minimal ones”. The real reason is that you want to sum up to 1, and you want to prefer short programs while doing so.
Now, something closer to a justification is when they point out right afterwards that 2^-length(p) is what you get if you choose program bits randomly one by one. But as justifications go, I think it’s a pretty weak one, because no one really computes anything in any way that looks like that. And more importantly, as I said above, it just doesn’t matter for SI prediction. You don’t have to use 2^-length(p) for SI prediction to work. You just need it all to sum to 1.
You kind of missed my point. I provided it as a technical justification why the exact formula 2^-length(p) makes sense. As you said, “It’s one possible way to get where you want”. I know that 2^-length(p) is not the only possible option, but it’s the most natural one given that we know nothing about the input.
The model of computation is a “Universal Turing machine provided with completely random noise” (i.e. fair coin flips on the input tape). It’s a mathematical formalism; it has nothing to do with modern compilers.
I didn’t miss your point; instead, I’ve tried to explain your point to you, and perhaps didn’t do a really good job. You’re confusing two different things which are both spelled out in the paper you referenced on page 1120, but in the comment you linked me to, you’ve only recounted the first one, perhaps thinking that the second one is the same thing, which it is not.
The first thing is setting up probabilities over programs of a given fixed length L, so that padding helps us take into account also shorter programs. In the paper, this is the paragraph starting “This sum is still only over programs of length ℓ...”. This is what you talked about in your other comment, and this is what I talked about in my first paragraph of the reply above.
The second thing is a way of setting up probabilities over programs of any length without bound, using a UTM-with-random-noise model as justification for the formula. This is an inherently different way of assigning probabilities, where in particular we do not use padding, but on the contrary only take into account minimal programs. This is what the paper talks about in the next two paragraphs, the second of which, starting “This is a highly technical explanation...” explains the UTM-with-random-noise. I talk about this second thing in the third paragraph of my reply above.
Hopefully I made it clearer. If you still think I missed your point, let’s leave it at that.
+1. Hadn’t thought of that.
Unless I have access to the right kind of uncomputable oracle. Which seems to be not totally impossible if we’re living in an uncomputable universe.
If you’re computable and interacting with a world that contains an uncomputable oracle, and SI is in the same position, I don’t really see what advantage you have. After all, your algorithm can be found somewhere within SI’s mixture. It depends on the game setup though. We discussed some setups here.
What do you mean, precisely, by “SI is in the same position”?
If I understand correctly, SI can be “upgraded” by changing the underlying prior.
So if we have strong reasons to suspect that we, humans, have access to a halting oracle, then we should try to build AI reasoning that approximates SI + a prior defined over a universal Turing machine enhanced with a halting oracle.
If we don’t (as is the case at the moment), then we just build AI that approximates SI + the Universal prior.
Either way, there are no guessing games where we, on average, are able to beat the AI, as long as the approximation is good enough.
I mean that the computable part of you is facing the decision problem “which buttons on the oracle black box do I press, and how do I use the answers on the screen when choosing my next action?” and SI is facing the same decision problem. (Unless of course you want to define the problem so that only you can press buttons on the black box, and the SI agent by definition can’t. That would seem unfair to me, though.)
Given the above, in some game setups (though probably not all) I would expect SI to beat you, just like it beats you in the log-score game even when the input is uncomputable and comes from a level 100 oracle or whatever. The relevant fact is not whether the input is computable, but that SI is pretty much a multiplicative weights mixture of all possible computable “experts” which includes you, and there are many situations where a multiplicative weights mixture provably beats any expert on any input, up to a constant etc. etc.
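The “multiplicative weights mixture beats any expert up to a constant” claim can be illustrated with a toy Bayes mixture. Everything below is an invented sketch, not SI itself: the “experts” are just constant-rate predictors of the next bit, and the guarantee shown is that the mixture’s cumulative log-loss exceeds any single expert’s by at most log2 of the reciprocal of that expert’s prior weight.

```python
import math

def run_mixture(thetas, weights, bits):
    """Sequential Bayes mixture of constant-rate experts over a bit string.
    Expert i always predicts P(next bit = 1) = thetas[i]. Returns the
    mixture's cumulative log2-loss and each expert's own log2-loss."""
    w = list(weights)
    mix_loss = 0.0
    exp_loss = [0.0] * len(thetas)
    for b in bits:
        # mixture's probability of the observed bit, under current posterior
        p_mix = sum(wi * (t if b else 1.0 - t) for wi, t in zip(w, thetas))
        mix_loss -= math.log2(p_mix)
        # each expert's probability of the bit; posterior update is multiplicative
        for i, t in enumerate(thetas):
            p = t if b else 1.0 - t
            exp_loss[i] -= math.log2(p)
            w[i] *= p
        total = sum(w)
        w = [wi / total for wi in w]
    return mix_loss, exp_loss

thetas = [0.2, 0.5, 0.8]                  # three made-up experts
bits = [1, 0, 1, 1, 1, 0, 1, 1, 1, 1]     # an arbitrary input sequence
mix_loss, exp_loss = run_mixture(thetas, [1 / 3] * 3, bits)
# Guarantee: mix_loss <= exp_loss[i] + log2(3) for every expert i,
# i.e. the mixture trails even the best expert by at most a constant.
```

The same inequality is what drives SI’s dominance: the true computable predictor sits somewhere in the mixture with some prior weight w, and the mixture’s log-loss can only exceed that predictor’s by log2(1/w), a constant independent of the data.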
I thought it was implied that SI cannot access any boxes or push any buttons, because it’s a mathematical abstraction. But I see that you mean “an AI agent with SI”.
So, the precise statement would be that whatever weight you assign each program, if

f(n) := sum of the weights of all programs of length n
F(n) := sum of f(k) for all k < n

then the limit of F(n) as n goes to infinity must be finite (if the weights are normalized, then the limit must be 1), and thus the limit of f(n) as n goes to infinity must be zero. In fact, by the ratio test, the limit of f(n)/f(n+1) must be greater than or equal to one. Since there are 2^n programs of size n, that means that if

a(n) := average weight of programs of size n

then the limit of a(n)/a(n+1) as n goes to infinity must be greater than or equal to 2.
We are therefore highly constrained as far as the average weight for programs of size n, but less so for the maximum weight. For instance, we could have a weighting system such that for all n, there is some program of size n with weight (.999^n)/10000. We could even have a weighting system such that for all k, there is some n > k such that there is a program of length n with weight (.9999^log(log(log n)))/10
Do I have that correct?
PS I believe that you meant “monotonic”, rather than “monotonous” :)
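One concrete weighting satisfying these constraints can be checked numerically. The choice below (every program of length n gets weight 4^-n) is just an illustration, not the universal prior itself:

```python
# Every length-n program gets weight 4**-n; there are 2**n of them, so
# f(n) = 2**n * 4**-n = 2**-n, and the sum of f(n) over n >= 1 converges to 1.
def f(n):
    return (2 ** n) * (4.0 ** -n)   # total weight at length n

F = sum(f(n) for n in range(1, 60))   # numerically indistinguishable from 1

# Here a(n) = 4**-n, so a(n)/a(n+1) = 4 >= 2 for every n:
ratio = (4.0 ** -5) / (4.0 ** -6)
```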
The limit is >=1 if it exists, but it doesn’t have to exist. And the lim inf can be arbitrarily close to zero.
Ditto.
It seems to me that the word “hypothesis” is generally understood to mean something more general than just a program for a UTM. For instance, we can talk about “the hypothesis that this patient has cancer”. So perhaps we should have a term specific for this concept? Such as “SIH”? And then what you’re saying is that the Universal Prior only applies to SIHs, and not (at least, not directly) to hypotheses such as “this patient has cancer”. Since pretty much all practical hypothesis testing involves hypotheses that are not SIHs, doesn’t this make SI rather limited as far as practical application? What work has been done as far as how to apply it to practical hypothesis testing?
A good point!
However, if we reformulate the hypothesis as “the cause of the patient’s symptoms is cancer”, it can be treated by SI. Reductionism says that “the patient has cancer” can be translated to a statement about physical laws and elementary particles. There are problems with such a reduction, but they are of a practical nature. In order to apply SI, everything must be reduced to the basic terms. So human-identified, macro-scale patterns like “cancer” must be reduced to biochemical patterns, which in turn must be reduced to molecular dynamics, which in turn must be reduced to quantum mechanics. Doing these reductions is not possible in practice due to computational limitations—even if we did know all the laws for reduction. But in theory they are fine.
A problem of more theoretical nature: whatever evidence we get in the real-world is probabilistic. SI supposes that our observations are always 100% correct.
On the other hand, it’s intuitively obvious that if we treat some high-level concepts as irreducible entities, then a form of Solomonoff induction can be applied directly. E.g. it can be used to a priori prefer “the cause of the symptoms A, B and C is cancer” over “there are three unrelated causes a, b, and c of the three symptoms A, B, and C”.
It seems to me, however, that SI is not very useful if there are other reliable methods of determining the probabilities. For example, “the single cause of the two patients’ symptoms is bubonic plague” in the modern world is a hypothesis of low probability even if it is the shortest one, as the empirically determined a priori probability of having bubonic plague is tiny.
I agree that Solomonoff induction isn’t particularly relevant to most practical cases. Solomonoff Induction tries to explain what your prior should be if you didn’t know anything about the world. But in any real context, we actually know a lot about the world.
Medical diagnostics is an apt example. Here, we have especially good priors, since we have accurate medical statistics. We can measure the overall incidence rates of cancer in the population overall, and typically we also have statistics for particular age and sex cohorts. There’s no reason to start from “how easy is it to write a program to describe ‘cancer’ in some fixed language.”
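As a sketch of why the measured base rate dominates here, a plain Bayes update with entirely made-up numbers (the incidence, sensitivity, and false-positive rate below are hypothetical):

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) by Bayes' rule."""
    p_pos = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_pos

# Hypothetical numbers: 0.5% incidence in this cohort, a test with
# 90% sensitivity and a 5% false-positive rate.
p = posterior(0.005, 0.90, 0.05)
# The posterior stays well under 10%: the measured base rate dominates,
# which is information a complexity-based prior simply doesn't have.
```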
Some random thoughts:
Yes, but every other model of universal computation gives the same results, up to a single constant (the cost of simulating one universal model within another).
What you’re describing here is a non-deterministic Turing machine. They have been proven equivalent to deterministic ones.
Yes! But if it is, either we grab ahold of a source of hypercomputation and proceed from there (although ‘”solomonoff induction” & hypercomputation’ gives very few relevant results from Google), or we are never going to know that it is: we might just suspect it from our models becoming longer and longer, but we will never be able to prove it.
Also, as B said correctly, the justification for Solomonoff Induction is epistemic: we are doing the best we can (i.e. using a universal finite computational model), eliminating everything that we know is factually incorrect without throwing away anything that is not explicitly forbidden by the data.
I don’t see how this solves the problem. The other models of universal computation give different results, though they are similar in certain respects. For particular problems, like “Will the gravitational constant change tomorrow?” it doesn’t seem to help.
Consider the hypotheses A = [Laws of physics] and B = [Laws of physics + 10% increase in gravitational constant tomorrow]
Every model of computation that we have used so far makes A simpler than B. But there are infinitely many models that make B simpler than A. Why should we prefer A to B?
Maybe I’m just missing something. Various smart people have said that the relativity of complexity isn’t a problem because of the constant-factor agreement between models/languages. However, various other smart people who have thought about it a lot have said the opposite, so I don’t think I’m missing something obvious.
Maybe the answer has to do with a measure defined over the space of models? This would be nice, but I worry about just pushing the problem up a level.
Well, first thing first: what problem?
My observation was that it doesn’t matter which kind of universal computation we use because, up to a constant, all the models give the same complexity. This means that one universal prior will differ from another by just a finite number of terms.
If A and B are the only explanations surviving, then by all means we shouldn’t prefer A to B.
But the point is, in any situation where Solomonoff is used, you have also:
C = [Laws of physics + 10% increase in gravitational constant tomorrow + 10% decrease in two days]
D = [Laws of physics + 10% increase in gravitational constant tomorrow + 10% decrease in two days + 10% increase in three days]
E = [Laws of physics + 10% increase in gravitational constant tomorrow + 10% decrease in two days + 10% increase in three days + 10% decrease in four days]
and so on and so forth.
Let’s say that to me, in order of probability, E > D > C > B > A, in a perfect anti-Occamian fashion. In any case, A + B + C + D + E will have some total probability x. You still have 1-x to assign to all the other, longer programs: F, G, H, etc.
However small the probability of A, there will be a program (say Z) for which all the other programs longer than Z will have probabilities lower than A. In this case, you have a finite violation of Occam’s razor, but it is forced on you by math that A is to be preferred as an explanation to longer programs (that is, longer than Z).
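The forced finiteness of any anti-Occamian preference can be shown with a toy normalized prior. All the numbers below are invented for illustration: the first five “programs” A..E get deliberately increasing weights, and the rest of the mass decays geometrically so that everything sums to 1.

```python
# Anti-Occamian on a finite prefix: weights for A, B, C, D, E increase...
boost = [0.01, 0.02, 0.04, 0.08, 0.16]          # A, B, C, D, E
x = sum(boost)
# ...but the remaining 1 - x must be spread over the infinitely many
# longer programs F, G, H, ..., here via a geometric tail:
tail = [(1 - x) * 0.5 ** (k + 1) for k in range(200)]
total = x + sum(tail)                            # ~1, as required

# Past some index Z, every later program's weight falls below A's:
w_A = boost[0]
Z = next(k for k in range(200) if all(t < w_A for t in tail[k:]))
```

However you bias a finite prefix toward longer programs, normalization forces the tail under A’s weight eventually; only the location of Z is up for grabs.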
The problem being discussed is the relativity of complexity. So long as anything can be made out to be more complicated than anything else by an appropriate choice of language, it seems that Solomonoff Induction will be arbitrary, and we won’t be justified in thinking that it is accurate.
Yes, one universal prior will differ from another by just a finite number of terms. But there is no upper bound on how large this finite number can be. So we can’t make any claims about how likely specific predictions are, without arbitrarily ruling out infinite sets of languages/models. So the problem remains.
As you say, A is to be preferred to programs longer than Z. But there is no upper bound on how long Z might be. So any particular program—for example, B—is such that we have no reason to say that it is more or less likely than A. So it seems we have failed to find a full justification for why we should prefer A to B.
Unless, as I said, we start talking about the space of all possible languages/models. But as I said, this threatens to just push the problem up a level.
Even in the most-different case you get the same results, just with the worst-case intercomputation cost.
However, the ratio between best and worst case intercomputation costs for different sets of operations might not be trivial, and that would distort things.
You’re right—Solomonoff induction is justified for any model, whenever it is computable. The exact technical details are unimportant. I was a bit confused about this point.
Essentially, Solomonoff induction “works” in the physical universe (i.e. is the best predictor) whenever:
1) there is a source of randomness
2) there are some rules
3) the universe is not hypercomputing.
If there is no source of randomness involved, the process is fully deterministic, and can be best predicted by deductive reasoning.
If there are no rules, the process is fully random. In this case just tossing a fair coin will predict equally well (with P=0.5).
If it’s hypercomputing, a “higher-order” Solomonoff induction will do better.
In the context of Bayesian reasoning, I understand “random” as “not enough information”, which is different from “non-deterministic”.
So that:
Only if we have enough information to exactly compute the next state from the previous ones. When this is not the case, lack of information acts as a source of randomness, for which SI can account.
In a sense, yes. There might still be useful pockets of computability inside the universe, though.
I’m not sure “higher-order” Solomonoff induction is even a thing.
“Higher-order” SI is just SI armed with an upgraded universal prior—one that is defined with reference to a universal hypercomputer instead of a universal Turing machine.
It’s not that simple. There isn’t a single model of hypercomputation, and even inside the same model hypercomputers might have different cardinal powers.
Additional justification: if your data can be perfectly described by a 1000 bit program, you will find that program after at most 1000 + c bits of evidence.
I think you mean BB(1000) bits of evidence?
I was measuring the Kolmogorov complexity of the evidence, but now that you mention it that does make for a bit of circular reasoning.
Ah, I see. Yeah, 1 bit in input bitstream != 1 bit of bayesian evidence.
Assuming you have access to a source of hypercomputation, in which case the universe can’t be described by a turing machine anyway.
I have been thinking that the universal prior is tautological. Given a program, there are an infinite number of longer programs which perform the same computation (or an indistinguishable variation) but only a finite number of shorter programs having this characteristic. If you count every possible program equally, you’ll find that each short program represents a host of longer programs. However, now that I write this down, I’m no longer sure about it. Can someone say why/if it’s wrong?
[EDIT note: This is completely different from what I originally wrote in response to lavalamp’s question, because originally I completely misunderstood it. Sorry.]
You can’t “count every possible program equally”. (What probability will you give each possible program? If it’s positive then your total probability will be infinite. If it’s zero then your total probability will be zero. You can do a lot of probability-like things on a space with infinite total measure, in which case you could give every program equal weight, but that’s not generally what one does.)
Leaving that aside: Your argument suggests that whatever probabilities we give to programs, the resulting probabilities on outcomes will end up favouring outcomes that can be produced by simpler programs. That’s true. But a universal prior is doing something stronger than that: it gives (in a particular way) higher probabilities to simpler programs as well. So outcomes generated by simpler programs will (so to speak) be doubly favoured: once because those simple programs have higher probabilities, and once because there are “more” programs that generate those outcomes.
In fact, any probability assignment with finite total probability (in particular, any with total probability 1, which is of course what we usually require) must “in the limit” give small probabilities to long programs. But a universal prior is much more specific, and says how program length corresponds to probability.
The only programs allowed in the Solomonoff distribution are ones that don’t have any extended versions that produce the same output observed so far. So it’s not that the longer programs are given lower probability—it’s that they are given undefined probability, and are entirely “represented” by the most truncated version.
I learned this by reading Anatoly Vorobey’s forthcoming article!
How do you define an extended version of a program?
I agree with what you’re saying only as long as an “extended version” of a program is the same program, just padded with meaningless bits at the end.
I don’t agree that it’s true regarding longer programs in general. A “longer version” of a program is any program that produces the same output, including programs that use different algorithms, programs that are padded at the beginning, padded in the middle, etc.
To get the mathematical expression of the Universal prior, we count all programs, and here the extended versions come into play. Intuitively, a program padded with a single bit is counted twice (as it may be padded with 0 or 1), a program padded with 2 bits is counted 4 times, etc. So a program with length L+k is 2^k times less likely than a program with length L. That’s where the 2^{-length(p)} probability comes from.
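The counting argument can be sketched in a few lines (the minimal length 5 below is an arbitrary choice): all the length-L paddings of an l-bit minimal program, each weighted 2^-L, jointly carry exactly the mass 2^-l.

```python
def padded_mass(l, L):
    """Total weight carried at length L by all paddings of an l-bit program:
    2**(L-l) possible paddings, each weighted 2**-L."""
    return (2 ** (L - l)) * (2.0 ** -L)

# For every L >= l this collapses to 2**-l, the weight of the minimal program:
masses = [padded_mass(5, L) for L in range(5, 15)]
```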
Did not know that! It seems like that would leave some probability mass unassigned; how do you rebalance? Even if you succeed, it seems likely that (for large enough outputs) there’ll be lots of programs that have epsilon difference—that are basically the same, for all practical purposes.
Normalize!
Solomonoff induction is just defined for binary data. Differences are a minimum of 1 bit, which is enough.
That depends on how one defines “Solomonoff induction”.
According to the Scholarpedia article, it’s not required that the probabilities add up to 1:
∑_x m(x) < 1
It’s simpler to define the Universal Prior in such a way that non-halting programs are “not counted”. So the sum is not over all programs, just the halting ones.
I see, thanks!
I did know this and should have phrased my sentence hypothetically. :)
You can also do probability like things on an infinite space where each finite subset has measure zero.
There are different ways to formalize the same thing. One of the ways seems less confusing than others.
Suppose you have a universal Turing machine with 1 input tape, 1 work tape, and 1 output tape. The machine reads from the input (which works monotonically, i.e. like cin in C++), prints something on the output (also monotonic, like cout in C++) and works on the work tape (initialized with zeroes). The input tape can be set up to load a program onto the work tape, including an emulator for any other such machine.
The prior probability of observing a sequence of bits x is the probability that it gets printed when the input tape is an infinite sequence of random bits. Just that. (The probability that the input sequence starts with n specific bits is 2^-n.)
When we want to find the probability of future observations y after observing x, we look at the probability that y is outputted immediately after x, given that x is outputted.
As for the correspondence between input strings and theories, it’s not very straightforward.
For example, codes which include junk bits whose values do not affect the output are longer than those which do not include any junk bits. Yet, since the junk bits can take any value, it is not correspondingly unlikely that one such code is encountered. Simply put, a theory like f = m·a takes the form of an immense number of inane variants like f = m·a·(0.1+0.9) + 0·(0.2365912354+pi), …, each inane variant individually being improbable, but the sum of the probabilities of such inane theories being not small.