Born rule or universal prior?
You’re about to flip a quantum coin a million times (these days you can even do it on the internet). What’s your estimate of the K-complexity of the resulting string, conditional on everything else you’ve observed in your life so far? The Born rule, combined with the usual counting argument, implies you should say “about 1 million”. The universal prior implies you should say “substantially less than 1 million”. Which will it be?
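(For readers who want the counting argument spelled out, here is a minimal sketch; K is Kolmogorov complexity, and the inequalities hold only up to small additive constants.)

```latex
% There are fewer than 2^{n-k} programs shorter than n-k bits, so at most
% that many strings of length n can have K-complexity below n-k:
\[
  \#\{\, x \in \{0,1\}^n : K(x) < n - k \,\} \;<\; 2^{\,n-k}.
\]
% Under the Born rule the million-bit outcome is uniformly distributed, so
\[
  \Pr\bigl[\, K(x) < n - k \,\bigr] \;<\; 2^{-k}, \qquad n = 10^6,
\]
% and almost every outcome has K-complexity close to a million.
```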
EDIT: Wei Dai’s comment explains why this post is wrong.
This is a tad confused.
The uniform measure is a very simple measure on binary strings, so Solomonoff induction will converge on it with high probability. This is easiest to see from the Solomonoff-Levin definition of the universal prior, where you take a mixture of measures weighted by their complexity: a simple thing like the uniform measure gets a very high prior probability under the universal distribution. That is a separate issue from the sequence of bits itself being complex because the bits are random. The confusing case is when you define the prior the other way, by sampling programs; there it’s not at all obvious that things work out the same. Indeed, I’d say it’s quite surprising that they do.
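As a rough sketch of the Solomonoff-Levin mixture form being described (standard notation, stated loosely and only up to multiplicative constants; here \mathcal{M} is the class of enumerable semimeasures and K(μ) is the length of a shortest program computing μ):

```latex
% Solomonoff-Levin mixture form of the universal prior (sketch):
\[
  M(x) \;=\; \sum_{\mu \in \mathcal{M}} 2^{-K(\mu)}\, \mu(x).
\]
% The uniform measure lambda(x) = 2^{-|x|} is one of the simplest members of
% the mixture, so for every string x,
\[
  M(x) \;\ge\; 2^{-K(\lambda)} \cdot 2^{-|x|},
\]
% i.e. the mixture never assigns a random-looking string much less weight than
% the uniform measure does, even though K(x) of such a string is high.
```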
I’d suggest reading the second chapter of “Machine Super Intelligence”; I think it’s clearer there than in my old master’s thesis, as I do more explaining and fewer proofs.
Can you explain why? What result says that the Solomonoff distribution “as a whole” often converges on the uniform measure?
It’s not clear to me from this description whether the SI predictor is also conditioned. Anyway, if the universal prior is not conditioned, then the convergence is easy as the uniform distribution has very low complexity. If it is conditioned, then you will have no doubt observed many processes well modelled by a uniform distribution over your life—flipping a coin is a good example. So the estimated probability of encountering a uniform distribution in a new situation won’t be all that low.
Indeed, with so much data SI will have built a model of language, and how this maps to mathematics and distributions, and in particular there is a good chance it will have seen a description of quantum mechanics. So if it’s also been provided with information that these will be quantum coin flips, it should predict basically perfectly, including modelling the probability that you’re lying or have simply set up the experiment wrong.
I think what mathemajician means is that if the stream of data is random (in that the bits are independent random variables each with probability 1⁄2 of being 1) then Solomonoff induction converges on the uniform measure with high probability (probability 1, in fact).
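For reference, the quantitative version of this convergence (the standard Solomonoff bound; the constant is quoted from memory, so treat it as indicative):

```latex
% For any computable measure mu generating the data, the universal mixture M
% converges to mu's conditional predictions with mu-probability 1;
% quantitatively, the expected squared differences are summable:
\[
  \sum_{t=1}^{\infty}
  \mathbb{E}_{\mu}\!\left[
    \bigl( M(x_t = 1 \mid x_{<t}) - \mu(x_t = 1 \mid x_{<t}) \bigr)^{2}
  \right]
  \;\le\; \frac{\ln 2}{2}\, K(\mu).
\]
% With mu the uniform measure (small K(mu)), M's conditional probabilities
% approach 1/2 along almost every random sequence.
```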
I’m sure you knew that already, but you don’t seem to realize that it undercuts the logic behind the claim in your post.
Even leaving out the whole “experimentally verified” thing, parts of a system can have longer description lengths than the whole system, because picking out the part breaks symmetries that the whole respects. Which means that even if the system as a whole has a description length distributed according to the simplicity prior, such a part doesn’t. So it’s not just that you shouldn’t use the simplicity prior on everything; it’s that you can’t.
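A hedged sketch of the standard many-worlds example this seems to gesture at, in K-complexity terms (constants suppressed; “U” is just a label for a short program computing the whole system):

```latex
% The whole: a deterministic program U ("evolve the wave function from simple
% initial conditions") can be very short.
% A part: one particular branch/observer record needs, on top of U, roughly
% n extra bits to single out a specific sequence of n outcomes:
\[
  K(\text{whole}) = K(U) = O(1),
  \qquad
  K(\text{part}) \approx K(U) + n .
\]
% So the whole can be simple while almost every part is algorithmically random,
% which is why the simplicity prior can fit the whole but not the parts.
```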
If you’re saying our observations can be complex even though the universe is simple, you’re essentially taking the first horn of the dilemma in my post. This means you cannot use Solomonoff induction to figure out the correct quantum physics, because Solomonoff induction works on the input stream of bits that it sees, not on the “universe as a whole”.
This is a general problem: most decision theories ignore the fact that their implementation runs on physics, or make nonsensical assumptions like “everything is a program”. But observations are not just parameters of something already known (for the agent’s decisions, and in particular its epistemic decisions, are not known); they themselves mean at least the physical facts that constitute them.
I’m not actually very familiar with Solomonoff induction, so if it can’t handle the observable universe being one possibility of many, pretend I’m advocating something that can :)
I am entirely unconvinced that the universal prior really does imply that you should say “substantially less than 1 million”.
It seems to me that using the universal prior leads to a probability distribution in which most of the probability goes to hypotheses that take quantum physics seriously, in which case we expect that when you flip a quantum coin a million times we end up with (roughly) 2^1000000 versions of you, almost all of whom see bitstrings with K-complexity about 1 million.
The universal prior should probably also still give some probability to hypotheses in which the universe works non-quantum-mechanically and you get a K-complexity well below 1000000. But not very much.
That isn’t true, I think. See my reply to Perplexed.
I think you’re implicitly assuming that the K-complexity of a hypothesis of the form “these n random bits followed by the observations predicted by H” equals n + (K-complexity of H) + O(1). Whereas actually, it’s n + (K-complexity of H) + O(log(n)). (Here, the log(n) is needed to specify how long the sequence of random bits is.)
So if you’ve observed a hugely long sequence of random bits then log(n) is getting quite large and ‘switching universe’ hypotheses get penalized relative to hypotheses that simply extend the random sequence.
This makes intuitive sense—what makes a ‘switching universe’ unparsimonious is the arbitrariness of the moment of switching.
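To make the comparison concrete, as I read it (using this comment’s log(n) length-prefix accounting; the refinement to K(n) comes up just below):

```latex
% "Keep being random": the n observed bits are carried literally, at cost ~ n.
% "Random for n bits, then switch to H": carry the n bits, a description of H,
% and the switching point n itself (~ log n bits with a plain length prefix):
\[
  K(\text{continue-random account}) \;\approx\; n,
  \qquad
  K(\text{switching account}) \;\approx\; n + K(H) + \log n .
\]
% The switching hypothesis pays an extra K(H) + log n, and the log n part keeps
% growing as the random prefix gets longer.
```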
(Btw, I thought it was a fun question to think about, and I’m always glad when this kind of thing gets discussed here.)
ETA: But it gets more complicated if the agent is allowed to use its ‘subjective present moment’ as a primitive term in its theories, because then we really can describe a switching universe with only a constant penalty, as long as the switch happens ‘now’.
You don’t always need log(n) bits to specify n; the K-complexity of n is enough. For example, if n = 3^^^^3, then you can specify n using far fewer bits than log(n). I think this kills your debunking :-)
O(BB^-1) (or whatever it is) is still greater than O(1) though, and (as best I can reconstruct it) your argument relies on there being a constant penalty.
Yeah, kind of, but the situation still worries me. Should you expect the universe to switch away from the Born rule after you’ve observed 3^^^^3 perfectly fine random bits, just because the K-complexity of 3^^^^3 is small?
If you had a mysterious K-complexity-measuring black-box wired to the output—and the output consistently pushed it to the max—I figure the universal prior would predict that it would stay that way in the immediate future.
Solomonoff induction does need to be modified here to support multiple observers, but the modification is minor; we just need to allow a universe to specify a probability distribution over observers (this is basically the SSA; different modifications are needed for different anthropic theories). This results in the probability of each individual bitstring including a factor of 2^-1000000, but the summed probability, i.e. the probability that “QM is (approximately) true”, is much greater than the probability of any alternate explanation. Most of that probability mass comes from algorithmically random strings, so that is what you should expect.
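Roughly, the arithmetic behind “the summed probability is much greater”, treating QM as a single hypothesis H_QM with weight about 2^-K(H_QM) in this modified prior (all constants loose):

```latex
% Each specific million-bit outcome string x gets
\[
  P(x) \;\approx\; 2^{-K(H_{\mathrm{QM}})} \cdot 2^{-10^{6}},
\]
% but summing over the roughly 2^{10^6} outcome strings that QM allows,
\[
  P(\text{QM-like data}) \;\approx\;
  2^{-K(H_{\mathrm{QM}})} \cdot 2^{10^{6}} \cdot 2^{-10^{6}}
  \;=\; 2^{-K(H_{\mathrm{QM}})},
\]
% which dwarfs the weight of any single alternative predicting one particular
% low-complexity string. Since almost all of those strings are algorithmically
% random, a random-looking string is what you should expect to see.
```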
A contradiction between those two well-established scientific principles would be a very interesting thing, to put it mildly.
I can’t yet see whether it’s an actual paradox or not.
Why do you say this? Are you merely suggesting that my prior experience with quantum coins is << a million tosses?
No, I’m not suggesting that. I think the statement stays true even if you’ve already seen 100 million quantum coinflips and they looked “fair”. The universal prior still thinks that switching to a more ordered generator for the next million coinflips is more likely than continuing with the random generator, because at that point the algorithmic complexity of preceding coinflips is already “sunk” anyway, and the algorithmic complexity of switching universes is just a small constant.
ETA: after doing some formal calculations I’m no longer so sure of this. Halp.
I’m confused. Assuming that I “believe in” the validity of what I have been told of quantum mechanics, I fully expect that a million quantum coin tosses will generate an incompressible string. Are you suggesting that I cannot simultaneously believe in the validity of QM and also believe in the efficacy of Solomonoff induction—when applied to data which is “best explained” as causally random?
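(A crude empirical stand-in for that expectation, if it helps intuition: a general-purpose compressor only upper-bounds K-complexity, and os.urandom is a pseudorandom stand-in for a quantum source, so this is an illustration, not a proof.)

```python
import os
import zlib

# One million "coin flips", packed into 125,000 bytes.
n_bits = 1_000_000
flips = os.urandom(n_bits // 8)

compressed = zlib.compress(flips, 9)  # maximum compression effort
print(f"raw: {len(flips)} bytes, compressed: {len(compressed)} bytes")

# Typical result: the "compressed" output is slightly LONGER than the input,
# because almost all strings of a given length are incompressible -- the
# counting argument from the post, observed in the wild.
```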
Off the top of my head, I am inclined to agree with this suggestion, which in turn suggests that SI is flawed. We need a variant of SI which allows Douglas_Knight’s simple fair coins, without thereby offering a simple explanation of everything. Or we need to discard the whole SI concept as inappropriate in our non-deterministic universe.
The randomness of a source of information is not an empirical fact which we can discover and test; rather, it is an assumption that we impose upon our model of the data. It is a null hypothesis for which we cannot find Bayesian evidence; we can at best fail to reject it. (I hope the Popper-clippers don’t hear me say that!) Maybe what our revised SI should be looking for is the simplest explanation for data D[0] through D[n] which is not refuted by data D[n+1] through D[n+k].
ETA: Whoops. That suggestion doesn’t work. The simplest such explanation will always be that everything is random.
Possibly ironically relevant. Eliezer quoting Robyn Dawes:
I don’t think the universe shows any signs of being non-deterministic. The laws of physics as we understand them (e.g. the wave equation) are deterministic. So, Solomonoff induction is not broken.
Hmmm. Would you be happier if I changed my last line to read “… we need to discard the whole SI concept as inappropriate to our imperfectly-observed universe”?
I don’t think so. Solomonoff induction applies to streams. The most common application is to streams of sense data. There is no pretense of somehow observing the whole of the universe in the first place.
You are correct that my comments are missing the mark. Still, there is a sense in which the kinds of non-determinism represented by Born probabilities present problems for SI. I agree that SI definitely does not pretend to generate its predictions based on observation of the whole universe. And it does not pretend to predict everything about the universe. But it does seem to pretend that it is doing something better than making predictions that apply to only one of many randomly selected “worlds”.
Can anyone else (Cousin_it perhaps) explain why deterministic evolution of the wave function seems to be insufficient to place SI on solid ground?
They would represent problems for determinism, if they were “real” probabilities. However, the idea around here is that probabilities are in the mind.
Here is E T Jaynes on the topic:
[*] Translation: “A fundamental uncertainty, not only obscurity”
Quantum physics doesn’t say they’re random per se. It says that every sequence happens, and you only observe one of them.
I don’t know what Legg says in your link, but there are variants of the universal prior in which “the coin is fair” has small complexity, because nature has access to random numbers.
You can’t have a variant of the universal prior that makes all incoming bitstrings equiprobable regardless of their K-complexity, because that would be the uniform prior.
“Nature has access to random bits” is a very different claim than “nature outputs the uniform distribution.”
Many versions of Solomonoff induction, including, I believe, the original, predict that if so far the even bits of the output are all 0 and the odd bits have full complexity, that description will continue to be true in the future.
I’m having trouble figuring out a proof for your last claim… But then again, maybe I’m just being stupid because two other people have tried to explain it to me and I didn’t understand their attempts either :-(
These two pages from the 1997 edition of “An Introduction to Kolmogorov Complexity and Its Applications” seem relevant.
Coming back to this some years later, I’m not sure you and Douglas_Knight are right. The result holds only in the limit: if I’ve already seen a million uniform bits, then the next ten bits are also likely to be uniform. But the next million bits are expected to raise K-complexity by much less than a million. So to rescue the argument in the post, I just need to flip the coin a relatively small number of times (comparable with the information content of my brain) and then spend some time trying to find short programs to reproduce the result given the past. If I fail, which seems likely, that’s evidence that we aren’t living in the universal prior.
I think you’re right that the result I cited doesn’t actually resolve your question. I looked around a bit more and found Theorem 4 in this paper, which looks more relevant. If you’re still not convinced after looking at that, can you explain why you think “the next million bits are expected to raise K-complexity by much less than a million”?
ETA: On second thought this result doesn’t help either, because if μ is the uniform measure then the bound you get with Theorem 4 after observing a million random bits isn’t any better than what you get at the start before you observed anything. To take another tack, consider your statement:
If failing makes you update away from the universal prior, what are you updating towards? Suppose you update towards the uniform measure. But that means the actual prior you had before updating was some mixture of the universal prior and the uniform measure, and such a mixture is itself a universal prior, so you can just consider yourself to have updated within a universal prior instead.
I thought about it some more and you’re right, my argument doesn’t work. I was imagining the universal prior as a mixture of deterministic programs printing infinite strings, which is wrong. Even for something as simple as a uniform prior, the only way to get it as a mixture of programs is by using programs that print finite strings and letting the semimeasure renormalization do its magic (allowing longer programs to “inherit” the weight of shorter ones that terminate early). That’s how it can beat the mixture of all programs that print infinite strings, which can’t “inherit” each other’s weight in the same way.
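For reference, one standard way to write the object being described, stated loosely (U is a monotone universal machine, and “U(p) = x*” means program p makes U output something that begins with x):

```latex
% The universal semimeasure sums over all minimal programs whose output
% starts with x:
\[
  M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|},
  \qquad
  M(x0) + M(x1) \;\le\; M(x).
\]
% The slack in the inequality comes from programs that print x and then halt
% (or never output another bit); renormalizing that slack is the step that lets
% longer programs "inherit" the weight of shorter ones that stop early. A
% mixture restricted to programs printing infinite strings has no such slack
% to redistribute, which is the difference described above.
```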
Thanks a lot! I’m now convinced that the claim is true, but have no idea why :-) Will try to work through the proof.
Haven’t seen too many of those. Have they even been seriously proposed? This certainly isn’t what people usually mean by “Solomonoff induction”.
Can you explain why you can’t have a language that has some encoding for random bits as well as encodings for deterministic bits? Doesn’t seem like such a language would imply a uniform prior.
Edit: Even though I don’t have the math background to understand this stuff, I love it when it gets discussed here.