Hi, my name is Jason, and this is my first post. I have recently been reading about two subjects here, Calibration and Solomonoff Induction; reading them together has given me the following question:
How well-calibrated would Solomonoff Induction be if it could actually be calculated?
That is to say, suppose one generated priors for a whole bunch of questions based on information complexity measured in bits. If you took all the hypotheses that were assigned a 10% probability, would 10% of them actually turn out to be correct?
I don’t immediately see why Solomonoff Induction should be expected to be well-calibrated. It appears to just be a formalization of Occam’s Razor, which itself is just a rule of thumb. But if it turned out not to be well-calibrated, it would not be a very good “recipe for truth.” What am I missing?
Solomonoff Induction could be well-calibrated across mathematically possible universes. If a hypothesis has a probability of 10%, you should expect it to be true in 10% of those universes.
The important thing is that Solomonoff priors are just a starting point in our reasoning. Then we update on evidence, which is at least as important as having reasonable priors. If it does not seem well calibrated, that is because you can’t get good calibration without using evidence.
Imagine that at this moment you are teleported to another universe with completely different laws of physics… do you expect any other method to work better than Solomonoff Induction? Yes, gradually you get data about the new universe and improve your model. But that’s exactly what you are supposed to do with Solomonoff priors. You wouldn’t predictably get better results by starting from different priors.
It appears to just be a formalization of Occam’s Razor, which itself is just a rule of thumb.
To me it seems that Occam’s Razor is a rule of thumb, and Solomonoff Induction is a mathematical background explaining why the rule of thumb works. (OR: “Choose the simplest hypothesis that fits your data.” Me: “Okay, but why?” SI: “Because it is more likely to be the correct one.”)
But if it turned out not to be well-calibrated, it would not be a very good “recipe for truth.” What am I missing?
You can’t get a good “recipe for truth” without actually looking at the evidence. Solomonoff Induction is the best thing you can do without the evidence (or before you start taking the evidence into account).
Essentially, Solomonoff Induction will help you avoid the following problems:
Getting inconsistent results. For example, if you instead supposed that “if I don’t have any data confirming or rejecting a hypothesis, I will always assume its prior probability is 50%”, then if I give you two new hypotheses X and Y without any data, you are supposed to think that p(X) = 0.5 and p(Y) = 0.5, but also e.g. p(X and Y) = 0.5 (because “X and Y” is also a hypothesis you don’t have any data about). That is inconsistent: “X and not Y” would also get 0.5, yet p(X and Y) + p(X and not Y) must equal p(X) = 0.5.
Giving such an extremely low probability to a reasonable hypothesis that available evidence cannot convince you otherwise. For example, if you assume that the prior probability of X is zero, then with proper updating no evidence can ever convince you of X, because there is always an alternative explanation with a very small but non-zero probability (e.g. the lords of the Matrix are messing with your brain). Even if the value is technically non-zero, it could be as small as 1/10^999999999, so all the evidence you could get within a human lifetime could not make you change your mind.
On the other hand, some hypotheses do deserve very low prior probability, because reasoning like “any hypothesis, however unlikely, has prior probability at least 0.01” can be exploited by a) Pascal’s mugging, b) constructing multiple mutually exclusive hypotheses which together have arbitrarily high total probability (e.g. “AAA is the god of this world and I am his prophet”, “AAB is the god of this world and I am his prophet”… “ZZZ is the god of this world and I am his prophet”). A toy numerical sketch of this second failure mode follows below.
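To put rough numbers on that last exploit (a toy sketch of my own; the bit-lengths are invented, and a real Solomonoff prior would use program lengths on a universal machine):

```python
# Toy numbers: why "every hypothesis gets at least 0.01" breaks down, while a
# 2^(-description length) prior stays consistent.

def floor_prior(num_hypotheses, floor=0.01):
    # Rule under attack: any hypothesis, however unlikely, gets at least `floor`.
    return [floor] * num_hypotheses

def simplicity_prior(bit_lengths):
    # Solomonoff-style rule: weight each hypothesis by 2^(-description length in bits).
    return [2.0 ** -n for n in bit_lengths]

# 26^3 = 17,576 mutually exclusive hypotheses: "AAA is the god of this world", ..., "ZZZ ...".
n = 26 ** 3

print(sum(floor_prior(n)))              # ~175.76 -- "probabilities" summing far past 1

# Under the simplicity prior, just naming one of the 17,576 variants costs about
# log2(17576) ~ 14.1 extra bits, so each variant is individually penalized and the
# total stays bounded (Kraft's inequality for prefix-free codes guarantees this in general).
print(sum(simplicity_prior([15] * n)))  # ~0.54 -- bounded, no matter how many variants exist
```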
Thank you for your reply. It does clear up some of the virtues of SI, especially when used to generate priors absent any evidence. However, as I understand it, SI does take into account evidence—one removes all the possibilities incompatible with the evidence, then renormalizes the probabilities of the remaining possibilities. Right?
If so, one could still ask—after taking account of all available evidence—is SI then well-calibrated? (At some point it should be well-calibrated, right? More calibrated than human beings. Otherwise, how is it useful? Or why should we use it for induction?)
Essentially the theory seems to predict that possible (evidence-compatible) events or states in the universe will occur in exact or fairly exact proportion to their relative complexities as measured in bits. Possibly over-simplifying, this suggests that if I am predicting between 2 (evidence-compatible) possibilities, and one is twice as information-complex as the other, then it should actually occur 1⁄3 of the time. Is there any evidence that this is actually true?
(I can see immediately that one would have to control for the number of possible “paths” or universe-states or whatever you want to call it that could lead to each event, in order for the outcome to be directly proportional to the information-complexity. I am ignoring this because the inability to compute this appears to be the reason SI as a whole cannot be computed.)
You suggest above that SI explains why Occam’s razor works. I could offer another possibility—that Occam’s Razor works because it is vague, but that when specified it will not turn out to match how the universe actually works very precisely. Or that Occam’s Razor is useful because it suggests that when generating a Map one should use only as much information about the Territory as is necessary for a certain purpose, thereby allowing one to get maximum usefulness with minimum cognitive load on the user.
I am not arguing for one or the other. Instead I am just asking, here among people knowledgeable about SI—Is there any evidence that outcomes in the universe actually occur with probabilities in proportion to their information-complexity? (A much more precise claim than Occam’s suggestion that in general simpler explanations are preferable.)
Maybe it will not be possible to answer my question until SI can at least be estimated, in order to actually make the comparison?
(Above you refer to “all mathematically possible universes.” I phrased things in terms of probabilities inside a single universe because that is the context in which I observe & make decisions and would like SI to be useful. However I think you could just translate what I have said back into many-worlds language and keep the question intact.)
after taking account of all available evidence—is SI then well-calibrated?
Yes. The prediction error theorem states that as long as the true distribution is computable, the estimate will converge quickly to the true distribution.
However, almost all the work done here comes from the conditioning. The proof uses the fact that for any computable mu, M(x) > 2^(-K(mu)) mu(x). That is, M does not assign a “very” small probability to any possible observation.
The exact prior you pick does not matter very much, as long as it dominates the set of all possible distributions mu in this sense. If you have some other distribution P such that for every mu there is a constant C with P(x) > C mu(x), you get a similar theorem, differing only by the constant in the inequality.
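To spell out why that dominance condition does most of the work (my own gloss of the standard argument, not part of the original comment): taking logarithms of M(x_{1:n}) > 2^(-K(mu)) mu(x_{1:n}) gives

```latex
\ln \mu(x_{1:n}) - \ln M(x_{1:n}) \;\le\; K(\mu)\,\ln 2 \qquad \text{for every } n,
```

so the cumulative log-loss of the mixture M exceeds that of the true distribution mu by at most the fixed constant K(mu) ln 2, no matter how long the sequence gets. A bounded total penalty forces the per-prediction error toward zero, which is the convergence (and eventual calibration) being claimed. For any other dominating prior P with P(x) > C mu(x), the same bound holds with ln(1/C) in place of K(mu) ln 2.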
So I disagree with this:
Essentially the theory seems to predict that possible (evidence-compatible) events or states in the universe will occur in exact or fairly exact proportion to their relative complexities as measured in bits
It’s ok if the prior is not very exact. As long as we don’t overlook any possibilities as a priori super-unlikely when they are not, we can use observations to pin down the exact proportions later.
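As a toy illustration of that (my own sketch; the hypotheses and their description lengths are invented, not anything from the thread): a mixture over coin-bias hypotheses weighted by 2^(-description length) starts out favoring the shortest hypothesis, but conditioning on data quickly moves essentially all the mass onto whichever hypothesis matches the observed frequencies.

```python
import random

random.seed(0)

# Toy hypothesis class: "the coin comes up heads with probability p".
# The description lengths below are invented for illustration; real Solomonoff
# induction would use lengths of programs on some universal machine.
description_bits = {0.1: 9, 0.25: 8, 0.5: 5, 0.75: 8, 0.9: 9}
prior = {p: 2.0 ** -bits for p, bits in description_bits.items()}

true_p = 0.75
flips = [1 if random.random() < true_p else 0 for _ in range(300)]

posterior = dict(prior)
for x in flips:
    for p in posterior:
        posterior[p] *= p if x == 1 else (1 - p)   # Bayes: multiply by likelihood
total = sum(posterior.values())
posterior = {p: w / total for p, w in posterior.items()}

print({p: round(w, 4) for p, w in posterior.items()})
# The prior favored p = 0.5 (the shortest description), but after 300 flips
# essentially all posterior mass sits on p = 0.75: the observations, not the
# prior, end up pinning down the proportions.
```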
However, as I understand it, SI does take into account evidence—one removes all the possibilities incompatible with the evidence, then renormalizes the probabilities of the remaining possibilities. Right?
I am not sure about the terminology. I would call the described process “Solomonoff priors, plus updating”, but I don’t know the official name.
after taking account of all available evidence—is SI then well-calibrated?
I believe the answer is “yes, with enough evidence it is better calibrated than humans”.
How much would “enough evidence” be? Well, you need some to compensate for the fact that humans are already born with some physiology and instincts adapted by evolution to our laws of physics. But this is a finite amount of evidence. All the evidence that humans get should be processed better by the hypothetical “Solomonoff prior plus updating” process. So even if the process started from zero and got the same information as humans, at some moment it should become, and remain, better calibrated.
the theory seems to predict that possible (evidence-compatible) events or states in the universe will occur in exact or fairly exact proportion to their relative complexities as measured in bits [...] if I am predicting between 2 (evidence-compatible) possibilities, and one is twice as information-complex as the other, then it should actually occur 1⁄3 of the time
Let’s suppose that there are two hypotheses H1 and H2, each of them predicting exactly the same events, except that H2 is one bit longer and therefore half as likely as H1. Okay, so there is no evidence to distinguish between them. Whatever happens, we either reject both hypotheses, or we keep their ratio at 1:2.
Is that a problem? In real life, no. We will use the system to predict future events. We will ask about a specific event E, and by definition both H1 and H2 would give the same answer. So why should we care whether the answer was derived from H1, from H2, or from a combination of both? The question will be: “Will it rain tomorrow?” and the answer will be: “No.” That’s all, from outside.
Only if you try to look inside and ask “What was your model of the world that you used for this prediction?” would the machine tell you about H1, H2, and infinitely many other hypotheses. Then you could ask it to use Occam’s razor to choose only the simplest one and display it to you. But internally it could keep all of them (we already suppose it has infinite memory and infinite processing power). Note, if I understand it correctly, that it would actually be impossible, in general, for the machine to tell whether two hypotheses H1 and H2 are evidence-compatible.
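A toy numerical version of that point (my own sketch; the numbers are made up): two hypotheses that assign the same likelihood to every observation keep their prior 2:1 ratio under updating, and the answer “from outside” never depends on how the mass is split between them.

```python
# Toy check: identical-prediction hypotheses keep their prior ratio under Bayes,
# and the combined prediction is unaffected by the split.

def bayes_update(weights, likelihoods):
    # Multiply each hypothesis weight by the likelihood it gave the observation,
    # then renormalize.
    new = {h: w * likelihoods[h] for h, w in weights.items()}
    total = sum(new.values())
    return {h: w / total for h, w in new.items()}

# H2 is one bit longer than H1, so it starts with half the weight; both say that
# rain tomorrow has probability 0.2.
weights = {"H1": 2 / 3, "H2": 1 / 3}
rain_prob = {"H1": 0.2, "H2": 0.2}

# Observe "no rain" three days in a row; both hypotheses assign it likelihood 0.8.
for _ in range(3):
    weights = bayes_update(weights, {h: 1 - rain_prob[h] for h in weights})

print(weights)                                           # still roughly 2:1
print(sum(weights[h] * rain_prob[h] for h in weights))   # 0.2, the answer "from outside"
```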
Is there any evidence that outcomes in the universe actually occur with probablities in proportion to their information-complexity?
They don’t. To get the probabilities about something occurring in our universe, you need to get the information about our universe first. Solomonoff Induction tells you how to do that, in a random universe. After you get enough evidence to understand the universe, only then you start getting good results.
In other words, the laws of our universe don’t say “things are probable according to their information complexity”. Instead they say other things. The problem is… at the beginning, you don’t know the laws of our universe exactly. So how can you learn them?
Imagine yourself living centuries ago. If you knew Solomonoff Induction, it would give you a non-zero probability for quantum physics (and many other things, most of them wrong). A hypothetical machine with infinite power, able to do all the calculations, could in theory derive quantum physics just by receiving the evidence you see. Isn’t that awesome?
I phrased things in terms of probabilities inside a single universe because that is the context in which I observe & make decisions and would like SI to be useful.
Me too. But we still don’t know all the laws of our universe. So in that aspect “what universe do we live in” remains a bit unknown.
However I think you could just translate what I have said back into many-worlds language and keep the question intact.
Careful. There is a difference between the quantum “many worlds”, which are all supposed to follow the same laws of physics, and hypothetical universes with other laws of physics, the Tegmark multiverse.
Again, I agree that we should only care about our laws of physics, and about our branch of “many worlds”. But still we have a problem of not knowing exactly what the laws are, and which branch it is. So we need a method to work with multiple possible laws, and multiple possible branches. With enough updating on our evidence, the probabilities of the other laws and other branches will get close to zero, and the remaining ones will be the most relevant for us.
They don’t. To get the probabilities about something occurring in our universe, you need to get the information about our universe first. Solomonoff Induction tells you how to do that, in a random universe. After you get enough evidence to understand the universe, only then you start getting good results.
Yes, but we already have lots of information about our universe. So, making use of all that, if we could start using SI to, say, predict the weather, would its predictions be well-calibrated? (They should be—modern weather predictions are already well-calibrated, and SI is supposed to be better than how we do things now.) That would require that, of all predictions compatible with currently known info, ALL of them would have to occur in EXACT PROPORTION to their bit-length complexity. Is there any evidence that this is the case?
of all predictions compatible with currently known info, ALL of them would have to occur in EXACT PROPORTION to their bit-length complexity
I admit I am rather confused here, but here is my best guess:
It is not true, in our specific world, that all predictions compatible with the past will occur in exact proportion to their bit-length complexity. Some of them will occur more frequently, some less frequently. The problem is, you don’t know which ones. All of them are compatible with the past, so how could you tell the difference, except by a lucky guess? How could any other model tell the difference, except by a lucky guess? How could you tell which model guessed the difference correctly, except by a lucky guess? So if you want the best result on average, assigning probability according to bit-length complexity is the best you can do.
You quoted me
“the theory seems to predict that possible (evidence-compatible) events or states in the universe will occur in exact or fairly exact proportion to their relative complexities as measured in bits [...] if I am predicting between 2 (evidence-compatible) possibilities, and one is twice as information-complex as the other, then it should actually occur 1⁄3 of the time”
then replied
“Let’s suppose that there are two hypotheses H1 and H2, each of them predicting exactly the same events, except that H2 is one bit longer and therefore half as likely as H1. Okay, so there is no evidence to distinguish between them. Whatever happens, we either reject both hypotheses, or we keep their ratio at 1:2.”
I am afraid I may have stated this unclearly at first. I meant: given two hypotheses that are both compatible with all currently-known evidence, but which predict different outcomes for a future event.
Is there any evidence that outcomes in the universe actually occur with probablities in proportion to their information-complexity?
Yes, and the first piece of evidence is rather trivial. For any given law of physics, chemistry, etc., or basically any model of anything in the universe, I can conjure up an arbitrary number of more and more complicated hypotheses that match the current data, but all or nearly all of which will fail utterly against new data obtained later.
For a very trivial thought experiment / example, we could have an alternate hypothesis which includes all of the current data, with only an instruction to the Turing machine to print this data. Then we could have another which includes all the current data twice, but tells the Turing machine to only print one copy. Necessarily, both of these will fail against new data, because they will only print the old data and halt.
We could conjure up infinitely many variants similar to this which also contain arbitrary amounts of gibberish right after the old data, gibberish which will be unlikely to match the new data (with probability 1/2^n, where n is the length of the new data / gibberish, assuming perfect randomness).
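A quick sanity check of that 1/2^n figure (my own sketch; the bit strings are random, and n = 10 is an arbitrary choice):

```python
import random

random.seed(0)

# Hypotheses of the form "print the old data, then n arbitrary bits": how often does
# an arbitrary continuation happen to match the n new bits we actually observe?
def random_bits(n):
    return [random.randint(0, 1) for _ in range(n)]

n = 10
new_data = random_bits(n)
trials = 200_000
hits = sum(random_bits(n) == new_data for _ in range(trials))

print(hits / trials, 2 ** -n)   # both come out around 0.001, i.e. 1/2^10 = 1/1024
```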
This seems reasonable—it basically makes use of the fact that most statements are wrong, so adding a statement whose truth-value is as yet unknown is likely to make the hypothesis wrong.
However, that’s vague. It supports Occam’s Razor pretty well, but does it also offer good evidence that those likelihoods will manifest in real-world probabilities IN EXACT PROPORTION to the bit-lengths of their inputs? That is a much more precise claim! (For convenience I am ignoring the problem of multiple algorithms where hypotheses have different bit-lengths.)
It supports Occam’s Razor pretty well, but does it also offer good evidence that those likelihoods will manifest in real-world probabilities IN EXACT PROPORTION to the bit-lengths of their inputs?
Nope, and we have no idea where we’d even start on evaluating this precisely because of the various problems relating to different languages. I think this is an active area of research.
It does seem though, by observation and inference (heh, use whatever tools you have), that more efficient languages tend to formulate shorter hypotheses, which hints at this.
There have also been some demonstrations of how well SI works for learning and inferring about a completely unknown environment. I think this was what AIXI was about, though I can’t recall specifics.
Viliam_Bur gives a great run-down of what’s going on. For a more detailed introduction though, see this post explaining Solomonoff Induction, or perhaps you’d prefer to jump straight to this paragraph (Solomonoff’s Lightsaber) that contains an explanation of why shorter (simpler) hypotheses are more likely under Solomonoff Induction.
To make the bridge between that and what Viliam is saying: basically, if we consider all mathematically possible universes, then half the universes will start with a 1, and the other half will start with a 0. Then a quarter will start with 11, another quarter with 10, and so on. Which means that, to reuse the example in the above-linked post, 01001101 (which matches observed data perfectly so far) will appear in 1 out of 256 mathematically-possible universes, and 1000111110111111000111010010100001 (which also matches the data just as perfectly) will only appear in 1 out of 17179869184 mathematically-possible universes.
So if we expect to live in one out of all the mathematically-possible universes, but we have no idea what properties it has (or if you just got warped to a different universe with different laws of physics), which of the two hypotheses do you want? The one that is true more often, in more of the possible universes, because you’re more likely to be in one of those than in one that requires the longer, rarer hypothesis. That’s the basic simplified logic behind it.
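The arithmetic behind those two figures, for anyone who wants to check it (my own snippet; the bit strings are the ones from the linked example):

```python
# A binary string of length n singles out a 2^(-n) fraction of all
# mathematically-possible universes (those that start with exactly that prefix).
short_hypothesis = "01001101"
long_hypothesis = "1000111110111111000111010010100001"

for h in (short_hypothesis, long_hypothesis):
    print(len(h), 2 ** len(h))   # 8 -> 256 and 34 -> 17179869184
```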
Yes, that was the post I read that generated my current line of questioning.
My reply to Viliam_Bur was phrased in terms of probabilities in a single universe, while your post here is in terms of mathematically possible universes. Let me try to rephrase my point to him in many-worlds language. This is not how I originally thought of the question, though, so I may end up a little muddled in translation.
Taking your original example, where half of the Mathematically Possible Universes start with 1, and the other half with 0. It is certainly possible to imagine a hypothetical Actual Multiverse where, nevertheless, there are 5 billion universes with 1, and only 5 universes with 0. Who knows why—maybe there is some overarching multiversal law we are unaware of, or maybe it’s just random. The point is that there is no a priori reason the Multiverse can’t be that way. (It may not even be possible to say that the multiverse probably isn’t that way without using Solomonoff Induction or Occam’s Razor, the very concepts under question.)
If this were the case, and I were somehow universe-hopping, I would over time come to the conclusion that SI was poorly calibrated and stop using it. This, I think, is basically the many-worlds version of my suggestion to Viliam_Bur. As I said to him, I am not arguing for or against SI, I am just asking knowledgeable people if there is any evidence that the probabilities in this universe, or distributions across the multiverse, are actually in proportion to their information-complexities.
Hmm, I think I see what you mean.
Yes, there’s no reason for Solomonoff induction to be well-calibrated in the end, but once we obtain information that most of the universes starting with 0 do not work, that is data against which most of the hypotheses starting with 0 will fail. At this point, brute Solomonoff induction will be obviously inefficient, and we should begin using the heuristic of testing almost only hypotheses starting with 1.
In fact, we’re already doing this: We know for a fact that we live in the subset of universes where the acceleration between two particles is not constant and invariant of distance. So it is known that the simpler hypothesis where gravitational attraction is “0.02c/year times the total mass of the objects” is not more likely than the one where gravitational attraction also depends on distance and angular momentum and other factors, despite the former being much less complex than the latter (or so we presume).
There are still murky depths and open questions, such as (IIRC) how to calculate how “long” (see Kolmogorov complexity) the instructions are.
Because suppose we build two universal Turing machines with different sets of internal instructions.
We run Solomonoff Induction on the first machine, and it turns out that 01110101011110101010101111011 is the simplest possible program that will output “110”, and by analyzing the language and structure of the machine we learn that this corresponds to the hypothesis “2*3”, with the output being “6”. Meanwhile, on the second machine, 1111110 will also output “110”, and by analyzing it we find out that this corresponds to the hypothesis “6”, with the output being “6”.
On the first machine, to encode the hypothesis “6”, we must write 101010101111110110101111111110000000111111110000110, which is much more complex than the earlier “2*3” hypothesis, while on the second machine the “2*3” hypothesis is input as 1010111010101111, which is much longer than the “6” hypothesis.
Which hypothesis, between “2*3” and “6”, is simpler and less complex, based on what we observe from these two different machines? Which one is right? AFAIK, this is still completely unresolved.
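For what it’s worth (my own gloss, not something said in this thread), the standard partial answer is the invariance theorem: for any two universal machines A and B there is a constant c_AB, depending only on the machines and not on the hypothesis x, such that

```latex
K_A(x) \;\le\; K_B(x) + c_{AB} \qquad \text{for all } x,
```

because A can prefix a fixed interpreter for B to any B-program. So the two machines never disagree by more than a constant, which is why the convergence results mentioned above survive the choice of machine; but for tiny hypotheses like “6” versus “2*3” that constant can dominate everything, which is exactly the unresolved machine-dependence being pointed out here.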
Which hypothesis, between “2*3” and “6”, is simpler and less complex, based on what we observe from these two different machines? Which one is right? AFAIK, this is still completely unresolved.
If we’re considering hypotheses across all mathematically possible universes then why not consider hypotheses across all mathematically possible languages/machines as well?
What weight will we assign to the individual languages/machines? Their complexity… according to what? Perhaps we could make a matrix saying how complex a machine A is when simulated by a machine B, and then find the eigenvalues of the matrix?
Must stop… before head explodes...
If we’re considering hypotheses across all mathematically possible universes then why not consider hypotheses across all mathematically possible languages/machines as well?
This is my intuition as well, though I think it has to be restricted to Turing-complete systems. I was under the impression that there was already some active research in this direction, but I’ve never taken the time to look into it too deeply.