However, as I understand it, SI does take into account evidence—one removes all the possibilities incompatible with the evidence, then renormalizes the probabilities of the remaining possibilities. Right?
I am not sure about the terminology. I would call the described process “Solomonoff priors, plus updating”, but I don’t know the official name.
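For concreteness, here is a minimal sketch of that “keep the compatible possibilities, renormalize, predict” loop. Everything in it is a toy assumption of mine: the hypotheses are four hand-written bit-predicting rules with hand-picked description lengths, standing in for the uncomputable enumeration over all programs that actual Solomonoff Induction uses.

```python
from fractions import Fraction

# Toy hypotheses: each is a rule predicting the bit at position n, paired
# with a hand-picked description length in bits (a crude stand-in for the
# length of the shortest program computing the rule).
HYPOTHESES = {
    "all zeros":      (lambda n: 0,                  3),
    "all ones":       (lambda n: 1,                  3),
    "alternate 0101": (lambda n: n % 2,              4),
    "ones from 2 on": (lambda n: 1 if n >= 2 else 0, 6),
}

def posterior(observed_bits):
    """Drop hypotheses incompatible with the evidence, weight the rest by
    2**-length, and renormalize."""
    weights = {}
    for name, (rule, length) in HYPOTHESES.items():
        if all(rule(i) == bit for i, bit in enumerate(observed_bits)):
            weights[name] = Fraction(1, 2 ** length)
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def prob_next_is_one(observed_bits):
    """Mixture prediction: probability that the next bit is 1."""
    n = len(observed_bits)
    return sum(p for name, p in posterior(observed_bits).items()
               if HYPOTHESES[name][0](n) == 1)

print(posterior([0, 0]))         # "all ones" and "alternate 0101" are eliminated
print(prob_next_is_one([0, 0]))  # weighted vote of the survivors: 1/9
```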
after taking account of all available evidence—is SI then well-calibrated?
I believe the answer is “yes, with enough evidence it is better calibrated than humans”.
How much would “enough evidence” be? Well, you need some to compensate for the fact that humans are already born with some physiology and instincts adapted by evolution to our laws of physics. But this is a finite amount of evidence. All the evidence that humans get should be processed better by the hypothetical “Solomonoff prior plus updating” process. So even if the process started from zero and received the same information as humans, at some point it would become, and remain, better calibrated.
the theory seems to predict that possible (evidence-compatible) events or states in the universe will occur in exact or fairly exact proportion to their relative complexities as measured in bits [...] if I am predicting between 2 (evidence-compatible) possibilities, and one is twice as information-complex as the other, then it should actually occur 1⁄3 of the time
Let’s suppose that there are two hypotheses H1 and H2, each of them predicting exactly the same events, except that H2 is one bit longer and therefore half as likely as H1. Okay, so there is no evidence to distinguish between them. Whatever happens, we either reject both hypotheses, or we keep their ratio at 1:2.
Is that a problem? In real life, no. We will use the system to predict future events. We will ask about a specific event E, and by definition both H1 and H2 would give the same answer. So why should we care whether the answer was derived from H1, from H2, or from a combination of both? The question will be: “Will it rain tomorrow?” and the answer will be: “No.” That’s all, from outside.
Only if you look inside and ask “What model of the world did you use for this prediction?” would the machine tell you about H1, H2, and infinitely many other hypotheses. Then you could ask it to use Occam’s razor to choose only the simplest one and display it to you. But internally it could keep all of them (we already suppose it has infinite memory and infinite processing power). Note that, if I understand it correctly, it would actually be impossible for the machine to tell, in general, whether two hypotheses H1 and H2 are evidence-compatible.
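To spell out why the split between H1 and H2 is invisible from outside, here is the usual weighted-average form of the mixture’s prediction (a sketch, writing len(H) for the description length of hypothesis H):

```latex
P(E \mid \text{data})
  \;=\;
  \frac{\sum_{H\,\text{compatible}} 2^{-\mathrm{len}(H)}\, P(E \mid H)}
       {\sum_{H\,\text{compatible}} 2^{-\mathrm{len}(H)}}
```

If H1 and H2 assign the same P(E | H), their two terms collapse into a single one with weight 2^{-len(H1)} + 2^{-len(H2)}, so no prediction depends on how that combined weight is split between them.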
Is there any evidence that outcomes in the universe actually occur with probabilities in proportion to their information-complexity?
They don’t. To get the probabilities about something occurring in our universe, you need to get the information about our universe first. Solomonoff Induction tells you how to do that, in a random universe. Only after you get enough evidence to understand the universe do you start getting good results.
In other words, the laws of our universe don’t say “things are probable according to their information complexity”. Instead they say other things. The problem is… at the beginning, you don’t know the laws of our universe exactly. So how can you learn them?
Imagine yourself living centuries ago. If you knew Solomonoff Induction, it would give you a non-zero probability for quantum physics (and many other things, most of them wrong). A hypothetical machine with infinite power, able to do all the calculations, could in theory derive quantum physics just from the evidence you see. Isn’t that awesome?
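Here is a toy illustration of that “with enough evidence, the complex-but-correct hypothesis wins” point. It is only a sketch under made-up assumptions: three hand-written bit-predicting rules with hand-picked description lengths, standing in for the full (uncomputable) enumeration of programs that real Solomonoff Induction requires.

```python
from fractions import Fraction

# Toy hypotheses: bit-predicting rules with hand-picked description lengths.
HYPOTHESES = {
    "all zeros":             (lambda n: 0,                                       3),
    "alternate 0101":        (lambda n: n % 2,                                   4),
    "0, then 011 repeating": (lambda n: 0 if n == 0 else (0, 1, 1)[(n - 1) % 3], 9),
}

# Pretend the most complex rule is how "our universe" actually behaves.
TRUE = "0, then 011 repeating"
data = [HYPOTHESES[TRUE][0](n) for n in range(12)]

def posterior(observed):
    """Complexity prior restricted to hypotheses compatible with the evidence."""
    weights = {name: Fraction(1, 2 ** length)
               for name, (rule, length) in HYPOTHESES.items()
               if all(rule(i) == bit for i, bit in enumerate(observed))}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Weight assigned to the true-but-complex rule as evidence accumulates:
for t in range(len(data) + 1):
    print(t, float(posterior(data[:t]).get(TRUE, 0)))
# Starts near 1/97 (penalized for its length), then jumps to 1.0 once the
# first three observed bits have ruled out both simpler rules.
```

The prior only decides how the weight is spread before the evidence arrives; after that, the data does the work.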
I phrased things in terms of probabilities inside a single universe because that is the context in which I observe & make decisions and would like SI to be useful.
Me too. But we still don’t know all the laws of our universe, so in that respect “which universe we live in” remains somewhat unknown.
However I think you could just translate what I have said back into many-worlds language and keep the question intact.
Careful. There is a difference between the quantum “many worlds”, which are all supposed to follow the same laws of physics, and hypothetical universes with other laws of physics, the Tegmark multiverse.
Again, I agree that we should only care about our laws of physics, and about our branch of the “many worlds”. But we still have the problem of not knowing exactly what the laws are, and which branch it is. So we need a method that works with multiple possible laws and multiple possible branches. With enough updating on our evidence, the probabilities of the other laws and other branches will get close to zero, and the remaining ones will be the most relevant for us.
They don’t. To get the probabilities about something occurring in our universe, you need to get the information about our universe first. Solomonoff Induction tells you how to do that, in a random universe. Only after you get enough evidence to understand the universe do you start getting good results.
Yes, but we already have lots of information about our universe. So, making use of all that, if we could start using SI to, say, predict the weather, would its predictions be well-calibrated? (They should be—modern weather predictions are already well-calibrated, and SI is supposed to be better than how we do things now.) That would require that, of all predictions compatible with currently known info, ALL of them would have to occur in EXACT PROPORTION to their bit-length complexity. Is there any evidence that this is the case?
of all predictions compatible with currently known info, ALL of them would have to occur in EXACT PROPORTION to their bit-length complexity
I admit I am rather confused here, but here is my best guess:
It is not true, in our specific world, that all predictions compatible with the past will occur in exact proportion to their bit-length complexity. Some of them will occur more frequently, some less frequently. The problem is, you don’t know which ones. All of them are compatible with the past, so how could you tell the difference, except by a lucky guess? How could any other model tell the difference, except by a lucky guess? How could you tell which model guessed the difference correctly, except by a lucky guess? So if you want the best result on average, assigning probabilities according to bit-length complexity is the best you can do.
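One way to make “best on average” precise, assuming the idealized mixture sketched earlier: the stated probability for the next observation is the posterior-weighted average of what the compatible hypotheses say,

```latex
P(\text{next} = x \mid \text{data})
  \;=\; \sum_{H\,\text{compatible}} P(H \mid \text{data})\, P(\text{next} = x \mid H),
\qquad
P(H \mid \text{data}) \;\propto\; 2^{-\mathrm{len}(H)}
```

so if the data-generating hypothesis really were drawn with probability proportional to 2^{-len(H)}, these stated probabilities would match the long-run frequencies by construction. For our particular world they are only the best available guess, which is the point of the paragraph above.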
You quoted me:
“the theory seems to predict that possible (evidence-compatible) events or states in the universe will occur in exact or fairly exact proportion to their relative complexities as measured in bits [...] if I am predicting between 2 (evidence-compatible) possibilities, and one is twice as information-complex as the other, then it should actually occur 1⁄3 of the time”
then replied:
“Let’s suppose that there are two hypotheses H1 and H2, each of them predicting exactly the same events, except that H2 is one bit longer and therefore half as likely as H1. Okay, so there is no evidence to distinguish between them. Whatever happens, we either reject both hypotheses, or we keep their ratio at 1:2.”
I am afraid I may have stated this unclearly at first. I meant: given two hypotheses that are both compatible with all currently-known evidence, but which predict different outcomes for a future event.
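For concreteness, here is the arithmetic for that clarified case, under a simplifying assumption of mine (not stated anywhere above): H1 and H2 are the only hypotheses still compatible with the evidence, H1 is k bits long, H2 is k+1 bits long, and it is H2 that predicts the event E.

```latex
P(E \mid \text{data}) = \frac{2^{-(k+1)}}{2^{-k} + 2^{-(k+1)}} = \frac{1}{3},
\qquad
P(\lnot E \mid \text{data}) = \frac{2^{-k}}{2^{-k} + 2^{-(k+1)}} = \frac{2}{3}
```

That recovers the 1⁄3 figure from the quoted question; whether events in our world actually occur at those frequencies is the separate question discussed a few comments above.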