I’m confused. I thought SI assigns equal probabilities to every program in the infinite space, and then simpler predictions turn out to be more likely because there are many more programs that make simple predictions than there are programs that make complicated predictions.
As I understand it, SI works even better in the finite case than in the infinite case, because you don’t have to worry about infinite sums and whatnot. (Provided the domain of the finite case was something like “All the programs of length N or less” for some finite N.)
Yep, you’re confused. You can’t possibly assign equal probability to every program in an infinite space, because then the sum of all probabilities will diverge (go to infinity), and you need it to sum up to 1. What you do is assign probabilities in a way that the infinite sum will converge to 1, for example by lining all programs up in an infinite list: M1, M2, M3… and assigning probabilities 1⁄2, 1⁄4, 1⁄8, 1⁄16, 1⁄32 and so on to infinity. That’s essentially what the universal prior does, using a length-sorted infinite list of all possible programs (I’m skipping a few small technical difficulties).
The simpler predictions are typically more likely because of the one shortest program that generates them, not the many longer programs that also do—they do help, but because of the exponential decay they usually don’t amount to much. The shortest program tends to dominate.
Except I just lied a little in what I said. SI prediction doesn’t care if your program is short or long. It cares how early it comes on the infinite list M1,M2,M3… of all programs. When you arrange this list by program length, “earliest in the list” and “shortest” are basically the same. And in the usual infinite case, in a sense you cannot NOT arrange the list by program length. You can give your favourite very long programs some of the first spots on the list, and so high probabilities, but there’s an infinity of long programs out there, and you can’t nudge the infinity to where you want to go. But in the finite thought experiment, you’re free to arrange your list any way you want (and you don’t even need to have probabilities go down exponentially either). So you can just piss on Occam’s Razor and strictly reverse it by having longer programs always be more likely than shorter ones, if you feel like it. And SI prediction will still work.
That’s what I used to think about how SI worked. Then I read this explanation which seems to make sense to me, and seems to justify my view of things.
Specifically, this explanation shows how Solomonoff Induction doesn’t need to assume a prior that some hypotheses are more likely than others; it can weight them all equally, and then prove that simpler hypotheses have more “copies” and thus simpler predictions are more likely.
In the infinite case there is still the problem of how to get an infinite number of equally-weighted hypotheses to sum to probability 1. But this is what Measure Theory does, I believe. (I’m not a mathematician, but this is what I’ve read and been told.) So it isn’t a problem, any more than math is a problem.
But if the space was finite, as I said, then Solomonoff Induction wouldn’t even have that problem!
So, I no longer think I’m confused. I think that your understanding of SI portrays it as more arbitrary than it actually is. SI isn’t just a weighting by simplicity! It is a proof that weighting by simplicity is justified, given certain premises! (namely, that the space of hypotheses is the space of computable programs in a certain language, and that they are all equally likely, modulo evidence.)
In the infinite case there is still the problem of how to get an infinite number of equally-weighted hypotheses to sum to probability 1. But this is what Measure Theory does, I believe. (I’m not a mathematician, but this is what I’ve read and been told.) So it isn’t a problem, any more than math is a problem.
I am a mathematician, and yes measure theory can do this, but it will require you to make many (in fact infinitely many) arbitrary choices along the way.
Huh. Okay, thanks for the info. This is troubling, because I have long held out hope that SI would not turn out to be arbitrary. Could you direct me to where I can learn more about this arbitrariness in measure theory?
You can say that the input tape (where the program gets loaded from) is an infinite sequence of random bits. A ‘program of length l’ is a piece of code which sets the machine up so that after l bits subsequent bits do not matter. Every short program is also a set of longer programs.
You can. But is there any reason to think that this models well the inherent complexity of programs? Do we ever execute programs by choosing randomly bits that constitute them? We have algorithms that utilize randomness, to be sure, but UTMs are not, normally, such algorithms. I appreciate that “choosing program bits randomly” is a simple, short, succinct way of getting to 2^-length(p), but I don’t see a reason to think it’s natural in any interesting sense.
Well, indeed. There’s nothing natural about that whole silly notion. Laws of physics in our universe are highly symmetric (and fundamental laws are even time-reversible, when mirrored and charge-flipped), what works for guessing the laws of physics is assumptions of symmetries, something that S.I. most dramatically does not do.
Furthermore, incorrect theories in S.I. can take very long time to rule out because it is very easy to set up—in very few bits—a long-running busy beaver that changes the output after many bits are outputted. For an S.I. driven agent, there’s various arbitrary doomsdays looming with probabilities greater than 1/1024 (extra length less than 10 bits) .
Furthermore, in total absence of external input, an ideal inductor must assume it is not within an universe where the laws of physics do not support universal computation. This rules out vast majority of the input tapes, which are given non-zero prior in S.I.
Furthermore, incorrect theories in S.I. can take very long time to rule out because it is very easy to set up—in very few bits—a long-running busy beaver that changes the output after many bits are outputted. For an S.I. driven agent, there’s various arbitrary doomsdays looming with probabilities greater than 1/1024 (extra length less than 10 bits).
Do you think a better agent should assign low probabilities to such arbitrary doomsdays in the first place? What would be a good general rule for that?
Having thought about it some more, I feel that a good induction method would start with p_doomsday(time) being a smooth function, and it would have to acquire great many bits of information-theoretic data before that function would grow very sharp and specific peaks.
Meanwhile, S.I. starts with an enormous set of weird preconceptions due to the use of some one dimensional Turing machine, and consequently produces really really bad priors. The badness of said priors is somewhat masked by very-easy-for-laymen-to-misinterpret optimality proofs.
I think that if you plot sane agent’s probability of doomsday by time, it won’t be some really weird shaped curve with incredibly sharp (Planck-time sharp) peaks at various points highly specific to the internal details of the agent. Really, it’s as if someone with a weird number-fear synasthesia looked at the list of years, since the alleged birth of Christ and picked the scariest numbers, then prophesied doomsday at those years. It is clearly completely insane.
I don’t understand the question. I thought that’s what we were talking about. Am I missing something?
To be more explicit: setting up a UTM with random bits on the input tape is a natural-seeming way of getting the probability distribution over programs (2^-length(p)) that goes into the Solomonoff prior. But as I’m trying to say in the comment you replied to, I don’t think it’s really natural at all. And of course SI doesn’t need this particular distribution in order to be effective at its job.
Yep, you’re confused. You can’t possibly assign equal probability to every program in an infinite space, because then the sum of all probabilities will diverge (go to infinity), and you need it to sum up to 1. What you do is assign probabilities in a way that the infinite sum will converge to 1, for example by lining all programs up in an infinite list: M1, M2, M3… and assigning probabilities 1⁄2, 1⁄4, 1⁄8, 1⁄16, 1⁄32 and so on to infinity. That’s essentially what the universal prior does, using a length-sorted infinite list of all possible programs (I’m skipping a few small technical difficulties).
Only if you insist on assigning non-zero probability to each individual program.
Is that responding to the “You can’t possibly assign equal probability to every program in an infinite space” part of the parent comment? If you assign equal probability to every program in an infinite space, then (assuming “probability” refers to a real number, and not some other mathematical object such as an infinitesimal) either that equal probability is non-zero, in which case the sum is infinite, or the equal probability is zero, in which case the sum is zero. Remember, if all the probabilities are equal, then either all of them are equal to zero, or none are, and it’s generally agreed that an infinite number of zeros add up to zero.
and it’s generally agreed that an infinite number of zeros add up to zero.
Actually 0·∞ is considered an indeterminate form. For example the interval from 0 to 1 has length 1 even though it’s composed of infinitely many points each of which has length 0.
See my reply to asr for links to more technical explanations.
Yep, you’re confused. You can’t possibly assign equal probability to every program in an infinite space, because then the sum of all probabilities will diverge (go to infinity), and you need it to sum up to 1.
Only if you insist on assigning non-zero probability to each individual program.
If you assign an equal probability to every program, and that probability is zero, then the sum, the total probability, is zero and. If you assign an equal nonzero probability, the probability is infinite.
If you want to have a valid probability distribution, you can’t assign equal probabilities.
Well, if you’re willing to relax the third axiom of probability to finite additivity you can. Alternatively, use an uncountable state space, e.g., the space of all ultra-filters on Turing machines. Also, while we’re on the subject, there are numerous types of quasi-probability distributions worth considering.
I’m confused. I thought SI assigns equal probabilities to every program in the infinite space, and then simpler predictions turn out to be more likely because there are many more programs that make simple predictions than there are programs that make complicated predictions.
As I understand it, SI works even better in the finite case than in the infinite case, because you don’t have to worry about infinite sums and whatnot. (Provided the domain of the finite case was something like “All the programs of length N or less” for some finite N.)
Yep, you’re confused. You can’t possibly assign equal probability to every program in an infinite space, because then the sum of all probabilities will diverge (go to infinity), and you need it to sum up to 1. What you do is assign probabilities in a way that the infinite sum will converge to 1, for example by lining all programs up in an infinite list: M1, M2, M3… and assigning probabilities 1⁄2, 1⁄4, 1⁄8, 1⁄16, 1⁄32 and so on to infinity. That’s essentially what the universal prior does, using a length-sorted infinite list of all possible programs (I’m skipping a few small technical difficulties).
The simpler predictions are typically more likely because of the one shortest program that generates them, not the many longer programs that also do—they do help, but because of the exponential decay they usually don’t amount to much. The shortest program tends to dominate.
Except I just lied a little in what I said. SI prediction doesn’t care if your program is short or long. It cares how early it comes on the infinite list M1,M2,M3… of all programs. When you arrange this list by program length, “earliest in the list” and “shortest” are basically the same. And in the usual infinite case, in a sense you cannot NOT arrange the list by program length. You can give your favourite very long programs some of the first spots on the list, and so high probabilities, but there’s an infinity of long programs out there, and you can’t nudge the infinity to where you want to go. But in the finite thought experiment, you’re free to arrange your list any way you want (and you don’t even need to have probabilities go down exponentially either). So you can just piss on Occam’s Razor and strictly reverse it by having longer programs always be more likely than shorter ones, if you feel like it. And SI prediction will still work.
That’s what I used to think about how SI worked. Then I read this explanation which seems to make sense to me, and seems to justify my view of things.
Specifically, this explanation shows how Solomonoff Induction doesn’t need to assume a prior that some hypotheses are more likely than others; it can weight them all equally, and then prove that simpler hypotheses have more “copies” and thus simpler predictions are more likely.
In the infinite case there is still the problem of how to get an infinite number of equally-weighted hypotheses to sum to probability 1. But this is what Measure Theory does, I believe. (I’m not a mathematician, but this is what I’ve read and been told.) So it isn’t a problem, any more than math is a problem.
But if the space was finite, as I said, then Solomonoff Induction wouldn’t even have that problem!
So, I no longer think I’m confused. I think that your understanding of SI portrays it as more arbitrary than it actually is. SI isn’t just a weighting by simplicity! It is a proof that weighting by simplicity is justified, given certain premises! (namely, that the space of hypotheses is the space of computable programs in a certain language, and that they are all equally likely, modulo evidence.)
I am a mathematician, and yes measure theory can do this, but it will require you to make many (in fact infinitely many) arbitrary choices along the way.
Huh. Okay, thanks for the info. This is troubling, because I have long held out hope that SI would not turn out to be arbitrary. Could you direct me to where I can learn more about this arbitrariness in measure theory?
You can say that the input tape (where the program gets loaded from) is an infinite sequence of random bits. A ‘program of length l’ is a piece of code which sets the machine up so that after l bits subsequent bits do not matter. Every short program is also a set of longer programs.
You can. But is there any reason to think that this models well the inherent complexity of programs? Do we ever execute programs by choosing randomly bits that constitute them? We have algorithms that utilize randomness, to be sure, but UTMs are not, normally, such algorithms. I appreciate that “choosing program bits randomly” is a simple, short, succinct way of getting to 2^-length(p), but I don’t see a reason to think it’s natural in any interesting sense.
Well, indeed. There’s nothing natural about that whole silly notion. Laws of physics in our universe are highly symmetric (and fundamental laws are even time-reversible, when mirrored and charge-flipped), what works for guessing the laws of physics is assumptions of symmetries, something that S.I. most dramatically does not do.
Furthermore, incorrect theories in S.I. can take very long time to rule out because it is very easy to set up—in very few bits—a long-running busy beaver that changes the output after many bits are outputted. For an S.I. driven agent, there’s various arbitrary doomsdays looming with probabilities greater than 1/1024 (extra length less than 10 bits) .
Furthermore, in total absence of external input, an ideal inductor must assume it is not within an universe where the laws of physics do not support universal computation. This rules out vast majority of the input tapes, which are given non-zero prior in S.I.
Do you think a better agent should assign low probabilities to such arbitrary doomsdays in the first place? What would be a good general rule for that?
Having thought about it some more, I feel that a good induction method would start with p_doomsday(time) being a smooth function, and it would have to acquire great many bits of information-theoretic data before that function would grow very sharp and specific peaks.
Meanwhile, S.I. starts with an enormous set of weird preconceptions due to the use of some one dimensional Turing machine, and consequently produces really really bad priors. The badness of said priors is somewhat masked by very-easy-for-laymen-to-misinterpret optimality proofs.
I think that if you plot sane agent’s probability of doomsday by time, it won’t be some really weird shaped curve with incredibly sharp (Planck-time sharp) peaks at various points highly specific to the internal details of the agent. Really, it’s as if someone with a weird number-fear synasthesia looked at the list of years, since the alleged birth of Christ and picked the scariest numbers, then prophesied doomsday at those years. It is clearly completely insane.
Just checking—you do know the formulation of SI that uses a universal prefix Turing machine provided with fair coinflips on the input tape, right?
I don’t understand the question. I thought that’s what we were talking about. Am I missing something?
To be more explicit: setting up a UTM with random bits on the input tape is a natural-seeming way of getting the probability distribution over programs (2^-length(p)) that goes into the Solomonoff prior. But as I’m trying to say in the comment you replied to, I don’t think it’s really natural at all. And of course SI doesn’t need this particular distribution in order to be effective at its job.
Yeah, sorry. It’s me who was missing something =)
Only if you insist on assigning non-zero probability to each individual program.
Is that responding to the “You can’t possibly assign equal probability to every program in an infinite space” part of the parent comment? If you assign equal probability to every program in an infinite space, then (assuming “probability” refers to a real number, and not some other mathematical object such as an infinitesimal) either that equal probability is non-zero, in which case the sum is infinite, or the equal probability is zero, in which case the sum is zero. Remember, if all the probabilities are equal, then either all of them are equal to zero, or none are, and it’s generally agreed that an infinite number of zeros add up to zero.
Actually 0·∞ is considered an indeterminate form. For example the interval from 0 to 1 has length 1 even though it’s composed of infinitely many points each of which has length 0.
See my reply to asr for links to more technical explanations.
If you assign an equal probability to every program, and that probability is zero, then the sum, the total probability, is zero and. If you assign an equal nonzero probability, the probability is infinite.
If you want to have a valid probability distribution, you can’t assign equal probabilities.
Well, if you’re willing to relax the third axiom of probability to finite additivity you can. Alternatively, use an uncountable state space, e.g., the space of all ultra-filters on Turing machines. Also, while we’re on the subject, there are numerous types of quasi-probability distributions worth considering.
Ah, fair enough. I had been assuming only real probabilities.