I believe I’ve encountered a problem with either Solomonoff induction or my understanding of Solomonoff induction. I can’t post about it in Discussion, as I have less than 20 karma, and the stupid questions thread is very full (I’m not even sure if it would belong there).
I’ve read about SI repeatedly over the last year or so, and I think I have a fairly good understanding of it; good enough, at least, to follow along with informal reasoning about it. Recently I was reading Rathmanner and Hutter’s paper, and Legg’s paper, due to renewed interest in AIXI as the theoretical “best intelligence” and in the Arcade Learning Environment used to test the computable Monte Carlo AIXI approximation. Then this problem came to me.
Solomonoff Induction weights a hypothesis by the size of the description of the smallest Turing machine that outputs a given bitstring. I saw this as a problem. Say AIXI was reasoning about a fair coin. It would guess before each flip whether it would come up heads or tails. Because Turing machines are deterministic, AIXI cannot make hypotheses involving randomness. To model the fair coin, AIXI would come up with increasingly convoluted Turing machines, attempting to compress a bitstring that approaches Kolmogorov randomness as its length approaches infinity. Meanwhile, AIXI would be punished and rewarded randomly. This is not a satisfactory conclusion for a theoretical “best intelligence.” So is the claim above (that AIXI cannot make hypotheses involving randomness) a valid issue? An AI that can’t delay reasoning about a problem by at least labeling it “sufficiently random, solve later” doesn’t seem like a good AI, particularly in the real world where chance plays a significant part.
Naturally, Eliezer has already thought of this, and wrote about it in Occam’s Razor:
The formalism of Solomonoff Induction measures the “complexity of a description” by the length of the shortest computer program which produces that description as an output. To talk about the “shortest computer program” that does something, you need to specify a space of computer programs, which requires a language and interpreter. Solomonoff Induction uses Turing machines, or rather, bitstrings that specify Turing machines. What if you don’t like Turing machines? Then there’s only a constant complexity penalty to design your own Universal Turing Machine that interprets whatever code you give it in whatever programming language you like. Different inductive formalisms are penalized by a worst-case constant factor relative to each other, corresponding to the size of a universal interpreter for that formalism.
In the better (IMHO) versions of Solomonoff Induction, the computer program does not produce a deterministic prediction, but assigns probabilities to strings. For example, we could write a program to explain a fair coin by writing a program that assigns equal probabilities to all 2^N strings of length N. This is Solomonoff Induction’s approach to fitting the observed data. The higher the probability a program assigns to the observed data, the better that program fits the data. And probabilities must sum to 1, so for a program to better “fit” one possibility, it must steal probability mass from some other possibility which will then “fit” much more poorly. There is no superfair coin that assigns 100% probability to heads and 100% probability to tails.
Does this warrant further discussion, even if only to validate or refute this claim? I don’t think Eliezer’s proposal for a version of SI that assigns probabilities to strings is strong enough; it doesn’t describe what form the hypotheses would take. Would hypotheses in this new description be universal nondeterministic Turing machines, with the aforementioned probability distribution summed over the nondeterministic outputs?
Hypotheses in this description are probabilistic Turing machines. These can be cashed out to programs in a probabilistic programming language.
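To make that concrete, here is a minimal sketch in Python (the setup and names are mine, purely illustrative, not an actual SI implementation): in the probabilistic formulation a hypothesis is anything that assigns a probability to the observed string, and the fair coin is the program that assigns 2^-N to every length-N string.

    def fair_coin_hypothesis(bits: str) -> float:
        """The 'fair coin' program: assigns probability 2^-N to every length-N bitstring."""
        return 0.5 ** len(bits)

    def all_zeros_hypothesis(bits: str) -> float:
        """A deterministic hypothesis: certain the string is all zeros."""
        return 1.0 if set(bits) <= {"0"} else 0.0

    observed = "0110100011"
    for name, hypothesis in [("fair coin", fair_coin_hypothesis), ("all zeros", all_zeros_hypothesis)]:
        print(name, hypothesis(observed))
    # The fair-coin hypothesis "fits" this string with probability 2^-10, while the
    # deterministic hypothesis has been falsified and is left with probability 0.

“Fitting” the data, in the quoted sense, is then just a matter of which program assigns the observed string more probability.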
I think it’s going too far to call this a “problem with Solomonoff induction.” Solomonoff induction makes no claims; it’s just a tool that you can use or not. Solomonoff induction as a mathematical construct should be cleanly separated from the claim that AIXI is the “best intelligence,” which is wrong for several reasons.
Can probabilistic Turing machines be considered a generalization of deterministic Turing machines, so that DTMs can be described in terms of PTMs?
Editing in reply to your edit: I thought Solomonoff Induction was made for a purpose. Quoting from Legg’s paper:
Solomonoff’s induction method is an attempt to design a general all purpose inductive inference system. Ideally such a system would be able to accurately learn any meaningful hypothesis from a bare minimum of appropriately formatted information.
I’m just pointing out what I see as a limitation in the domain of problems classical Solomonoff Induction can successfully model.
Can probabilistic Turing machines be considered a generalization of deterministic Turing machines, so that DTMs can be described in terms of PTMs?
Yes.
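Concretely (a toy encoding of my own, not standard textbook notation): a probabilistic Turing machine’s transition function maps each (state, symbol) pair to a distribution over moves, and a deterministic machine is the special case where every such distribution is a point mass.

    # (state, read_symbol) -> list of ((next_state, write_symbol, head_move), probability)
    ptm_delta = {
        ("q0", "0"): [(("q0", "1", "R"), 0.5), (("q1", "0", "R"), 0.5)],  # a genuinely random step
    }

    # A deterministic transition is just the probability-1 special case:
    dtm_delta = {
        ("q0", "0"): [(("q1", "1", "R"), 1.0)],
    }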
I’m just pointing out what I see as a limitation in the domain of problems classical Solomonoff Induction can successfully model.
I don’t think anyone claims that this limitation doesn’t exist (and anyone who claims this is wrong). But if your concern is with actual coins in the real world, I suppose the hope is that AIXI would eventually learn enough about physics to just correctly predict the outcome of coin flips.
The steelman is to replace coin flips with radioactive decay and then go through with the argument.
Might be worth having those more often too; the last one was very popular, and had lots of questions that open threads don’t typically attract.
Someone want to start one the day after tomorrow? Run monthly or something? Let’s see what happens.
Because Turing machines are deterministic, AIXI cannot make hypotheses involving randomness. To model the fair coin, AIXI would come up with increasingly convoluted Turing machines, attempting to compress a bitstring that approaches Kolmogorov randomness as its length approaches infinity. Meanwhile, AIXI would be punished and rewarded randomly.
Just a naïve thought, but maybe it would come up with MWI fairly quickly because of this. (I can imagine this being a beisutsukai challenge – show a student radioactive decay, and see how long it takes them to come up with MWI.) A probabilistic one is probably better for the other reasons brought up, though.
To come up with MWI, it would have to conceive of different potentialities and then a probabilistic selection. I don’t know, I’m not seeing how deterministic Turing machines could model that.
Do you see how a nondeterministic Turing machine could model that?
If so … … …
Suppose in QM you have a wavefunction which recognizably evolves into a superposition of wavefunctions. I’ll write that psi0, the initial wavefunction, becomes m.psi’ + n.psi″, where m and n are coefficients, and psi’ and psi″ are basis wavefunctions.
Something slightly analogous to the MWI interpretation of this, could be seen in a Turing machine which started with one copy of a bitstring, PSI0, and which replaced it with M copies of the bitstring PSI’ and N copies of the bitstring PSI″. That would be a deterministic computation which replaces one world, the single copy of PSI0, with many worlds, the multiple copies of PSI’ and PSI″.
So it’s straightforward enough for a deterministic state machine to invent rules corresponding to a proliferation of worlds. In fact, in the abstract theory of computation, this is one of the standard ways to model nondeterministic computation—have a deterministic computation which deterministically produces all the possible paths that could be produced by the nondeterministic process.
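As a sketch of what that looks like (a toy branching rule I made up, nothing to do with real quantum dynamics), a deterministic loop can simply carry the whole frontier of branches along:

    def nondet_step(world: str) -> list[str]:
        """Toy nondeterministic rule: each world splits into two successors."""
        return [world + "0", world + "1"]

    def enumerate_worlds(initial: str, steps: int) -> list[str]:
        frontier = [initial]
        for _ in range(steps):
            frontier = [successor for world in frontier for successor in nondet_step(world)]
        return frontier

    print(enumerate_worlds("start", 3))  # 8 branches, produced by a purely deterministic computation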
However, the way that QM works, and thus the way that a MWI theory would have to work, is rather more complicated, because the coefficients are complex numbers, the probabilities (which one might suppose correspond to the number of copies of each world) are squares of the absolute values of those complex numbers, and probability waves can recombine and destructively interfere, so you would need worlds / bitstrings to be destroyed as well as created.
In particular, it seems that you couldn’t reproduce QM with a setup in which the only fact about each world / bitstring was the number of current copies—you need the “phase information” (angle in the complex plane) of the complex numbers, in order to know what the interference effects are. So your Turing machine’s representation of the state of the multiverse would be something like:
(complex coefficient associated with the PSI’ worlds) (list of M copies of the PSI’ bitstring); (complex coefficient associated with the PSI″ worlds) (list of N copies of the PSI″ bitstring) ; …
and the “lists of copies of worlds” would all be dynamically irrelevant, since the dynamics comes solely from recombining the complex numbers at the head of each list of copies. At each timestep, the complex numbers would be recomputed, and then the appropriate number of world-copies would be entered into each list.
But although it’s dynamically irrelevant, that list of copies of identical worlds is still performing a function, namely, it’s there to ensure that there actually are M out of every (M+N) observers experiencing worlds of type PSI’, and N out of every (M+N) observers experiencing worlds of type PSI″. If the multiverse representation was just
(complex coefficient of PSI’ world) (one copy of PSI’ world) ; (complex coefficient of PSI″ world) (one copy of PSI″ world) ; …
then all those complex numbers could still evolve according to the Schrodinger equation, but you would only have one observer seeing a PSI’ world, and one observer seeing a PSI″ world, and this is inconsistent with observation, where we see that some quantum events are more probable than others.
This is the well-known problem of recovering the Born probabilities, or justifying the Born probability rule—mentioned in several places in the QM Sequence—but expressed in the unusual context of bit-strings on a Turing tape.
(Incidentally, I have skipped over the further problem that QM uses continuous rather than discrete quantities, because that’s not a problem of principle—you can just represent the complex numbers the way we do on real computers, to some finite degree of binary precision.)
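Here is a toy numerical sketch of that bookkeeping (entirely my own simplification: two basis worlds, a Hadamard-like step, and copy counts read off from squared amplitudes), mainly to show why the complex coefficients have to be kept around for interference:

    import math

    def hadamard_step(state: dict[str, complex]) -> dict[str, complex]:
        """One Hadamard-like update on a two-world system; branches can destructively interfere."""
        s = 1 / math.sqrt(2)
        a, b = state.get("PSI_A", 0j), state.get("PSI_B", 0j)
        return {"PSI_A": s * a + s * b, "PSI_B": s * a - s * b}

    def copy_counts(state: dict[str, complex], total: int) -> dict[str, int]:
        """Born-rule bookkeeping: number of copies of each world proportional to |amplitude|^2."""
        return {world: round(total * abs(amp) ** 2) for world, amp in state.items()}

    state = {"PSI_A": 1 + 0j, "PSI_B": 0j}
    state = hadamard_step(state)
    print(copy_counts(state, 100))  # {'PSI_A': 50, 'PSI_B': 50} -- two equally populated worlds
    state = hadamard_step(state)
    print(copy_counts(state, 100))  # {'PSI_A': 100, 'PSI_B': 0} -- the PSI_B branch interferes away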
… keep in mind that deterministic Turing machines can trivially simulate nondeterministic Turing machines.
The problem here seems to be one of notation. You are using nondeterministic Turing machine in the formal sense of the term, where Mitchell seems to be using nondeterministic closer to “has a source of random bits.”
Trivially? I was under the impression that it could involve up to an exponential slowdown, while probabilistic Turing machines can simulate deterministic Turing machines merely by assigning probability 1 to a single transition for each (state, symbol) pair.
Algorithmically trivially; I didn’t see anyone concerned about running times.
Well, wouldn’t that be because it’s all theorizing about computational complexity?
I see the point. Pseudorandom number generators would be what you mean by simulation of nondeterminism in a DTM? Would a deterministic UTM with an RNG be sufficient for AIXI to hypothesize randomness? I still don’t see how SI would be able to hypothesize Turing machines that produce bitstrings that are probabilistically similar to the bitstring it is “supposed” to replicate.
Eliezer’s proposal was a different notation, not an actual change in the strength of Solomonoff Induction. The usual form of SI with deterministic hypotheses is already equivalent to one with probabilistic hypotheses. This is because a single hypothesis with prior probability P that assigns uniform probability to each of 2^N different bitstrings makes the same predictions as an ensemble of 2^N deterministic hypotheses each of which has prior probability P*2^-N and predicts one of the bitstrings with certainty; and a Bayesian update in the former case is equivalent to just discarding falsified hypotheses in the latter. Given any computable probability distribution, you can with O(1) bits of overhead convert it into a program that samples from that distribution when given a uniform random string as input, and then convert that into an ensemble of deterministic programs with different hardcoded values of the random string. (The other direction of the equivalence is obvious: a computable deterministic hypothesis is just a special case of a computable probability distribution.)
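A small numerical check of that equivalence (a toy setup of my own, N = 8, not anything from the papers):

    from itertools import product

    N, P = 8, 0.5            # P = prior probability of the "fair coin" hypothesis
    observed = "0101"        # observed prefix of the environment's bitstring

    # Formulation 1: one probabilistic hypothesis assigning 2^-N to every length-N string.
    post_prob = P * 0.5 ** len(observed)   # prior times likelihood of the observed prefix
    pred_one_prob = 0.5                    # it predicts the next bit is 1 with probability 0.5

    # Formulation 2: an ensemble of 2^N deterministic hypotheses, each with prior P * 2^-N;
    # the Bayesian update is just discarding the falsified ones.
    survivors = ["".join(b) for b in product("01", repeat=N) if "".join(b).startswith(observed)]
    post_ensemble = len(survivors) * P * 2 ** -N
    pred_one_ensemble = sum(s[len(observed)] == "1" for s in survivors) / len(survivors)

    print(post_prob, post_ensemble)          # 0.03125 0.03125 -- same posterior mass
    print(pred_one_prob, pred_one_ensemble)  # 0.5 0.5 -- same next-bit prediction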
Yes, if you put a Solomonoff Inductor in an environment that contains a fair coin, it would come up with increasingly convoluted Turing machines. This is a problem only if you care about the value of an intermediate variable (posterior probability assigned to individual programs), rather than the variable that SI was actually designed to optimize, namely accurate predictions of sensory inputs. This manifests in AIXI’s limitation to using a sense-determined utility function. (Granted, a sense-determined utility function really isn’t a good formalization of my preferences, so you couldn’t build an FAI that way.)
I don’t think anyone has pointed you in quite the right direction for getting a fully satisfactory answer to your question. I think what you’re looking for is the concept of Continuous Universal A Priori Probability:
The universal distribution m is defined in a discrete domain, its arguments are finite binary strings. For applications such as the prediction of growing sequences it is necessary to define a similar distribution on infinite binary sequences. This leads to the universal semi-measure M defined as the probability that the output of a monotone universal Turing machine U starts with x when provided with fair coin flips on the input tape.
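Written out (this is the standard definition, as in Li and Vitanyi, restated here for convenience):

    M(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}

where the sum runs over the minimal programs p for which the monotone universal machine U produces an output beginning with x, and \ell(p) is the length of p in bits.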
For more details see the linked article, or if you’re really interested in this field, get the referenced textbook by Li and Vitanyi.
EDIT: On second thought I’ll spell out what I think is the answer, instead of just giving you this hint. This form of Solomonoff Induction, when faced with a growing sequence of fair coin flips, will quickly assign high probability to input tapes that start with the equivalent of “copy the rest of the input tape to the output tape as is”, and therefore can be interpreted as assigning high probability to the hypothesis that it is facing a growing sequence of fair coin flips.
Qiaochu has already answered your question about SI, but to also attack your question about AIXI:
Careful about what you’re assuming. You’re implicitly assuming that the AI doesn’t know that what is being flipped is a fair coin. If the AI had that knowledge, it could replace all those convoluted descriptions with a simple one: “Generate a pseudorandom number”. This would be just as effective as any other predictor, and indeed it would be very short and easy to run.
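A quick sanity check of the “just as effective as any other predictor” point (a toy simulation of my own, nothing more):

    import random

    random.seed(0)
    flips = [random.randint(0, 1) for _ in range(100_000)]  # stand-in for the fair coin

    prng_hits = pattern_hits = 0
    prev = 0
    for flip in flips:
        prng_hits += (random.randint(0, 1) == flip)   # "generate a pseudorandom number"
        pattern_hits += (prev == flip)                # a naive attempt to find order: repeat the last flip
        prev = flip

    print(prng_hits / len(flips), pattern_hits / len(flips))  # both come out at roughly 0.5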
Now, what if the AI doesn’t know this? Then you are feeding it random numbers and expecting it to find order in them. In other words, you’re asking the hardest problem of all. It makes sense that it would expend a huge amount of computational power trying to find some order in random numbers. Put yourself in the computer’s place. How on Earth would you ever be able to know if the string of 0s and 1s you are being presented with is really just random or the result of some incredibly complicated computer program? No one’s telling you!
Finally, if the coin is actually a real physical coin, the computer will keep trying more and more complicated hypotheses until it has modelled your fingers, the fluid dynamics of the air, and the structure of the ground. Once it has done so, it will indeed be able to predict the outcome of the coin flip with accuracy.
Note that the optimality of AIXI is subject to several important gotchas. It is a general problem solver, and can do better than any other general problem solver, but there’s no guarantee that it will do better than specific problem solvers on certain problems. This is because a specifically-designed problem solver carries problem-specific information with it—information that AIXI may not have access to.
Even a very small amount of information (say, a few tens of bits) about a problem can greatly reduce the search space. Just 14 bits of information (two ASCII characters) can reduce the search space by a factor of 2^14 = 16384.
Naturally, Eliezer has already thought of this, and wrote about it in Occam’s Razor:
It seems to completely answer your question. That is, one can think about probabilities and formulate and test probabilistic hypotheses, without needing to generate any random numbers.