Solomonoff induction still works if the universe is uncomputable, and its usefulness doesn’t require knowing Occam’s razor
Note: I don’t think this idea is original, but I couldn’t find a good post going over the implications.
I used to think that Solomonoff induction was a bit arbitrary for the following reason: it assigned a 100% probability to the universe being computable. I’m pretty sure the universe is computable (ignoring randomness), but nowhere near 100% sure. Who’s to say we won’t find a halting oracle floating in space for no reason? That seems like a pretty simple hypothesis. Why the focus on recursive languages? You have to make some choice of how descriptions work (you can’t assign positive probability to every infinite bit string), but that didn’t change the feeling of arbitrariness.
But then I realized this understanding of why to use Solomonoff induction is incorrect. We do not use it because of the physical Church-Turing thesis; we use it because of the original Church-Turing thesis:
L.C.M.s [logical computing machines: Turing’s expression for Turing machines] can do anything that could be described as ‘rule of thumb’ or ‘purely mechanical’. - Alan Turing
Because what matters is not whether the universe is computable, but whether our methods of reasoning are computable. Or in other words, whether the map is computable. Solomonoff’s induction is at least as “good” as any computable inference method (up to a constant), regardless of the complexity of the universe. So if you, as a human, are trying to come up with a systematic way to predict things (even uncomputable things), Solomonoff’s induction is better. Here is the precise statement:
Theorem: Let D be a probability distribution on infinite sequences of bits such that inferring the next bit from a prefix is computable. The likelihood ratio from D to Solomonoff induction’s prior is bounded above by some finite constant (despite the sequence containing infinitely many bits), and this constant is independent of the sequence of bits.
Proof sketch: (Note: this is already a well-known result.) There is a program p that is a decoder for an entropy code over finite bit strings based on D. Given any finite bit string x, we can find a y such that p(y) = x. The length of y is approximately -log₂(D(x)) (where D(x) is the probability of a bit string starting with x according to D). The probability of a string starting with x under Solomonoff induction’s prior is therefore greater than about 2^-(|p| + |y|) ≈ 2^-|p| · D(x). So regardless of the content or length of x, the likelihood ratio is bounded by 2^|p| (which only depends on D).
A nice thing about this result is that it makes no assumptions about the sequence of bits. The theorem is not just saying Solomonoff induction is good vs. D on most sequences, or even almost all of them. It is good for every particular sequence of bits, because there is an upper bound on the likelihood ratio that is independent of the sequence. There is no adversarial example.
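To make the flavor of this bound concrete, here is a toy Python sketch (my own illustration with made-up predictors, not Solomonoff induction itself, which is uncomputable). A Bayesian mixture over a few computable predictors has cumulative log-loss at most -log₂(prior weight) worse than any of its components, on every sequence; the Solomonoff prior is the analogous mixture over all programs, with weights roughly 2^-|p|.

```python
# A tiny Bayesian mixture over a few toy computable predictors. On *every*
# sequence, the mixture's cumulative log2-loss exceeds any component's by at
# most -log2(that component's prior weight) -- the same kind of bound the
# theorem above gives for the Solomonoff prior.
import math, random

def always_half(prefix):          # predicts P(next bit = 1) = 0.5
    return 0.5

def frequency(prefix):            # Laplace rule of succession
    return (sum(prefix) + 1) / (len(prefix) + 2)

def alternating(prefix):          # bets that the sequence alternates
    if not prefix:
        return 0.5
    return 0.9 if prefix[-1] == 0 else 0.1

predictors = [always_half, frequency, alternating]
weights = [0.5, 0.25, 0.25]       # stand-ins for 2^-|p|; must sum to <= 1

def log_loss(predict, bits):
    """Cumulative log2-loss of a predictor on a bit sequence."""
    total, prefix = 0.0, []
    for b in bits:
        p = predict(prefix)
        total += -math.log2(p if b == 1 else 1 - p)
        prefix.append(b)
    return total

def mixture(prefix):
    """Posterior-weighted mixture prediction, like a tiny 'universal' prior."""
    posts = [w * 2 ** -log_loss(f, prefix) for w, f in zip(weights, predictors)]
    z = sum(posts)
    return sum(p * f(prefix) for p, f in zip(posts, predictors)) / z

bits = [random.randint(0, 1) for _ in range(200)]   # any sequence works
for w, f in zip(weights, predictors):
    gap = log_loss(mixture, bits) - log_loss(f, bits)
    print(f"{f.__name__}: bound {-math.log2(w):.1f} bits, actual gap {gap:.2f}")
```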
(Notice how we never invoked Occam’s razor to argue that Solomonoff induction is superior. We can instead go the other way; Occam’s razor is good because it’s an informal version of an ideal inference procedure.)
Solomonoff induction vs. a human in an uncomputable world
How does this shake out in an uncomputable universe? What if you’re convinced that there is some orb in the universe emitting the digits of Chaitin’s constant or something? We’ll let Alice be a reasoner using Solomonoff induction, and Bob be a human.
Bob: Ah yes, I have found it! An orb emitting the digits of Chaitin’s constant is floating around in our solar system!
Alice: How do you figure?
Bob: I calculated the first two digits, and they matched!
Alice: Surprising! But not that surprising (about six and a half bits).
Bob: I was surprised too, but now I can do something better than you. I can predict the digits that orb will emit.
Alice: How do you plan to predict an uncomputable sequence, given that you’re a human?
Bob: Oh yeah…
Alice: In fact, if you’re correct it will eventually look like a uniformly random sequence to you since it is algorithmically random. So I’ll be able to keep up with you just fine.
Harsh! But maybe Bob’s hypothesis will let him use the orb for something?
Bob: A ha! My hypothesis has allowed me to predict theorems proven by mathematicians before they actually proved them.
Alice: How so?
Bob: I can use the digits emitted by the orb to do computations that predict theorems. And my observations are verified when mathematicians prove it the normal way.
Alice: Yeah, I noticed the orb can do that too.
Bob: Doesn’t that support the idea that the orb is emitting the digits of Chaitin’s constant?
Alice: You could understand some of my current hypotheses as making an assumption that the orb does this, and exploiting it to learn things about other parts of the environment. However, the code is not commented, so this is just a matter of interpretation. I don’t have a hypothesis that predicts in advance that the orb emits the digits of Chaitin’s constant. But technically you don’t either, since you yourself don’t know the digits.
If the environment has uncomputable features, Alice’s hypotheses act like oracle machines. She can’t predict those features, but she can use them to predict other features of the environment. Going back to our theorem, if Bob can use the oracle, so can Alice because she has Bob as one of her hypotheses.
Hypotheses as methods of reasoning instead of as possible worlds
We often think of Solomonoff induction’s hypotheses as being possible worlds. I will argue for a different intuition; the hypotheses are forms of reasoning.
Any systematic method to make predictions used by any human, in the past, present, or future, is one of the hypotheses. If the human somehow has uncomputable intuitions (unlikely, but possible), we can just treat that as part of the environment and Solomonoff induction will figure out how to exploit it.
The hypotheses are possible maps instead of possible territories!
Conclusion
Solomonoff induction is great because its hypotheses include all forms of systematic reasoning (by the Church-Turing thesis). Even if we believe the universe is uncomputable and we are right, Solomonoff induction will make predictions just as well as us. In particular, it will act like an oracle machine with respect to the uncomputable parts of the environment, but only after seeing them. But since they are uncomputable, we can’t systematically predict them in advance either. If the universe is uncomputable, the litany of Tarski becomes a physically impossible desire!
I do have one unanswered question though: so anyways, how likely is it that the universe has a halting oracle? 🤔
Note that Solomonoff induction is not itself computable.
I don’t think this is straightforwardly true, because Solomonoff induction does not seem to be capable of learning the statistical dependency, i.e. while Solomonoff induction agrees that the digits known so far have been spookily predictive, it thinks this is just a coincidence and that further digits have no reason to be predictive.
Ah, but it must! In the halting oracle case, it is easy to design a Turing machine that (1) predicts the phenomenon will be random but (2) treats the phenomenon as if it were a halting oracle when predicting other things.
More generally, it’s easy to imagine a human who believes some signal from space is a halting oracle (or can work under that assumption). And by our theorem, Solomonoff induction only needs a finite amount of evidence to converge to their strategy.
It is impossible for Solomonoff induction to keep treating something as a coincidence when there is a computable strategy that doesn’t.
Maybe. I have trouble seeing this work out in practice, because you never run a Turing machine forever in the physical world, so halting oracles rarely make explicit physical predictions but are instead more of a logical tool. But Solomonoff induction essentially assumes unbounded compute (and, in some but not all ways, logical omniscience), so it is hard for it to get much use out of one. The easiest examples I can construct are cases where you’ve got multiple uncomputable things and Solomonoff induction then uses one uncomputable thing to predict another; not cases where you’ve got a single uncomputable thing that it uses to make predictions of other things.
Yeah that’s probably better. The simplest example is that if you have two orbs outputting the digits of Chaitin’s constant, Solomonoff induction learns that they are outputting the same thing.
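A rough way to see this with an everyday compressor (my own analogy, not Solomonoff induction itself): if the second stream is a copy of the first, "the second orb repeats the first" is a very short hypothesis, so the pair costs barely more to describe than one stream alone.

```python
# Toy analogy with zlib (a real-world lossless compressor standing in for
# "shortest description"): two identical incompressible streams compress to
# about the size of one, while two independent streams do not.
import os, zlib

orb1 = os.urandom(10_000)          # stands in for an incompressible digit stream
orb2_same = orb1                   # the second orb emits the same digits
orb2_diff = os.urandom(10_000)     # or an unrelated incompressible stream

print(len(zlib.compress(orb1)))              # ~10,000: noise doesn't compress
print(len(zlib.compress(orb1 + orb2_same)))  # ~10,000: "copy the first stream" is cheap
print(len(zlib.compress(orb1 + orb2_diff)))  # ~20,000: two independent streams
```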
The main point is that there is no way a human does better in an uncomputable universe.
Yeah that problem still remains. Solomonoff induction is still only relevant to law-thinking.
I think the even worse problem is that reasonable approximations to Solomonoff induction are still infeasible because they suffer from exponential slowdowns. (Related: When does rationality-as-search have nontrivial implications?)
I’ve always thought of it like, it doesn’t rely on the universe being computable, just on the universe having a computable approximation. So if the universe is computable, SI does perfectly, if it’s not, SI does as well as any algorithm could hope to.
Yeah, I think that’s also a correct way of looking at it. However, I also think “hypotheses as reasoning methods” is a bit more intuitive.
When trying to predict what someone will say, it is hard to think “okay, what are the simplest models of the entire universe that have had decent predictive performance so far, and what do they predict now?”. Easier is “okay, what are the simplest ways to make predictions that have had decent predictive performance so far, and what do they predict now?”. (One such way to reason is with a model of the entire universe, so we don’t lose any generality this way.)
For example, if someone else is predicting things better than me, I should try to understand why. And you can vaguely understand this process in terms of Solomonoff induction. For example, it gives you a precise way to reason about whether you should copy the reasoning of people who win the lottery.
Paul Christiano speculated that the universal prior is in fact mostly just intelligences doing reasoning. Making an intelligence is simple after all: set up a simple cellular automaton that tends to develop lifeforms, wait 3^^^^3 years, and then look around. (See What does the universal prior actually look like? or the exposition at The Solomonoff Prior is Malign.)
SIs don’t engage in a wide variety of types of reasoning; it’s all variations on the same theme.
SI is limited compared to humans. It can’t include itself in a model, it can’t contemplate a non-Turing-computable world… and in many ways it’s limited to instrumentalism, to predicting the next observation. A human can state “suppose the world is non-computable”; how can that be expressed as a programme? Humans, despite being finite, can do all those things. An SI can test an infinite number of (instrumental) hypotheses, but they are all of the same type. It’s important not to confuse “infinite” and “every”: the set of multiples of 23 is infinite, but does not contain every number.
SI isn’t theoretically useful as a way of understanding human thought. Humans can’t brute-force search every possible hypothesis, and must be doing something more sophisticated instead to come up with good hypotheses.
The same way a human can? GPT-4 can state “suppose the world is non computable” for example.
But we are talking about SI. An SI isn’t making English statements. What is true of a GPT is not necessarily true of an SI.
The instructions in a programme executed by an SI have semantics related to programme operations, but not to the outside world, because that is true of all machine code. Machine code instructions do things like “add 1 to register A”. You would have to look at thousands or millions of such low-level instructions to infer what kind of high-level maths (vector spaces, or non-Euclidean geometry) the programme is executing.
And it’s hard to see how you would know with certainty that SI is describing an uncomputable or random universe. If it is using limited-precision floating-point calculations, is that an approximate representation of unlimited-precision real-number calculations taking place in the territory, or should it be taken literally? If it uses pseudo-random number generation, does it believe that there is real indeterminism in the territory? Human scientists are also limited in the kind of maths they can use, but, again, can communicate verbally what it is supposed to mean, how exact it is, and so on.
Not really. It’s superior to all algorithms running on a Turing machine, and is actually superior to an algorithm running on an accelerating Turing machine, because it has access to the complements of the recursively enumerable sets (it’s a first-level halting oracle), which is very nice. But compared to the most powerful computers/reasoning engines known to mathematics, it’s far less powerful. It’s optimal in the domain of computable universes, and with the resources available to a Solomonoff inductor, it can create a halting oracle, which lets it predict first-level uncomputable sequences like Chaitin’s constant, but not anything more. That is a major limitation compared to the current champion/optimal machines for reasoning.
So depending on how much compute we give the human in the form of intuition, it could very easily beat Solomonoff induction.
Do you have a proof/source for this? I haven’t heard it before.
I know in particular that it assigns a probability of 0 to Chaitin’s constant (because all the hypotheses are computable). Are you saying it can predict the prefixes of Chaitin’s constant better than random? I haven’t heard this claim either.
Hm, I might have confused what you are allowed to do if you had enough compute to run Solomonoff induction with Solomonoff induction itself, so that’s maybe the issue I had here.
If I wanted to create my own argument for why Solomonoff induction could do it, it’s that it’s essentially a halting oracle, which allows it to compute all the digits of Chaitin’s constant, given that it can compute the recursively enumerable sets in general. And since it can compute all the digits of Chaitin’s constant, it can basically read them like a book, and thus predict the sequence of digits.
Solomonoff induction is a specific probability distribution. It isn’t making “decisions” per se. It can’t notice that its existence implies that there is a halting oracle, and that it therefore can predict one. This is because, in general, Solomonoff induction is not embedded.
If there was a physical process for a halting oracle, that would be pretty sick because then we could just run Solomonoff induction. As shown in my post, we don’t need to worry that there might be an even better strategy in such a universe; the hypotheses of Solomonoff induction can take advantage of the halting oracle just as well as we can!
You do mention that the methods of reasoning have to be computable for this to work, and there I’m quite a bit more skeptical of that condition holding.
This post seems to be about justifying why Solomonoff induction (I presume) is a good way to make sense of the world. However, unless you use it as another name for “Bayesian reasoning”, it clearly isn’t. Rather, it’s a far-removed idealization. Nobody “uses” Solomonoff induction, except as an idealized model of how very idealized reasoning works. It has purely academic interest. The post sounds (maybe I’m misunderstanding this completely) like this weren’t the case, e.g., discussing what to do about a hypothetical oracle we might find.
In practice, humanity will only ever care about a finite number of probability distributions, each of which has finite support and will be updated a finite number of times using finite memory and finite processing power. (I guess one might debate this, but I personally usually can’t think of anything interesting to say in that debate.) As such, for all practical purposes, the solution to any question that has any practical relevance is computable. You could also put an arbitrary fixed but very large bound on the complexity of hypotheses, keeping all hypotheses we might ever discuss in the race, and thus make everything computable. This would change the model but make no difference at all in practice, since the size of hypotheses we may ever actually assess is minuscule. The reason why we discuss the infinite limit is because it’s easier to grasp (ironically) and prove things about.
Starting from this premise, what is this post telling me? It’s saying something about a certain idealized conception of reasoning. I can see how to transfer certain aspects of Solomonoff induction to the real world: you have Occam’s razor, you have Bayesian reasoning. Is there something to transfer here? Or did I misunderstand it completely, leading me to expect there to be something?
Of course, I’m not sure if this “focus on finite things and practicality” is the most useful way to think about it, and I’ve seen people argue otherwise elsewhere, but always very unconvincingly from my perspective. Perhaps someone here will convince me that computability should matter in practice, for some reasonable concept of practice?
An uncomputable universe doesn’t have to be a computable universe with an oracle bolted on. For instance, a universe containing an SI has to be uncomputable.
An uncomputable universe doesn’t have to be spooky. Classical physics isn’t exactly computable, because some systems are critically dependent on their initial conditions, and an exact specification of the initial conditions requires an infinite amount of information (an unlimited-precision real number). The supposedly weird theory of quantum mechanics is more computable than the supposedly intuitive theory of classical physics.
Sure, that’s just an example. But SI can be computed by an oracle machine, so it’s a sufficiently general example.
The claim that SI is at least as good as human reasoning is a relative claim. If human reasoning is limited or useless in absolute terms, so is SI.
Also , there is no proof that human reasoning would still be computable in a hypercomputational universe. We don’t have separate evidence about the world and the mind. That’s the computational circle.
We don’t use it, on the whole.
Rationalists like Yudkowsky make more extensive claims than the mainstream about SI, specifically that it is a formalised epistemology … in fact a complete formalised ontology, that tells you what reality is, capable of solving long standing philosophical problems, and not restricted to making predictions.
So they think SI actually is revealing the territory. In saying that it is only concerned with the map, you are going back to the relatively modest, mainstream view of SI.
“Solomonoff’s theory of inductive inference is a mathematical proof that if a universe is generated by an algorithm, then observations of that universe, encoded as a dataset, are best predicted by the smallest executable archive of that dataset”
The point of my post is to claim that this view is wrong. The hypotheses in Solomonoff Induction are best thought of as maps, which is a framing that usually isn’t considered (was I the first? 🤔).
If you know of arguments about why considering them to be territories is better, feel free to share them (or links)! (I need a more precise citation than “rationalists” if I’m going to look it up, lol.)
I don’t take the view that SI-is-about-the-territory is better … it’s more that what you are saying is locally novel but not globally novel.
There is another problem with Solomonoff Induction: overfitting. It only considers hypotheses which predict the past data perfectly, and predicts future data based only on those.
If a Solomonoff inductor (like AIXI) observes its environment with a digital camera, it will only consider predictions about the future made by hypotheses which perfectly predict the past camera data, including random noise.
But the noise is very irrelevant for what the future will most likely be. Maybe there is a much simpler hypothesis, which retrodicts the past data only approximately, but which makes much better predictions, because the most successful theories are those which trade-off simplicity of a hypothesis with its fit to the “training sample”, in this case, the data observed in the past.
It is often said that successful prediction has a lot to do with compression. But it has a lot to do with lossy compression, not lossless compression. Solomonoff Induction relies entirely on lossless compression, and so it overfits. When animals or humans make predictions based on past observation, they don’t normally pick a hypothesis which predicts the past observations perfectly, because this hypothesis would be unnecessarily complex and would make, consequently, probably bad predictions.
Solomonoff induction does not optimize for both simplicity and fit; it demands perfect fit first and only then optimizes for simplicity, and the resulting hypothesis might not be very simple at all.
Consider again the noisy input from the camera. The SI compression is like compressing all the camera frames as PNG, a lossless image format. The compression rate on video frames or photos is terrible with PNG, because photos contain a lot of noise and little ordered data which could be well compressed with a lossless algorithm. For photos or video frames, JPEG (a lossy format) gives a result which still has pretty good quality but is massively smaller.
So how much better would Solomonoff Induction be than PNG on noisy data like photos? I think not much better. SI would be much better than PNG on some images without noise, which contain e.g. something like a large Game of Life board. SI would figure out the simple GoL algorithm which generated the image, and PNG wouldn’t. But I see no reason why SI would be significantly better on noisy data.
Okay, maybe the noise and the whole universe is deterministic (some deterministic interpretation of QM, like Bohmian mechanics, is true). Then SI could perhaps deduce the laws of nature and the initial conditions of the universe with high confidence from some amount of video data, which would compress the past data incredibly well and which would give it excellent prediction capabilities, similar to Laplace’s demon.
Is this plausible? Even if the camera noise is perfectly determined by the laws and the initial conditions of the universe, it seems not clear whether SI could deduce them, since random noise is not the only sort of uncertainty. Another source is indexical uncertainty. Many different space time locations in the universe may be consistent with the available data, and different laws / initial conditions may make different predictions for different locations. So the space of possible laws / initial conditions may not be very strongly constrained by the available data. Then a system which trades off simplicity and fit could perhaps still beat Solomonoff prediction.
Moreover, even the deterministic Many Worlds interpretation of quantum mechanics would lead to unpredictable noise for the Solomonoff inductor, again because of indexical uncertainty: It doesn’t know which of the many worlds it is located in, so it must regard its location as random, which is equivalent to regarding the noise as random.
So it seems SI would most likely fail against conventional intelligences (which make a trade-off between fit and simplicity) in indeterministic or Many Worlds universes, because this leads to (real or pseudo) indeterministic noise which can’t be well compressed/predicted losslessly. And for deterministic Bohmian-mechanics-style universes, the success of SI is questionable because of indexical uncertainty about space-time location.
This is not a problem for Solomonoff induction, because a hypothesis of the form “simple lossy model, plus the residual noise encoded on top” is a better hypothesis than a more complicated model that tries to fit every bit of noise directly.
So Solomonoff induction still does as well as a human’s ontology. Solomonoff induction tries to compress everything it can, including the patterns humans care about, even if other parts of the data can’t be compressed.
There is a precise trade-off involved. If you make a lossy fit better, you lose bits based on how much more complicated it is, but you gain bits in that you no longer need to hardcode explanations for the errors. If those errors are truly random, you might as well stick with your lossy fit (and Solomonoff induction does this).
I don’t see how these claims refute anything I said. You could probably use a similar argument to justify overfitting in general. A model which overfits doesn’t care at all about approximate fit; it cares only about perfect fit. And two hypotheses with perfect fit on the training data could have wildly different approximate fit to reality in their predictions, while as exact predictions they are equally bad. Then Solomonoff induction wouldn’t care at all about picking the one with better approximate predictions!
Just think about predictions of actual scientific theories: We know in advance that these theories are all strictly speaking wrong, since they are simplifications of reality (so they would be equally bad for Solomonoff), but one theory could be closer to the truth, a much better approximation, than the other. While the probability of being precisely correct could still be equal (equally low) for both theories.
That A is a better approximate prediction than B doesn’t imply that A is more likely true than B. In fact, B could (and probably would, under Solomonoff Induction) contain a lot of made-up fake precision, which would give it at least a chance of being precisely true, in contrast to A, which can fit reality only ever imperfectly. Then B would be more likely true than A, but far less similar, in expectation, to reality.
Larger errors literally take more bits to describe. For example, in binary, 3 is 11₂ and 10 is 1010₂ (twice the bits).
Say that you have two hypotheses, A and B, such that A is 100 bits more complicated than B but 5% closer to the true value. This means for each sample, the error in B on average takes log₂(1.05) = 0.07 bits more to describe than the error in A.
After about 1,430 samples, A and B will be considered equally likely. After about 95 more samples, A will be considered 100 times more likely than B.
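For what it’s worth, here is a quick back-of-the-envelope check of those numbers (a sketch; the 100 bits and 5% are just the hypothetical constants above):

```python
# How many samples until hypothesis A (100 extra description bits, but errors
# that are 5% cheaper to encode per sample) overtakes hypothesis B?
import math

extra_description_bits = 100
bits_saved_per_sample = math.log2(1.05)   # ~0.07 bits per sample

break_even = extra_description_bits / bits_saved_per_sample
hundred_times = math.log2(100) / bits_saved_per_sample

print(round(break_even))      # ~1421 samples to break even ("about 1,430" uses the rounded 0.07)
print(round(hundred_times))   # ~94 further samples until A is about 100x more likely
```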
In general, if f(x) is some high level summary of important information in x, Solomonoff induction that only tries to predict x is also universal for predicting f(x) (and it even has the same or better upper-bounds).
Solomonoff induction assigning 100% probability to the universe being computable isn’t actually, by itself, logically contradictory with observing a universe that is uncomputable.
Now, to address the post’s claims:
So to answer the question: the answer is basically yes, because humans have a finite, fixed memory and a finite speed of computation, meaning any reasoning strategy a human uses can trivially be computed by a Turing machine.
In general, the Church-Turing thesis, taken as saying that the set of reasoning strategies/computers is equal to the set of Turing machines, is false: the set of computers/reasoning strategies is way, way bigger than the set of Turing machines.
So either the Church-Turing thesis is tautologically true, or if we wanted to generalize it is simply false.
So Solomonoff induction, as it’s usually considered, would beat any strategy that’s computable, or any recursively enumerable strategy that doesn’t include the complement of the set. But compared to the set of all computers/strategies we currently know of, Solomonoff induction doesn’t do very well: it’s only on the first floor of halting oracles, or equivalently at the second level of the arithmetical hierarchy, and compared to the vastness of reasoning and computation, it’s still way, way worse than the current optimal reasoners/optimal computers in math. It’s superior to the accelerating Turing machine, an uncomputable machine, and superior to all other computable strategies, but it is inferior to the Malament-Hogarth machine, possibly to an infinite-time Turing machine depending on how many time steps we allow, and to the 3 equivalent current champions of reasoning machines/computers:
The infinite-state Turing machine.
The Blum-Shub-Smale machine over the reals.
A recurrent neural network with real valued weights with exponential time to compute.
Solomonoff induction isn’t ideal, compared to the current champions of computation/reasoning engines.
Does this Chaitin’s constant differ for different Turing machines and their instruction sets?