Solomonoff induction is an ideal answer to questions like “What probably comes next in the sequence 1, 1, 2, 3, 5, 8?” or “Given the last three years of visual data from this webcam, what will this robot probably see next?” or “Will the sun rise tomorrow?” Solomonoff induction requires infinite computing power, and is defined by taking every computable algorithm that gives a probability distribution over future data given past data, weighting each algorithm by its algorithmic simplicity, and updating those weights by comparison with the actual data.
E.g., somewhere in the ideal Solomonoff distribution is an exact copy of you, right now, staring at a string of 1s and 0s and trying to predict what comes next—though this copy of you starts out with a very low weight in the mixture owing to its complexity. Since a copy of you is present in this mixture of computable predictors, we can prove a theorem about how well Solomonoff induction does compared to an exact copy of you; namely, Solomonoff induction commits only a bounded amount of error relative to you, or any other computable way of making predictions. Solomonoff induction is thus a kind of perfect or rational ideal for probabilistically predicting sequences, although it cannot be implemented in reality due to requiring infinite computing power. Still, considering Solomonoff induction can give us important insights into how non-ideal reasoning should operate in the real world.
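The mixture-and-update structure can be made concrete in miniature. The sketch below is only a caricature: a tiny hand-picked set of predictors with made-up code lengths stands in for the real (uncomputable) enumeration of all programs, and the predictor names and complexities are purely illustrative.

```python
# Toy sketch of a Solomonoff-style mixture: a few hypothetical predictors
# stand in for the uncomputable enumeration of *all* computable predictors.
from fractions import Fraction

# Each "program" maps the bits seen so far to P(next bit = 1).
predictors = {
    "always-half": lambda history: Fraction(1, 2),
    "mostly-ones": lambda history: Fraction(9, 10),
    "repeat-last": lambda history: (Fraction(9, 10) if history and history[-1] == 1
                                    else Fraction(1, 10)),
}

# Prior weight 2^-K, where K is a made-up code length for each predictor.
code_length = {"always-half": 2, "mostly-ones": 3, "repeat-last": 5}
weights = {name: Fraction(1, 2 ** k) for name, k in code_length.items()}

def predict(history):
    """Mixture probability that the next bit is 1."""
    total = sum(weights.values())
    return sum(w * predictors[name](history) for name, w in weights.items()) / total

def update(history, bit):
    """Bayes-update each predictor's weight on the observed bit."""
    for name in weights:
        p1 = predictors[name](history)
        weights[name] *= p1 if bit == 1 else 1 - p1

# After observing a run of 1s, the mixture's prediction is dominated by the
# predictors that kept predicting 1, however small their prior weight was.
history = []
for _ in range(20):
    update(history, 1)
    history.append(1)
```

Because a copy of any computable predictor sits somewhere in the real mixture, its posterior weight can only be eroded by the errors it actually makes; the bounded-error guarantee described above is this effect at full scale.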
Additional reading:
Suppose you have a program P that outputs a probability distribution over the next possible bit. With a fairly small constant amount of extra code, you can make an algorithm P’ that takes in an arbitrary bitstring and decompresses it according to P. In other words, there are data compression algorithms that can use the probabilities to compress the environmental sequence. Now consider the set of programs of the form P’(b) for some bitstring b. The number of extra bits needed to encode b is the same as the number of bits P would lose by not being able to predict every step perfectly. In other words, the set of all deterministic programs P’(b) combines to give basically the same probability distribution as P (up to constant factors that depend on the choice of universal Turing machine).
In short, you can compress the random noise in the environment and put it in your predictive program. This gives a deterministic program, and it costs only a constant number of bits more than admitting your uncertainty.
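The bookkeeping behind that claim is the standard correspondence between prediction and compression: an arithmetic coder driven by P spends −log₂ p bits on a bit to which P assigned probability p. A minimal sketch, using a hypothetical Laplace-rule predictor in place of P (the names `p_next_is_one` and `code_length_bits` are mine, not from the original discussion):

```python
import math

def p_next_is_one(history):
    # Hypothetical predictor: Laplace's rule of succession on the bits so far.
    return (sum(history) + 1) / (len(history) + 2)

def code_length_bits(sequence):
    """Bits an arithmetic coder driven by the predictor needs for the sequence.

    This total is exactly the predictor's cumulative log loss, i.e. the bits
    "lost" by not predicting each step perfectly; the compressed string b fed
    to P'(b) has this length, up to rounding and a constant.
    """
    total, history = 0.0, []
    for bit in sequence:
        p1 = p_next_is_one(history)
        total += -math.log2(p1 if bit == 1 else 1 - p1)
        history.append(bit)
    return total
```

On a run of eight zeros this predictor pays log₂ 9 ≈ 3.17 bits rather than 8, so the corresponding deterministic program P’(b) only needs that much appended "noise."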
“diffraction” should be “refraction” throughout
The Solomonoff prior is not computed from the space each hypothesis needs to compute its answer; it is computed from the hypothesis’s program length, the number of bits needed to pass the hypothesis to our UTM!
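To illustrate the distinction with a crude, hypothetical encoding (taking a program's description length to be just the bit length of its source text, rather than a real UTM encoding): the prior weight depends only on how many bits it takes to hand the program to the UTM, never on the time or memory the program consumes when run.

```python
from fractions import Fraction

def prior_weight(program_source: str) -> Fraction:
    """Weight 2^-|p| from description length alone.

    |p| here is a crude proxy (bits of UTF-8 source); the resources the
    program uses while *running* play no role in the prior.
    """
    length_in_bits = 8 * len(program_source.encode("utf-8"))
    return Fraction(1, 2 ** length_in_bits)
```

A short but astronomically slow program can thus out-weigh a long but fast one in the prior.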
Any other form of induction one could try, such as the space or time complexity of evaluating a hypothesis that you described, other machine models that Turing machines can simulate, or human intuition, is itself included in Solomonoff induction as a hypothesis, and will come to dominate it after only a finite amount of error if it predicts better.
How would you even compute each program’s space/time complexity? Rice’s theorem says it cannot be done in general.
Link to a description of the Library of Babel?
I imagine Ashley Watson leaning against the fourth wall as she delivers this line with an air of knowing satisfaction at how well she’s fulfilling the purpose of her hypothetical existence.
Do we have (or need) any empirical evidence that algorithmic simplicity (space) is the ideal and ultimate absolute prior? Rereading the article carefully, I see it doesn’t quite advocate this, but I think it’s very easy for a learner to pick up that misconception from the way AIXI is generally discussed, with its assumption that we somehow know space complexity is the final answer, and I wonder if it should be ruled out or qualified here.
(I believe it’s easy to pick up that misconception because it happened to me. I later came to realize I probably wouldn’t bet at good odds that space AIXI won’t ever be dominated by a variant AIXI made with some other razor: the speed of generating environment Turing machines, for instance, instead of space. Or how about inverse cyclomatic complexity, or some more general measure of algorithm quality that humans thus far have lacked the breadth of mind to find, test, or work with mathematically? Or maybe even just the space complexity of some other machine model? Space complexity of TMs seems like extremely low-hanging fruit.)
I’m hoping to hear that someone has done the work of compiling some ridiculously extensive dataset of algorithms and their competitors, and demonstrated that space complexity was more predictive of domination than any of the other general metrics we could think of. If this result had been found, nobody ever seems to cite it. Though I suppose we might not hear about it when so many people find it completely intuitive.
Looks like there is a typo with the fraction. sense 1...n . 1?
I tweaked this to make it a bit clearer while also cleaning up the LaTeX (I think it’s standard to use \mathrm on multi-letter variable names, to disambiguate them from the product of many single-letter variables, etc.), but I still think that parts of the equation (such as InterpretProb) need types and explanations. (I think you also want to explain how it handles non-halting programs, which requires renormalization in the Solomonoff induction case, unless you want to talk about the semimeasure instead.)
I was trying to say “append 1 to previous sequence”. I guess it needs explanation.