(2) aren’t the continuation of another such program.
This is an imprecise way of stating it. Solomonoff induction requires that the programming language be prefix-free: its syntax must make it impossible for one valid program to be a continuation (extension) of another valid program. This is required in order to make sure that the prior is a probability distribution over all programs, up to a normalization constant (the inverse of Chaitin's Omega).
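For reference, a minimal sketch of the standard construction (the notation $U$, $p$, $M$, $\Omega$ is the usual one from the algorithmic information theory literature, not from the post itself): the prior weight of a string $x$ is

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|},
\qquad
\sum_{p\ \text{valid}} 2^{-|p|} \;\le\; 1
\quad \text{(Kraft inequality, prefix-free code)},
\qquad
\Omega \;=\; \sum_{p \,:\, U(p)\ \text{halts}} 2^{-|p|} \;<\; 1 .
```

Without the prefix-free restriction the middle sum diverges (there are $2^n$ programs of length $n$, each weighted $2^{-n}$), so there would be no finite normalization constant at all.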
The simplest answer is that we have a frequentist guarantee that if the “true program” generating our input has some length N (that is, if the observable universe is a big but finite-sized computer), then our predictions will only be wrong a limited number of times, and after that we’ll predict correctly every time.
I’m not sure what you mean by “frequentist” in this context, but the theory of Solomonoff induction makes no assumption that there exists a “true program” which deterministically generates the data string. It only assumes that the conditional probability distribution of each bit given all the previous ones is computable.
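To make the actual assumption concrete: Solomonoff’s completeness theorem (see e.g. Hutter’s presentations of it) only requires that the data be sampled from some computable measure $\mu$, and then bounds the total expected squared prediction error of the universal predictor $M$ by the complexity of $\mu$, roughly:

```latex
\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\!\left[
\bigl( M(x_t = 1 \mid x_{<t}) - \mu(x_t = 1 \mid x_{<t}) \bigr)^{2}
\right]
\;\lesssim\; \frac{\ln 2}{2}\, K(\mu).
```

No deterministic “true program” is needed; a stochastic but computable source is enough.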
The trouble with using Solomonoff induction in real life is that to pick out which programs output our data so far, we need to run every program—and if the program doesn’t ever halt, we need to use a halting oracle to stop it or else we’ll take infinite time.
Even if we had a halting oracle, we would still have to consider infinitely many programs. A time limit avoids both the halting problem and the problem that the set of candidate programs is infinite.
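A minimal sketch of what a time- and length-bounded approximation could look like. Everything here is hypothetical: `run` stands in for a real prefix-free universal interpreter, and in this toy version it simply echoes the program bits so that the script actually executes.

```python
# Sketch: approximate the Solomonoff prior by enumerating programs up to a
# maximum length and running each one for a bounded number of steps.
from typing import Optional
import itertools

def run(program: str, max_steps: int) -> Optional[str]:
    """Placeholder for a real prefix-free universal interpreter.
    Toy behaviour: the program's own bits are its output, truncated to
    max_steps bits, purely so the sketch is runnable."""
    return program[:max_steps]

def bounded_prediction(data: str, max_len: int, max_steps: int) -> float:
    """Estimate P(next bit = 1 | data) by weighting every program of length
    <= max_len whose (time-bounded) output extends `data` by 2^(-length)."""
    weight_one = 0.0
    weight_total = 0.0
    for length in range(1, max_len + 1):
        for bits in itertools.product("01", repeat=length):
            program = "".join(bits)
            output = run(program, max_steps)
            if output is None or not output.startswith(data):
                continue
            if len(output) <= len(data):
                continue  # program says nothing about the next bit
            w = 2.0 ** (-length)
            weight_total += w
            if output[len(data)] == "1":
                weight_one += w
    return weight_one / weight_total if weight_total > 0 else 0.5

print(bounded_prediction("0101", max_len=8, max_steps=16))
```

Bounding both the program length and the running time keeps the search finite; the price is that the result only approximates the true prior.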
If we just take Solomonoff induction and put restrictions on it, our predictions will still only come from hypotheses that exactly reproduce our starting data. This is a problem.
A caveat here: full Solomonoff induction doesn’t overfit. Since it has infinite modelling resources, it can spend them to learn every property of the observed string exactly while maintaining maximum predictive power. A resource-limited version of Solomonoff induction, like any resource-limited machine learning approach, has only finite modelling resources, and spending them on learning irrelevant properties of the observed data makes it less able to predict the relevant ones.
If there’s an easy way to combine machine learning with Solomonoff induction, it’s Bayes’ theorem. The machine learning was focused on driving down P(training data | chance), and picking a hypothesis with a high P(training data | hypothesis, noise model). We might want to use Bayes’ rule to say something like P(hypothesis | training data, noise model) = P(hypothesis | complexity prior) * P(training data | hypothesis, noise model) / P(training data | noise model).
I don’t see the relevance to Solomonoff induction. Most machine learning methods have explicit or implicit noise models and ways to control hypothesis complexity. For instance, L2-regularized linear regression performs maximum a posteriori estimation under the assumption of an independent Gaussian prior on the model parameters and independent Gaussian noise on the observations: http://en.wikipedia.org/wiki/Tikhonov_regularization#Bayesian_interpretation
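A short sketch of the correspondence cited above. Assumptions (standard, but not spelled out in the comment): a Gaussian prior w ~ N(0, tau^2 I) on the weights and Gaussian observation noise y = Xw + eps with eps ~ N(0, sigma^2 I); the MAP estimate is then exactly the ridge solution with lambda = sigma^2 / tau^2.

```python
# L2-regularized (ridge) regression as MAP estimation under Gaussian
# prior and Gaussian noise.
import numpy as np

def ridge_map(X: np.ndarray, y: np.ndarray, sigma2: float, tau2: float) -> np.ndarray:
    """MAP estimate of the weights:
    argmin_w ||y - X w||^2 / sigma2 + ||w||^2 / tau2
    = (X^T X + (sigma2 / tau2) I)^{-1} X^T y."""
    lam = sigma2 / tau2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(ridge_map(X, y, sigma2=0.01, tau2=1.0))
```

The regularization strength is not an ad-hoc knob here: it falls directly out of the ratio of the noise variance to the prior variance, which is the sense in which the complexity control is implicitly Bayesian.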