A Problem with the Modified Demski Prior
The basic idea of the modified Demski prior is to let Turing machines enumerate sentences, using a Solomonoff-style prior. As such, we might hope that this prior would have the good properties of Solomonoff induction when applied to an empirical problem. This is not the case. The argument here comes out of discussions with Scott. (What follows is not a proof, but only a proof sketch with a hole at the end.)
The modified Demski prior is defined as follows. Interpret a universal machine U as enumerating sentences (giving us sentence codes with a special separator code between them, possibly forever). Define a distribution over such enumerations by feeding the machine random bits. Starting with a base theory (our axioms), keep extending it by drawing new theories T from the distribution and keeping each draw which is consistent with the axioms and with what we’ve kept so far. The limit of this process is one random extension of the base theory.
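For concreteness, here is a minimal runnable sketch of this sampling process in Python, with toy stand-ins for the parts that can’t actually be computed: sentences are strings of the form "S(n)" or "~S(n)", enumerate_sentences is a dummy in place of U, and consistent only checks for a sentence together with its negation (real first-order consistency is undecidable, so a genuine implementation could at best approximate it). All names here are mine and hypothetical.

```python
import random

def enumerate_sentences(bits):
    """Toy stand-in for U interpreted as a sentence enumerator."""
    sentences = []
    for i in range(0, len(bits) - 4, 5):
        # decode 4 bits into an index n and 1 bit into a sign
        n = bits[i] * 8 + bits[i + 1] * 4 + bits[i + 2] * 2 + bits[i + 3]
        sign = "~" if bits[i + 4] else ""
        sentences.append(f"{sign}S({n})")
    return sentences

def consistent(sentences):
    """Toy consistency check: no sentence appears together with its negation."""
    return not any(("~" + s) in sentences for s in sentences if not s.startswith("~"))

def sample_extension(axioms, num_draws=100):
    """Approximate one random extension of the base theory by repeated draws."""
    kept = set(axioms)
    for _ in range(num_draws):
        bits = [random.randint(0, 1) for _ in range(50)]
        draw = set(enumerate_sentences(bits))   # one random enumeration from the toy "U"
        if consistent(kept | draw):             # keep only draws compatible with what we have
            kept |= draw
    return kept

print(sorted(sample_extension(axioms=set())))
```

The real construction differs in that the enumerations may be infinite and the consistency check is only ever partially decidable; the sketch is just meant to pin down the shape of the sampling loop.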
Consider the case of first-order logic with the axioms of PA, and an infinite stock of predicate symbols. (We really need only one additional predicate symbol, however.)
Suppose we’ve represented the problem as sequence induction, as we would for the Solomonoff distribution: we have a sequence S_1, S_2, … of 0s and 1s which we want to predict. Encode this with a distinguished predicate S(n), not used for anything else: the nth bit is 1 exactly when S(n) holds (so if the sequence begins 1, 0, 1, we update on S(1), ¬S(2), S(3)).
Suppose that S_1, S_2, … encodes a stochastic problem, so that no deterministic machine predicts the sequence with 100% accuracy. Let’s say it encodes flips of a biased coin which lands on True 30% of the time.
Consider only the first draw from U, and furthermore, only consider the cases where U enumerates sentences of the form S(n) or ¬S(n) exclusively. Because U is enumerating sentences rather than deciding sentences in sequence, it can skip as many sentences as it likes. Suppose that we’ve updated on N−1 sentences so far. For the Nth sentence, we make a prediction using a mixture distribution over all the machines T which haven’t been wrong yet, weighted by their probability of coming from U. The portion of the mixture consisting of machines which have made predictions for all S(n) so far is likely to be quite good, predicting very close to .3 on True for sufficiently large N. Those which have predicted few or none of them, however, may be quite bad. The problem is that these will have more weight, asymptotically. The argument is as follows:
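As a rough formalization of this mixture (the notation here is mine, not part of the original construction): write w(T) for the probability that U produces the enumeration T, and C_{N−1} for the set of enumerations whose assertions so far agree with the observed values. Then the prediction for the Nth sentence looks something like

```latex
P\bigl(S(N)\bigr) \;\approx\;
  \frac{\sum_{T \in C_{N-1},\; T \text{ asserts } S(N)} w(T)}
       {\sum_{T \in C_{N-1},\; T \text{ asserts } S(N) \text{ or } \neg S(N)} w(T)}
```

Machines which have stayed silent up to step N enter this ratio with essentially their full prior weight, while the all-time predictors enter only with whatever weight has survived the first N−1 observations.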
It is sufficient to consider the non-normalized version of the distribution and compare the relative mass of the different components. The mass of the machines which make every prediction shrinks by a constant factor per step (on average), since a fixed fraction of them will be getting things wrong at each step. On the other hand, consider the class of machines which hold off making predictions. Some of these machines will take the following form: “Sample from U, but discard any sentences S(n) for n<X.” The probability of a particular X is based on its description complexity, k_X. At step N, the mass from such machines with X=N will be proportional to 2^(-k_N). This shrinks much more slowly than the contribution of the regular predictors. (In fact, it shrinks more slowly than any computable probability distribution over N, since it has a component for each such distribution.)
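To make the comparison concrete, here is a small numerical illustration in Python. The particular constants are assumptions chosen for the example, not part of the argument: a deterministic all-time predictor survives each flip with probability at most 0.7 against a 30/70 coin, so the expected surviving mass of that class decays at least geometrically; and the standard bound k_N ≤ log2(N) + 2·log2(log2(N)) + c (taking c = 10 here) gives the first-time predictors at step N a weight of at least 2^(-c) / (N · log2(N)^2).

```python
import math

C = 10  # assumed constant in the complexity bound k_N <= log2(N) + 2*log2(log2(N)) + C

def all_time_mass(n):
    """Upper bound on the expected surviving mass of deterministic all-time predictors."""
    return 0.7 ** n

def first_time_mass(n):
    """Lower bound on the prior mass of machines that first predict at step n."""
    return 2 ** (-C) / (n * math.log2(n) ** 2)

for n in (10, 50, 100, 1000):
    print(f"N={n:5d}  all-time <= {all_time_mass(n):.3e}  first-time >= {first_time_mass(n):.3e}")
```

Under these assumptions the first-time contribution overtakes the all-time contribution somewhere between N = 10 and N = 50, and the gap only widens from there.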
This establishes that there’s a component of first-time predictors which contributes more mass to the probability mixture than the all-time predictors do as we update on more examples. This strongly suggests that we’ll fail to learn. Of course, there are many varieties of machines other than just the two we considered. Recall that we restricted attention to the first draw, as well. We can deal with some of these issues.
Restricting attention to the first draw was really not very important. The subsequent draws only fill in sentences which the first draw hasn’t decided. This means we still have the same (relatively large) probability of drawing a first-time predictor for a particular S(N) in our first draw; it’s just that we will then fill in the sentences it remains silent on with subsequent draws. (And since we’re drawing from the same distribution, these will be equally dysfunctional.)
Considering machines which only output sentences of the form S(n) or ¬S(n) was also not very important. The argument actually holds for the more general case: we could have talked about sampling theories which imply all of the observed S(n) values so far, as opposed to none of them. This is just a bit more cumbersome, especially since we need to keep sampling from U rather than stopping at the first draw in order to guarantee that the sampled theory is closed under logical implication.
As for machines which make a prediction for something other than all of the sentences we’ve updated on or none of them, this is a more serious complication. It’s unclear to me how to show that these don’t save the day. However, it seems very unlikely that they do. Any group of machines which is trying to learn must expose predictions in order to knock out bad predictors. It seems the only way to overpower the group of first-time predictors would be to make predictions even more sporadically.
Although not examined here, I think a similar argument would apply to the original version of the Demski prior.