In order to apply the logical prior to sequence induction, we will add an empirical predicate to it. We define the modified Demski prior P(ϕ) as in this post, but with ϕ in the language of PA plus one additional predicate, B(n), which is true or false according to the bits in the bit-sequence we’re trying to predict. We wish to compare this to the universal semimeasure, M. For b a finite or infinite binary sequence, M(b) is defined as:
Σp:U(p)=b∗2−|p|
where U is a universal machine, and p is a minimal program such that the output of U is equal to or an extension of b. Let P be the probability distribution on binary sequences which results from P via encoding such sequences into logic as conjunctions of B(n) and ¬B(n).
Definition 1. Consider two probability distributions f and g defined on the same σ-algebra Σ. f dominates g iff ∃α∈R such that ∀x∈Σ we have αf(x)≥g(x). The dominance is strict if g does not also dominate f.
This is similar to the definition of (multiplicative) dominance in Li & Vitanyi’s An Introduction to Kolmogorov Complexity and Its Applications; however, here we consider x from the entire σ-algebra where they did not. This matters especially for Theorem 2.
Theorem 1.P is dominant over M.
Proof. For a constant complexity penalty l, we can encode a sentence-enumerator which simulates feeding random bits to the universal machine used by the universal semimeasure and asserts the sentences B(n) or ¬B(n) encoding the bit-sequence output by it. All of these hypotheses will be consistent with PA. Therefore, P(b) will be at least 2−lM(b). □
Theorem 2. The dominance from Theorem 1 is strict.
Proof. A hypothesis with positive probability in P is to encode a truth predicate for PA in B(n), as follows. Let ┌ϕ┐ be the Goedel-number of sentences ϕ in the language of PA. We assert the T-schema, ϕ↔B(┌ϕ┐), for each ϕ in the language of PA. (We do not assert this for the sentences including the predicate B().) This hypothesis forces the binary sequence to encode a complete and consistent extension of PA. As a result, P will assign positive probability to this. (Each such extension will have zero measure, but the set of such extensions will have nonzero measure.) M assigns probability zero. No finite-length program can provide a complete and consistent extension of PA, by the first incompleteness theorem. Therefore, any positive contribution to the probability must come from infinite-length programs. This is the same as saying that there exists a stochastic Turing machine with a nonzero probability of outputting a complete and consistent extension of PA. This is impossible by a result of Antonin Kucera; see here. □
Discussion
The purpose of strict dominance is to show that the hypothesis space of the modified Demski prior is in some sense larger than that of Solomonoff induction, as a result of its ability to employ logic. This comes close to issues like what an agent does if it encounters a halting oracle in the wild. (Not very close, though.)
Defining dominance over the entire σ-algebra allows us to talk about how P assigns positive measure to a class of sentences no one of which gets nonzero measure individually. An analogy would be the comparison between a distribution Q which assigns a positive probability to all rational numbers between 0 and 1, and the mixture of Q and the uniform distribution between 0 and 1. The 2nd does not assign nonzero probability to any new individual numbers, but assigns nonzero probability to the real numbers as a set while Q does not.
I think this is a very legitimate move, but only found definitions for dominance which quantified over binary strings. It might be possible to prove strict dominance for that type of dominance, but I doubt it; it would require finding a single sequence which has nonzero weight in P and zero weight in M.
Compared to my definition, the more standard definition is closer to a notion of bounded regret. For P to dominate Q in the old sense means that its Bayes score can only be some constant worse than Q, no worse, regardless of environment. My definition here implies that, but I think it’s a stricter requirement.
This framework applies the modified Demski prior to empirical uncertainty, by adding a new predicate B() to PA. What’s more interesting for logical uncertainty is predicting a sentence ϕ(n) which is already definable in PA. Two informal conjectures:
Informal conjecture 1. “Natural” approximations of the modified Demski prior on PA fail to approximate ϕ at level 1 logical uncertainty (where ϕ(n) is computable, and the time bound is enough to compute all previous values before guessing at the current). Allowing programs to guess on only the sentences they’re interested in overly incentivises them to remain silent; this is the problem which is solved by the subsequence induction algorithm (which consequently solves Level 1). Paul Christiano pointed out that this setting is called “sleeping experts” and subsequence induction looks a lot like the standard solution.
Informal conjecture 2. The modified Demski prior can succeed at Level 1 by performing a Bayesian update on things PA can prove, rather than accepting the axioms of PA as axioms. This is because a Bayesian update implements a correct solution to the sleeping experts problem, while the Demski procedure of kicking out inconsistent hypotheses does not. Doing a Bayesian update on our set of hypotheses when we prove a new ϕ(n) implements something closer to subsequence induction, and so will be able to learn properly (at Level 1 difficulty, that is). This is similar to Christiano’s paradox of ignorance.
Strict Dominance for the Modified Demski Prior
I previously noted that difficulties with the modified Demski prior were less than previously thought: I was failing to distinguish between the two different kinds of update. Here, I analyze more fully what it succeeds at.
In order to apply the logical prior to sequence induction, we will add an empirical predicate to it. We define the modified Demski prior P(ϕ) as in this post, but with ϕ in the language of PA plus one additional predicate, B(n), which is true or false according to the bits in the bit-sequence we’re trying to predict. We wish to compare this to the universal semimeasure, M. For b a finite or infinite binary sequence, M(b) is defined as: Σp:U(p)=b∗2−|p| where U is a universal machine, and p is a minimal program such that the output of U is equal to or an extension of b. Let P be the probability distribution on binary sequences which results from P via encoding such sequences into logic as conjunctions of B(n) and ¬B(n).
Definition 1. Consider two probability distributions f and g defined on the same σ-algebra Σ. f dominates g iff ∃α∈R such that ∀x∈Σ we have αf(x)≥g(x). The dominance is strict if g does not also dominate f.
This is similar to the definition of (multiplicative) dominance in Li & Vitanyi’s An Introduction to Kolmogorov Complexity and Its Applications; however, here we consider x from the entire σ-algebra where they did not. This matters especially for Theorem 2.
Theorem 1.P is dominant over M.
Proof. For a constant complexity penalty l, we can encode a sentence-enumerator which simulates feeding random bits to the universal machine used by the universal semimeasure and asserts the sentences B(n) or ¬B(n) encoding the bit-sequence output by it. All of these hypotheses will be consistent with PA. Therefore, P(b) will be at least 2−lM(b). □
Theorem 2. The dominance from Theorem 1 is strict.
Proof. A hypothesis with positive probability in P is to encode a truth predicate for PA in B(n), as follows. Let ┌ϕ┐ be the Goedel-number of sentences ϕ in the language of PA. We assert the T-schema, ϕ↔B(┌ϕ┐), for each ϕ in the language of PA. (We do not assert this for the sentences including the predicate B().) This hypothesis forces the binary sequence to encode a complete and consistent extension of PA. As a result, P will assign positive probability to this. (Each such extension will have zero measure, but the set of such extensions will have nonzero measure.) M assigns probability zero. No finite-length program can provide a complete and consistent extension of PA, by the first incompleteness theorem. Therefore, any positive contribution to the probability must come from infinite-length programs. This is the same as saying that there exists a stochastic Turing machine with a nonzero probability of outputting a complete and consistent extension of PA. This is impossible by a result of Antonin Kucera; see here. □
Discussion
The purpose of strict dominance is to show that the hypothesis space of the modified Demski prior is in some sense larger than that of Solomonoff induction, as a result of its ability to employ logic. This comes close to issues like what an agent does if it encounters a halting oracle in the wild. (Not very close, though.)
Defining dominance over the entire σ-algebra allows us to talk about how P assigns positive measure to a class of sentences no one of which gets nonzero measure individually. An analogy would be the comparison between a distribution Q which assigns a positive probability to all rational numbers between 0 and 1, and the mixture of Q and the uniform distribution between 0 and 1. The 2nd does not assign nonzero probability to any new individual numbers, but assigns nonzero probability to the real numbers as a set while Q does not.
I think this is a very legitimate move, but only found definitions for dominance which quantified over binary strings. It might be possible to prove strict dominance for that type of dominance, but I doubt it; it would require finding a single sequence which has nonzero weight in P and zero weight in M.
Compared to my definition, the more standard definition is closer to a notion of bounded regret. For P to dominate Q in the old sense means that its Bayes score can only be some constant worse than Q, no worse, regardless of environment. My definition here implies that, but I think it’s a stricter requirement.
This framework applies the modified Demski prior to empirical uncertainty, by adding a new predicate B() to PA. What’s more interesting for logical uncertainty is predicting a sentence ϕ(n) which is already definable in PA. Two informal conjectures:
Informal conjecture 1. “Natural” approximations of the modified Demski prior on PA fail to approximate ϕ at level 1 logical uncertainty (where ϕ(n) is computable, and the time bound is enough to compute all previous values before guessing at the current). Allowing programs to guess on only the sentences they’re interested in overly incentivises them to remain silent; this is the problem which is solved by the subsequence induction algorithm (which consequently solves Level 1). Paul Christiano pointed out that this setting is called “sleeping experts” and subsequence induction looks a lot like the standard solution.
Informal conjecture 2. The modified Demski prior can succeed at Level 1 by performing a Bayesian update on things PA can prove, rather than accepting the axioms of PA as axioms. This is because a Bayesian update implements a correct solution to the sleeping experts problem, while the Demski procedure of kicking out inconsistent hypotheses does not. Doing a Bayesian update on our set of hypotheses when we prove a new ϕ(n) implements something closer to subsequence induction, and so will be able to learn properly (at Level 1 difficulty, that is). This is similar to Christiano’s paradox of ignorance.