Strict Dominance for the Modified Demski Prior
I previously noted that difficulties with the modified Demski prior were less than previously thought: I was failing to distinguish between the two different kinds of update. Here, I analyze more fully what it succeeds at.
In order to apply the logical prior to sequence induction, we will add an empirical predicate to it. We define the modified Demski prior as in this post, but with in the language of PA plus one additional predicate, , which is true or false according to the bits in the bit-sequence we’re trying to predict. We wish to compare this to the universal semimeasure, . For a finite or infinite binary sequence, is defined as: where is a universal machine, and is a minimal program such that the output of is equal to or an extension of . Let be the probability distribution on binary sequences which results from via encoding such sequences into logic as conjunctions of and .
Definition 1. Consider two probability distributions and defined on the same -algebra . dominates iff such that we have . The dominance is strict if does not also dominate .
This is similar to the definition of (multiplicative) dominance in Li & Vitanyi’s An Introduction to Kolmogorov Complexity and Its Applications; however, here we consider from the entire -algebra where they did not. This matters especially for Theorem 2.
Theorem 1. is dominant over .
Proof. For a constant complexity penalty , we can encode a sentence-enumerator which simulates feeding random bits to the universal machine used by the universal semimeasure and asserts the sentences or encoding the bit-sequence output by it. All of these hypotheses will be consistent with . Therefore, P(b) will be at least .
Theorem 2. The dominance from Theorem 1 is strict.
Proof. A hypothesis with positive probability in is to encode a truth predicate for in , as follows. Let be the Goedel-number of sentences in the language of . We assert the T-schema, , for each in the language of . (We do not assert this for the sentences including the predicate .) This hypothesis forces the binary sequence to encode a complete and consistent extension of . As a result, will assign positive probability to this. (Each such extension will have zero measure, but the set of such extensions will have nonzero measure.) assigns probability zero. No finite-length program can provide a complete and consistent extension of , by the first incompleteness theorem. Therefore, any positive contribution to the probability must come from infinite-length programs. This is the same as saying that there exists a stochastic Turing machine with a nonzero probability of outputting a complete and consistent extension of . This is impossible by a result of Antonin Kucera; see here.
Discussion
The purpose of strict dominance is to show that the hypothesis space of the modified Demski prior is in some sense larger than that of Solomonoff induction, as a result of its ability to employ logic. This comes close to issues like what an agent does if it encounters a halting oracle in the wild. (Not very close, though.)
Defining dominance over the entire -algebra allows us to talk about how assigns positive measure to a class of sentences no one of which gets nonzero measure individually. An analogy would be the comparison between a distribution which assigns a positive probability to all rational numbers between 0 and 1, and the mixture of and the uniform distribution between 0 and 1. The 2nd does not assign nonzero probability to any new individual numbers, but assigns nonzero probability to the real numbers as a set while does not.
I think this is a very legitimate move, but only found definitions for dominance which quantified over binary strings. It might be possible to prove strict dominance for that type of dominance, but I doubt it; it would require finding a single sequence which has nonzero weight in and zero weight in .
Compared to my definition, the more standard definition is closer to a notion of bounded regret. For to dominate in the old sense means that its Bayes score can only be some constant worse than , no worse, regardless of environment. My definition here implies that, but I think it’s a stricter requirement.
This framework applies the modified Demski prior to empirical uncertainty, by adding a new predicate to . What’s more interesting for logical uncertainty is predicting a sentence which is already definable in . Two informal conjectures:
Informal conjecture 1. “Natural” approximations of the modified Demski prior on fail to approximate at level 1 logical uncertainty (where is computable, and the time bound is enough to compute all previous values before guessing at the current). Allowing programs to guess on only the sentences they’re interested in overly incentivises them to remain silent; this is the problem which is solved by the subsequence induction algorithm (which consequently solves Level 1). Paul Christiano pointed out that this setting is called “sleeping experts” and subsequence induction looks a lot like the standard solution.
Informal conjecture 2. The modified Demski prior can succeed at Level 1 by performing a Bayesian update on things can prove, rather than accepting the axioms of as axioms. This is because a Bayesian update implements a correct solution to the sleeping experts problem, while the Demski procedure of kicking out inconsistent hypotheses does not. Doing a Bayesian update on our set of hypotheses when we prove a new implements something closer to subsequence induction, and so will be able to learn properly (at Level 1 difficulty, that is). This is similar to Christiano’s paradox of ignorance.
[EDIT: This all deals with measures, not semimeasures; see below.]
Your definition is equivalent to the standard definition. Li and Vitányi say that P dominates Q iff there is some α∈R such that for any binary string x, we have αP(x)≥Q(x). Li and Vitányi’s “probability distributions” take a binary string as input while the probability distributions we are using ask for an element of a σ-algebra, but we can abuse notation by allowing a binary string to represent the event of that the random sequence starts with this string as a prefix.
Theorem. For any probability distributions P,Q on Cantor space and any α, the condition αP(x)≥Q(x) holds for all binary strings x iff it holds for all x∈Σ.
Proof. The direction ⟸ is immediate. For ⟹, suppose that αP(x)≥Q(x) for all binary string events x. We first show that this property holds for every event in the algebra A generated by the binary string events. We can write any x∈A as a disjoint union of binary string events x=⋃ni=1yi. Then, αP(x)=αn∑i=1P(yi)≥n∑i=1Q(yi)=Q(x), as desired.
We can extend this to the σ-algebra generated by A, which is just the Borel σ-algebra Σ on Cantor space, by the monotone class theorem. I’m not sure how much measure theory you know, but you can think of this as a form of induction on measurable sets; if a property holds on an algebra of sets A and it is preserved by countable monotone unions and intersections, then it holds on every set in the σ-algebra generated by A. For countable monotone unions, if we have x1⊆x2⊆…, then αP(∞⋃i=1xi)=supiαP(xi)≥supiQ(xi)=Q(⋃ixi). We can do the same thing for countable monotone intersections x1⊇x2⊇…; αP(∞⋂i=1xi)=infiαP(xi)≥infiQ(xi)=Q(⋂ixi). □
EDIT: I’m talking about probability distributions P rather than semimeasures. I don’t know how your definition of dominance is helpful for semimeasures though. The universal semimeasure M is defined on binary sequences, and I think that question of whether P dominates M depends on how you extend it to the whole σ-algebra. I expect that many reasonable extensions of M would satisfy your Theorem 1, but this post (in particular the proof of Theorem 1) doesn’t seem to choose a specific such extension.