Maximize Worst Case Bayes Score
This post has mostly been copied from something I posted on Less Wrong over a year ago. I am reposting it here because I want to reference it in a future post related to the two-update problem. Here, I propose a logical prior. I do not think this prior is nearly as good as the modified Demski prior, or even the original Demski prior, but I am posting it because it has some interesting properties.
Given a consistent but incomplete theory, how should one choose a random model of that theory?
My proposal is rather simple. Just assign probabilities to sentences in such a way that if an adversary were to choose a model, your Worst Case Bayes Score is maximized. This assignment of probabilities represents a probability distribution on models, and we choose a model randomly from this distribution. However, it will take some work to show that what I just described even makes sense. We need to show that the Worst Case Bayes Score can be maximized, that such a maximum is unique, and that this assignment of probabilities to sentences represents an actual probability distribution. This post gives the necessary definitions, and proves these three facts.
Finally, I will show that any given probability assignment is coherent if and only if it is impossible to change the probability assignment in a way that simultaneously improves the Bayes Score by an amount bounded away from 0 in all models. This is nice because it gives us a measure of how far a probability assignment is from being coherent. Namely, we can define the “incoherence” of a probability assignment to be the supremum amount by which you can simultaneously improve the Bayes Score in all models. This could be a useful notion since we usually cannot compute a coherent probability assignment, so in practice we need to work with incoherent probability assignments which approach a coherent one.
Now, let’s move on to the formal definitions and proofs.
Fix some language L, for example the language of first order set theory. Fix a consistent theory T of L, for example ZFC. Fix a nowhere zero probability measure μ on L, for example μ(ϕ) = 2^{−ℓ(ϕ)}, where ℓ(ϕ) is the number of bits necessary to encode ϕ.
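As a sanity check on that example (assuming the encoding of sentences is prefix-free, which the post does not specify), Kraft's inequality gives
$$\sum_{\phi\in L}2^{-\ell(\phi)}\le 1,$$
so the weights can be rescaled by a constant to sum to exactly 1, making μ a genuine probability measure on L.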
A probability assignment of L is any function from L to the interval [0,1]. Note that this can be any function, and does not have to represent a probability distribution. Given a probability assignment P of L, and a model M of T, we can define the Bayes Score of P with respect to M by
$$\mathrm{Bayes}(M,P)=\sum_{M\models\phi}\mu(\phi)\log_2(P(\phi))+\sum_{M\models\neg\phi}\mu(\phi)\log_2(1-P(\phi)).$$
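To make the definition concrete, here is a minimal Python sketch that computes the Bayes Score truncated to a finite list of sentences. The helper name, the toy sentences, the weights, and the truth values are all illustrative assumptions, not part of the construction.

```python
import math

def truncated_bayes_score(sentences, mu, prob, true_in_M):
    """Bayes Score of the assignment `prob` against a model M,
    restricted to a finite list of sentences.

    sentences : iterable of sentence identifiers
    mu        : dict mapping each sentence to its weight mu(phi) > 0
    prob      : dict mapping each sentence to P(phi) in [0, 1]
    true_in_M : dict mapping each sentence to True if M satisfies it
    """
    score = 0.0
    for phi in sentences:
        # Use P(phi) when M satisfies phi, and 1 - P(phi) otherwise.
        p = prob[phi] if true_in_M[phi] else 1.0 - prob[phi]
        if p == 0.0:
            return float("-inf")  # the score diverges to minus infinity
        score += mu[phi] * math.log2(p)
    return score

# Toy example: three sentences with weights summing to 1.
sentences = ["A", "B", "C"]
mu = {"A": 0.5, "B": 0.25, "C": 0.25}
prob = {"A": 0.9, "B": 0.5, "C": 0.1}
true_in_M = {"A": True, "B": False, "C": False}
print(truncated_bayes_score(sentences, mu, prob, true_in_M))
```

With prob set identically to 1/2, the truncated score is just minus the sum of the listed weights, which matches the calculation for the identically 1/2 assignment below.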
We define the Worst Case Bayes Score WCB(P) to be the infimum of Bayes(M,P) over all models M of T. Let P denote the probability assignment that maximizes the function WCB. We will show that this maximum exists and is unique, so P is well defined.
In fact, P is also coherent, meaning that there exists a probability distribution on the set of all models of T such that P(ϕ) is exactly the probability that a randomly chosen model satisfies ϕ. Since the natural definition of a measurable subset of models comes from unions and intersections of the sets of all models satisfying a given sentence, we can think of P as an actual probability distribution on the set of all models of T.
First, we must show that there exists a probability assignment P which maximizes WCB.
Note that Bayes(M,P) either diverges to −∞, or converges to a non-positive real number. If P is the identically 1/2 function, then WCB(P)=−1, so there is at least one P for which WCB(P) is finite. This means that when maximizing WCB(P), we need only consider P for which Bayes(M,P) converges to a number between −1 and 0 for all M.
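To spell out that computation: if P(ϕ)=1/2 for every ϕ, then each sentence contributes μ(ϕ)log_2(1/2)=−μ(ϕ) whether M⊨ϕ or M⊨¬ϕ, so for every model M of T
$$\mathrm{Bayes}(M,P)=\sum_{\phi\in L}-\mu(\phi)=-1,$$
since μ is a probability measure on L, and hence WCB(P)=−1.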
Assume by way of contradiction that there is no P which maximizes WCB. Then there must be some supremum value m such that WCB can get arbitrarily close to m, but never equals or surpasses m. Consider an infinite sequence of probability assignments {P_i} such that WCB(P_i)→m. We may pass to a subsequence of {P_i} (by a diagonal argument over the countably many sentences of L) in order to assume that {P_i(ϕ)} converges for every sentence ϕ. Let P be such that P_i(ϕ)→P(ϕ) for all ϕ.
By assumption, WCB(P) must be less than m. Take any model M for which Bayes(M,P)<m. Then, since every term in the Bayes Score is non-positive, there exists a finite subset S of L such that Bayes_S(M,P)<m, where
$$\mathrm{Bayes}_S(M,P)=\sum_{\phi\in S,\;M\models\phi}\mu(\phi)\log_2(P(\phi))+\sum_{\phi\in S,\;M\models\neg\phi}\mu(\phi)\log_2(1-P(\phi)).$$
Note that in order to keep the Bayes Score at least −1, any P_i must satisfy 2^{−1/μ(ϕ)} ≤ P_i(ϕ) ≤ 1 if M⊨ϕ, and 0 ≤ P_i(ϕ) ≤ 1 − 2^{−1/μ(ϕ)} if M⊨¬ϕ, since otherwise the single term that ϕ contributes would already be less than −1. Consider the space of all functions f from S to [0,1] satisfying these inequalities. Since there are only finitely many values, each restricted to a closed and bounded interval, this space is compact. Further, Bayes_S(M,f) is a continuous function of f, defined everywhere on this compact set. Therefore,
$$\lim_{i\to\infty}\mathrm{Bayes}_S(M,P_i)=\mathrm{Bayes}_S(M,P)<m.$$
However, clearly WCB(P_i) ≤ Bayes(M,P_i) ≤ Bayes_S(M,P_i), so
$$\lim_{i\to\infty}WCB(P_i)<m,$$
contradicting our assumption that WCB(P_i) converges to m.
Next, we will show that there is a unique probability assignment which maximizes WCB. Assume by way of contradiction that there are two distinct probability assignments, P_1 and P_2, which maximize WCB. Consider the probability assignment P_3, given by
$$P_3(\phi)=\frac{\sqrt{P_1(\phi)P_2(\phi)}}{\sqrt{P_1(\phi)P_2(\phi)}+\sqrt{(1-P_1(\phi))(1-P_2(\phi))}}.$$
It is quick to check that this definition satisfies both
$$\log_2(P_3(\phi))\ge\frac{\log_2(P_1(\phi))+\log_2(P_2(\phi))}{2}$$
and
$$\log_2(1-P_3(\phi))\ge\frac{\log_2(1-P_1(\phi))+\log_2(1-P_2(\phi))}{2},$$
and in both cases equality holds only when P_1(ϕ)=P_2(ϕ).
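One way to check the first inequality (a quick sketch of the verification): by the Cauchy–Schwarz inequality,
$$\sqrt{P_1(\phi)P_2(\phi)}+\sqrt{(1-P_1(\phi))(1-P_2(\phi))}\le\sqrt{P_1(\phi)+(1-P_1(\phi))}\,\sqrt{P_2(\phi)+(1-P_2(\phi))}=1,$$
with equality only when P_1(ϕ)=P_2(ϕ). Hence the denominator in the definition of P_3(ϕ) is at most 1, and
$$\log_2(P_3(\phi))\ge\log_2\sqrt{P_1(\phi)P_2(\phi)}=\frac{\log_2(P_1(\phi))+\log_2(P_2(\phi))}{2}.$$
The second inequality follows from the same argument applied to 1−P_1(ϕ) and 1−P_2(ϕ), since 1−P_3(ϕ) has the same denominator with √((1−P_1(ϕ))(1−P_2(ϕ))) in the numerator.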
Therefore, we get that for any fixed model M,
$$\mathrm{Bayes}(M,P_3)\ge\frac{\mathrm{Bayes}(M,P_1)+\mathrm{Bayes}(M,P_2)}{2}.$$
By looking at the improvement coming from a single sentence ϕ with P_1(ϕ)≠P_2(ϕ), we see that
$$\mathrm{Bayes}(M,P_3)-\frac{\mathrm{Bayes}(M,P_1)+\mathrm{Bayes}(M,P_2)}{2}$$
is actually bounded away from 0, uniformly over all models M, which means that
$$WCB(P_3)>\frac{WCB(P_1)+WCB(P_2)}{2}=WCB(P_1),$$
which contradicts the fact that P_1 and P_2 maximize WCB.
This means that there is a unique probability assignment, P, which maximizes WCB, but we still need to show that P is coherent. For this, we will use the fact that P is coherent if and only if P assigns probability 0 to every contradiction, probability 1 to every tautology, and satisfies P(ϕ)=P(ϕ∧ψ)+P(ϕ∧¬ψ) for all ϕ and ψ.
Clearly P assigns probability 0 to every contradiction, since otherwise we could increase the Bayes Score in all models by the same amount by updating that probability to 0. Similarly P clearly assigns probability 1 to all tautologies.
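To spell this out: if ϕ is a contradiction, then M⊨¬ϕ in every model M of T, so the term ϕ contributes to Bayes(M,P) is μ(ϕ)log_2(1−P(ϕ)). Setting P(ϕ)=0 raises this term to 0, a gain of
$$-\mu(\phi)\log_2(1-P(\phi))>0$$
in every model whenever P(ϕ)>0. Symmetrically, if ϕ is a tautology and P(ϕ)<1, setting P(ϕ)=1 gains −μ(ϕ)log_2(P(ϕ))>0 in every model.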
If P(ϕ)≠P(ϕ∧ψ)+P(ϕ∧¬ψ), then we update all three probabilities as follows:
$$P(\phi)\mapsto\frac{1}{1+\frac{1-P(\phi)}{P(\phi)}\,2^{-x/\mu(\phi)}},$$
$$P(\phi\wedge\psi)\mapsto\frac{1}{1+\frac{1-P(\phi\wedge\psi)}{P(\phi\wedge\psi)}\,2^{x/\mu(\phi\wedge\psi)}},$$
and
$$P(\phi\wedge\neg\psi)\mapsto\frac{1}{1+\frac{1-P(\phi\wedge\neg\psi)}{P(\phi\wedge\neg\psi)}\,2^{x/\mu(\phi\wedge\neg\psi)}},$$
where x is the unique real number such that the three new probabilities satisfy P(ϕ)=P(ϕ∧ψ)+P(ϕ∧¬ψ). This correction increases the Bayes Score by the same amount in all models, and therefore increases WCB, contradicting the maximality of WCB(P). Therefore P is coherent as desired.
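To see why the change is the same in every model (a sketch of the bookkeeping): each update above multiplies the odds of its sentence by a power of 2; the odds P(ϕ)/(1−P(ϕ)) are multiplied by 2^{x/μ(ϕ)}, while the odds of ϕ∧ψ and of ϕ∧¬ψ are multiplied by 2^{−x/μ(ϕ∧ψ)} and 2^{−x/μ(ϕ∧¬ψ)} respectively. If the odds o of a sentence χ are multiplied by 2^{s}, its term in Bayes(M,P) changes by
$$\mu(\chi)\left(s+\log_2\frac{1+o}{1+o\,2^{s}}\right)\ \text{ if }M\models\chi,\qquad \mu(\chi)\log_2\frac{1+o}{1+o\,2^{s}}\ \text{ if }M\models\neg\chi,$$
so the two cases differ by exactly μ(χ)s, which equals x for ϕ and −x for each of the two conjunctions. Every model of T satisfies exactly one of ϕ∧ψ, ϕ∧¬ψ, or ¬ϕ; in the first two cases the extra x from ϕ cancels against the extra −x from the true conjunction, and in the third case no extra terms appear, so the total change in Bayes(M,P) is the same in all three cases.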
Finally, we show that any given probability assignment P is coherent if and only if it is impossible to simultaneously improve the Bayes Score by an amount bounded away from 0 in all models. The above proof that P is coherent actually shows one direction of this equivalence, since the only fact it used about P is that you cannot simultaneously improve the Bayes Score by an amount bounded away from 0 in all models. For the other direction, assume by way of contradiction that P is coherent, and that there exists a Q and an ϵ>0 such that Bayes(M,Q)−Bayes(M,P)>ϵ for all M.
In particular, since P is coherent, it represents a probability distribution on models, so we can choose a random model M from the distribution P. If we do so, we must have that
$$E(\mathrm{Bayes}(M,Q))-E(\mathrm{Bayes}(M,P))>0.$$
However, this contradicts the well known fact that the expectation of the Bayes Score is maximized by choosing honest probabilities corresponding to the actual distribution that M is chosen from.
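To spell out this last step (assuming the relevant expectations are finite and the sum over sentences can be exchanged with the expectation): a model M drawn from the distribution P satisfies ϕ with probability P(ϕ), so
$$E(\mathrm{Bayes}(M,Q))-E(\mathrm{Bayes}(M,P))=\sum_{\phi\in L}\mu(\phi)\left(P(\phi)\log_2\frac{Q(\phi)}{P(\phi)}+(1-P(\phi))\log_2\frac{1-Q(\phi)}{1-P(\phi)}\right)\le 0,$$
since each summand is −μ(ϕ) times the Kullback–Leibler divergence $D_{\mathrm{KL}}(\mathrm{Ber}(P(\phi))\,\|\,\mathrm{Ber}(Q(\phi)))$, which is non-negative by Gibbs' inequality.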
Coming up, I will present a list of open problems related to this prior.