Maximize Worst Case Bayes Score
This post has mostly been copied from something I posted on Less Wrong over a year ago. I am reposting it here because I want to reference it in a future post related to the two-update problem. Here, I propose a logical prior. I do not think this prior is nearly as good as the modified Demski prior, or even the original Demski prior, but I am posting it because it has some interesting properties.
Given a consistent but incomplete theory, how should one choose a random model of that theory?
My proposal is rather simple. Just assign probabilities to sentences in such a way that if an adversary were to choose a model, your Worst Case Bayes Score is maximized. This assignment of probabilities represents a probability distribution on models, and we choose a model randomly from this distribution. However, it will take some work to show that what I just described even makes sense. We need to show that the Worst Case Bayes Score can be maximized, that such a maximum is unique, and that this assignment of probabilities to sentences represents an actual probability distribution. This post gives the necessary definitions, and proves these three facts.
Finally, I will show that any given probability assignment is coherent if and only if it is impossible to change the probability assignment in a way that simultaneously improves the Bayes Score by an amount bounded away from 0 in all models. This is nice because it gives us a measure of how far a probability assignment is from being coherent. Namely, we can define the “incoherence” of a probability assignment to be the supremum amount by which you can simultaneously improve the Bayes Score in all models. This could be a useful notion since we usually cannot compute a coherent probability assignment, so in practice we need to work with incoherent probability assignments which approach a coherent one.
Now, let’s move on to the formal definitions and proofs.
Fix some language L, for example the language of first order set theory. Fix a consistent theory T of L, for example ZFC. Fix a nowhere zero probability measure μ on L, for example μ(ϕ) = 2^{−ℓ(ϕ)}, where ℓ(ϕ) is the number of bits necessary to encode ϕ.
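As a sanity check on that example (assuming the encoding of sentences is prefix-free, which the post does not specify), Kraft's inequality gives
$$\sum_{\phi\in L}2^{-\ell(\phi)}\le 1,$$
so the weights can be rescaled by a constant to sum to exactly 1, making μ a genuine probability measure on L.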
A probability assignment of L is any function from L to the interval [0,1]. Note that this can be any function, and does not have to represent a probability distribution. Given a probability assignment P of L, and a model M of T, we can define the Bayes Score of P with respect to M by
$$\mathrm{Bayes}(M,P)=\sum_{M\models\phi}\mu(\phi)\log_2(P(\phi))+\sum_{M\models\neg\phi}\mu(\phi)\log_2(1-P(\phi)).$$
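To make the definition concrete, here is a minimal Python sketch that computes the Bayes Score truncated to a finite list of sentences. The helper name, the toy sentences, the weights, and the truth values are all illustrative assumptions, not part of the construction.

```python
import math

def truncated_bayes_score(sentences, mu, prob, true_in_M):
    """Bayes Score of the assignment `prob` against a model M,
    restricted to a finite list of sentences.

    sentences : iterable of sentence identifiers
    mu        : dict mapping each sentence to its weight mu(phi) > 0
    prob      : dict mapping each sentence to P(phi) in [0, 1]
    true_in_M : dict mapping each sentence to True if M satisfies it
    """
    score = 0.0
    for phi in sentences:
        # Use P(phi) when M satisfies phi, and 1 - P(phi) otherwise.
        p = prob[phi] if true_in_M[phi] else 1.0 - prob[phi]
        if p == 0.0:
            return float("-inf")  # the score diverges to minus infinity
        score += mu[phi] * math.log2(p)
    return score

# Toy example: three sentences with weights summing to 1.
sentences = ["A", "B", "C"]
mu = {"A": 0.5, "B": 0.25, "C": 0.25}
prob = {"A": 0.9, "B": 0.5, "C": 0.1}
true_in_M = {"A": True, "B": False, "C": False}
print(truncated_bayes_score(sentences, mu, prob, true_in_M))
```

With prob set identically to 1/2, the truncated score is just minus the sum of the listed weights, which matches the calculation for the identically 1/2 assignment below.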
We define the Worst Case Bayes Score WCB(P) to be the infimum of Bayes(M,P) over all models M of T. Let P denote the probability assignment that maximizes the function WCB. We will show that this maximum exists and is unique, so P is well defined.
In fact, P is also coherent, meaning that there exists a probability distribution on the set of all models of T such that P(ϕ) is exactly the probability that a randomly chosen model satisfies ϕ. Since the natural definition of a measurable subset of models comes from unions and intersections of the sets of all models satisfying a given sentence, we can think of P as an actual probability distribution on the set of all models of T.
First, we must show that there exists a probability assignment P which maximizes WCB.
Note that Bayes(M,P) either diverges to −∞, or converges to a non-positive real number. If P is the identically 1/2 function, then WCB(P)=−1, so there is at least one P for which WCB(P) is finite. This means that when maximizing WCB(P), we need only consider P for which Bayes(M,P) converges to a number between −1 and 0 for all M.
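To spell out that computation: if P(ϕ)=1/2 for every ϕ, then each sentence contributes μ(ϕ)log_2(1/2)=−μ(ϕ) whether M⊨ϕ or M⊨¬ϕ, so for every model M of T
$$\mathrm{Bayes}(M,P)=\sum_{\phi\in L}-\mu(\phi)=-1,$$
since μ is a probability measure on L, and hence WCB(P)=−1.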
Assume by way of contradiction that there is no P which maximizes WCB. Then there must be some supremum value m such that WCB can get arbitrarily close to m, but never equals or surpasses m. Consider an infinite sequence of probability assignments {P_i} such that WCB(P_i)→m. We may pass to a subsequence of {P_i} (by a diagonal argument over the countably many sentences of L) in order to assume that {P_i(ϕ)} converges for every sentence ϕ. Let P be such that P_i(ϕ)→P(ϕ) for all ϕ.
By assumption, WCB(P) must be less than m. Take any model M for which Bayes(M,P)<m. Then, since every term in the Bayes Score is non-positive, there exists a finite subset S of L such that Bayes_S(M,P)<m, where
$$\mathrm{Bayes}_S(M,P)=\sum_{\phi\in S,\;M\models\phi}\mu(\phi)\log_2(P(\phi))+\sum_{\phi\in S,\;M\models\neg\phi}\mu(\phi)\log_2(1-P(\phi)).$$
Note that in order to keep the Bayes Score at least −1, any P_i must satisfy 2^{−1/μ(ϕ)} ≤ P_i(ϕ) ≤ 1 if M⊨ϕ, and 0 ≤ P_i(ϕ) ≤ 1 − 2^{−1/μ(ϕ)} if M⊨¬ϕ, since otherwise the single term that ϕ contributes would already be less than −1. Consider the space of all functions f from S to [0,1] satisfying these inequalities. Since there are only finitely many values, each restricted to a closed and bounded interval, this space is compact. Further, Bayes_S(M,f) is a continuous function of f, defined everywhere on this compact set. Therefore,
$$\lim_{i\to\infty}\mathrm{Bayes}_S(M,P_i)=\mathrm{Bayes}_S(M,P)<m.$$
However, clearly WCB(P_i) ≤ Bayes(M,P_i) ≤ Bayes_S(M,P_i), so
$$\lim_{i\to\infty}WCB(P_i)<m,$$
contradicting our assumption that WCB(P_i) converges to m.
Next, we will show that there is a unique probability assignment which maximizes WCB. Assume by way of contradiction that there are two distinct probability assignments, P_1 and P_2, which maximize WCB. Consider the probability assignment P_3, given by
$$P_3(\phi)=\frac{\sqrt{P_1(\phi)P_2(\phi)}}{\sqrt{P_1(\phi)P_2(\phi)}+\sqrt{(1-P_1(\phi))(1-P_2(\phi))}}.$$
It is quick to check that this definition satisfies both
$$\log_2(P_3(\phi))\ge\frac{\log_2(P_1(\phi))+\log_2(P_2(\phi))}{2}$$
and
$$\log_2(1-P_3(\phi))\ge\frac{\log_2(1-P_1(\phi))+\log_2(1-P_2(\phi))}{2},$$
and in both cases equality holds only when P_1(ϕ)=P_2(ϕ).
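One way to check the first inequality (a quick sketch of the verification): by the Cauchy–Schwarz inequality,
$$\sqrt{P_1(\phi)P_2(\phi)}+\sqrt{(1-P_1(\phi))(1-P_2(\phi))}\le\sqrt{P_1(\phi)+(1-P_1(\phi))}\,\sqrt{P_2(\phi)+(1-P_2(\phi))}=1,$$
with equality only when P_1(ϕ)=P_2(ϕ). Hence the denominator in the definition of P_3(ϕ) is at most 1, and
$$\log_2(P_3(\phi))\ge\log_2\sqrt{P_1(\phi)P_2(\phi)}=\frac{\log_2(P_1(\phi))+\log_2(P_2(\phi))}{2}.$$
The second inequality follows from the same argument applied to 1−P_1(ϕ) and 1−P_2(ϕ), since 1−P_3(ϕ) has the same denominator with √((1−P_1(ϕ))(1−P_2(ϕ))) in the numerator.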
Therefore, we get that for any fixed model M,
$$\mathrm{Bayes}(M,P_3)\ge\frac{\mathrm{Bayes}(M,P_1)+\mathrm{Bayes}(M,P_2)}{2}.$$
By looking at the improvement coming from a single sentence ϕ with P_1(ϕ)≠P_2(ϕ), we see that
$$\mathrm{Bayes}(M,P_3)-\frac{\mathrm{Bayes}(M,P_1)+\mathrm{Bayes}(M,P_2)}{2}$$
is actually bounded away from 0, uniformly over all models M, which means that
$$WCB(P_3)>\frac{WCB(P_1)+WCB(P_2)}{2}=WCB(P_1),$$
which contradicts the fact that P_1 and P_2 maximize WCB.
This means that there is a unique probability assignment, P, which maximizes WCB, but we still need to show that P is coherent. For this, we will use the fact that P is coherent if and only if P assigns probability 0 to every contradiction, probability 1 to every tautology, and satisfies P(ϕ)=P(ϕ∧ψ)+P(ϕ∧¬ψ) for all ϕ and ψ.
Clearly P assigns probability 0 to every contradiction, since otherwise we could increase the Bayes Score in all models by the same amount by updating that probability to 0. Similarly P clearly assigns probability 1 to all tautologies.
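To spell this out: if ϕ is a contradiction, then M⊨¬ϕ in every model M of T, so the term ϕ contributes to Bayes(M,P) is μ(ϕ)log_2(1−P(ϕ)). Setting P(ϕ)=0 raises this term to 0, a gain of
$$-\mu(\phi)\log_2(1-P(\phi))>0$$
in every model whenever P(ϕ)>0. Symmetrically, if ϕ is a tautology and P(ϕ)<1, setting P(ϕ)=1 gains −μ(ϕ)log_2(P(ϕ))>0 in every model.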
If P(ϕ)≠P(ϕ∧ψ)+P(ϕ∧¬ψ), then we update all three probabilities as follows:
$$P(\phi)\mapsto\frac{1}{1+\frac{1-P(\phi)}{P(\phi)}\,2^{-x/\mu(\phi)}},$$
$$P(\phi\wedge\psi)\mapsto\frac{1}{1+\frac{1-P(\phi\wedge\psi)}{P(\phi\wedge\psi)}\,2^{x/\mu(\phi\wedge\psi)}},$$
and
$$P(\phi\wedge\neg\psi)\mapsto\frac{1}{1+\frac{1-P(\phi\wedge\neg\psi)}{P(\phi\wedge\neg\psi)}\,2^{x/\mu(\phi\wedge\neg\psi)}},$$
where x is the unique real number such that the three new probabilities satisfy P(ϕ)=P(ϕ∧ψ)+P(ϕ∧¬ψ). This correction increases the Bayes Score by the same amount in all models, and therefore increases WCB, contradicting the maximality of WCB(P). Therefore P is coherent as desired.
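To see why the change is the same in every model (a sketch of the bookkeeping): each update above multiplies the odds of its sentence by a power of 2; the odds P(ϕ)/(1−P(ϕ)) are multiplied by 2^{x/μ(ϕ)}, while the odds of ϕ∧ψ and of ϕ∧¬ψ are multiplied by 2^{−x/μ(ϕ∧ψ)} and 2^{−x/μ(ϕ∧¬ψ)} respectively. If the odds o of a sentence χ are multiplied by 2^{s}, its term in Bayes(M,P) changes by
$$\mu(\chi)\left(s+\log_2\frac{1+o}{1+o\,2^{s}}\right)\ \text{ if }M\models\chi,\qquad \mu(\chi)\log_2\frac{1+o}{1+o\,2^{s}}\ \text{ if }M\models\neg\chi,$$
so the two cases differ by exactly μ(χ)s, which equals x for ϕ and −x for each of the two conjunctions. Every model of T satisfies exactly one of ϕ∧ψ, ϕ∧¬ψ, or ¬ϕ; in the first two cases the extra x from ϕ cancels against the extra −x from the true conjunction, and in the third case no extra terms appear, so the total change in Bayes(M,P) is the same in all three cases.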
Finally, we show that any given probability assignment P is coherent if and only if it is impossible to simultaneously improve the Bayes Score by an amount bounded away from 0 in all models. The above proof that P is coherent actually shows one direction of this equivalence, since the only fact it used about P is that you cannot simultaneously improve the Bayes Score by an amount bounded away from 0 in all models. For the other direction, assume by way of contradiction that P is coherent, and that there exists a Q and an ϵ>0 such that Bayes(M,Q)−Bayes(M,P)>ϵ for all M.
In particular, since P is coherent, it represents a probability distribution on models, so we can choose a random model M from the distribution P. If we do so, we must have that
$$E(\mathrm{Bayes}(M,Q))-E(\mathrm{Bayes}(M,P))>0.$$
However, this contradicts the well known fact that the expectation of the Bayes Score is maximized by choosing honest probabilities corresponding to the actual distribution that M is chosen from.
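To spell out this last step (assuming the relevant expectations are finite and the sum over sentences can be exchanged with the expectation): a model M drawn from the distribution P satisfies ϕ with probability P(ϕ), so
$$E(\mathrm{Bayes}(M,Q))-E(\mathrm{Bayes}(M,P))=\sum_{\phi\in L}\mu(\phi)\left(P(\phi)\log_2\frac{Q(\phi)}{P(\phi)}+(1-P(\phi))\log_2\frac{1-Q(\phi)}{1-P(\phi)}\right)\le 0,$$
since each summand is −μ(ϕ) times the Kullback–Leibler divergence $D_{\mathrm{KL}}(\mathrm{Ber}(P(\phi))\,\|\,\mathrm{Ber}(Q(\phi)))$, which is non-negative by Gibbs' inequality.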
Coming up, I will present a list of open problems related to this prior.