The Principle of Predicted Improvement

I made a conjecture I think is cool. Mark Sellke proved it. I don’t know what else to do with it, so I will explain why I think it’s cool and give the proof here. Hopefully, you will think it’s cool, too.

______________________________________________________________________
Suppose we are trying to assign as much probability as possible to whichever of several hypotheses is true. The law of conservation of expected evidence tells us that, for any hypothesis, the probability we should expect to assign to it after observing a test result equals the probability we assign to it now. Suppose that H takes values $h_i$. We can express the law of conservation of expected evidence as, for any fixed $h_i$:
$$E[P(H = h_i \mid D)] = P(H = h_i)$$
In English this says that the probability we should expect to assign to $h_i$ after observing the value of D equals the probability we assign to $h_i$ before we observe the value of D.
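For completeness, here is the one-line derivation, assuming (as in the proof below) that D takes finitely many values $d_j$:

$$E[P(H = h_i \mid D)] = \sum_j P[d_j]\, P[h_i \mid d_j] = \sum_j P[h_i \wedge d_j] = P[h_i].$$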
This law raises a question. If all I want is to assign as much probability to the true hypothesis as possible, and I should expect to assign the same probability I currently assign to each hypothesis after getting a new piece of data, why would I ever collect more data? A. J. Ayer pointed out this puzzle in The Conception of Probability as a Logical Relation (I unfortunately cannot find a link). I. J. Good solved Ayer’s puzzle in On the Principle of Total Evidence. Good shows that if I need to act on a hypothesis, the expected value of gaining an extra piece of data is always greater than or equal to the expected value of not gaining it. Good’s solution is correct, but I found it somewhat unsatisfying. Ayer’s puzzle is purely epistemic, and while there is nothing wrong with a pragmatic solution to an epistemic puzzle, I still felt that there should be a solution that makes no reference to acts or utility at all.
Herein I present a theorem that I think constitutes such a solution. I have decided to call it the principle of predicted improvement (PPI):
$$E[P(H \mid D)] \ge E[P(H)]$$
In English the theorem says that the probability we should expect to assign to the true value of H after observing the true value of D is greater than or equal to the probability we expect to assign to the true value of H before observing the value of D. The inequality is strict whenever H and D are not independent. In other words, you should predict that your epistemic state will improve (i.e. that you will assign more probability to the truth) after making any relevant observation.
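A quick numerical sanity check may help. The coin setup below is my own toy example, not from the post: H is whether a coin is fair or heads-biased, and D is a single flip.

```python
# Toy check of the PPI: E[P(H|D)] >= E[P(H)].
# H: the coin is "fair" (P(heads) = 0.5) or "biased" (P(heads) = 0.9), uniform prior.
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {"fair": {"heads": 0.5, "tails": 0.5},
              "biased": {"heads": 0.9, "tails": 0.1}}

# Joint P[h ∧ d] and marginal P[d].
joint = {(h, d): prior[h] * likelihood[h][d]
         for h in prior for d in ("heads", "tails")}
p_d = {d: sum(joint[(h, d)] for h in prior) for d in ("heads", "tails")}

# E[P(H)] = sum_i P[h_i]^2: the probability we expect to assign to the truth now.
before = sum(p ** 2 for p in prior.values())

# E[P(H|D)] = sum_{i,j} P[h_i ∧ d_j] * P[h_i|d_j] = sum_{i,j} P[h_i ∧ d_j]^2 / P[d_j].
after = sum(p_hd ** 2 / p_d[d] for (h, d), p_hd in joint.items())

print(f"E[P(H)]   = {before:.4f}")  # 0.5000
print(f"E[P(H|D)] = {after:.4f}")   # ~0.5952, strictly larger, as the PPI predicts
```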
This is a solution to Ayer’s puzzle because it says that I should always expect to assign more probability to the true hypothesis after making a relevant observation. It is a purely epistemic solution because it makes no reference to acts or utility. So long as I want to assign more probability to the true hypothesis than I currently do, I should want to make relevant observations.
Importantly, this is completely consistent with the law of conservation of expected evidence. Although for any particular hypothesis I should expect to assign it the same probability after performing a test that I do now, I should also expect to assign more probability to whichever hypothesis is actually true.
Aside from being a solution to Ayer’s puzzle, the PPI is cool just because it tells you that you should expect to assign more probability to the truth as you observe stuff.
______________________________________________________________________

There is a similar, better-known theorem from information theory that my friend Alex Davis showed me:
$$E\big[E[-\log(P(H \mid D)) \mid D]\big] \le E[-\log(P(H))]$$
In English this says that you should expect the entropy of your distribution to go down (or at least not up) after you make an observation. If we use the law of iterated expectation, multiply both sides by minus one, and reverse the inequality, we get something that looks a lot like the PPI:
$$E[\log(P(H \mid D))] \ge E[\log(P(H))]$$
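The gap between the two sides here is exactly the mutual information between H and D, a standard identity worth recording (it is not part of the original argument, but it foreshadows the equality case below):

$$E[\log(P(H \mid D))] - E[\log(P(H))] = \sum_{i,j} P[h_i \wedge d_j] \log \frac{P[h_i \wedge d_j]}{P[h_i]\, P[d_j]} = I(H; D) \ge 0.$$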
The log version does not imply the PPI in any obvious way, because we can have two distributions such that one is higher in expected negentropy but lower in expected probability assigned to the truth, and vice versa. The two theorems are similar in that one says you should predict that the probability assigned to the true outcome will be higher after an observation, while the other says the same of the log probability; they differ in which measure of confidence they use. A concrete pair of distributions that the two measures rank oppositely is sketched below.
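Here is one such pair (my own toy numbers; assume the truth is itself drawn from the distribution in question, so the expected probability assigned to the truth is $\sum_i p_i^2$ and the expected surprise is the entropy):

```python
# Two credence distributions that the two confidence measures rank oppositely.
from math import log2

def expected_prob_of_truth(p):
    # E[P(H)] when the truth is drawn from p: sum_i p_i^2.
    return sum(x ** 2 for x in p)

def entropy(p):
    # E[-log P(H)] in bits when the truth is drawn from p.
    return -sum(x * log2(x) for x in p if x > 0)

p = (0.45, 0.45, 0.10)
q = (0.60, 0.20, 0.20)

print(expected_prob_of_truth(p), expected_prob_of_truth(q))  # 0.415 < 0.44: q looks better
print(entropy(p), entropy(q))  # ~1.3690 < ~1.3710: p looks better (lower entropy)
```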
The advantage of the PPI is that it is phrased in the same terms as Ayer’s puzzle: probabilities rather than log probabilities. I also claim that the PPI is easier to read and interpret, so it might be pedagogically useful to teach it before teaching that expected entropy after an observation is less than or equal to current entropy.
Anyway, here’s Sellke’s proof.

______________________________________________________________________

Proof:

We want to show

$$E[P(H \mid D)] \ge E[P(H)].$$

Let’s say that H takes values $h_i$ and D takes values $d_j$. The left hand side is

$$E[P(H \mid D)] = \sum_j P[d_j] \sum_i P[h_i \mid d_j]^2.$$

The right hand side is

$$E[P(H)] = \sum_i P[h_i]^2.$$

The left hand side is equivalent to

$$\sum_i \sum_j \frac{P[h_i \wedge d_j]^2}{P[d_j]}.$$

Titu’s lemma (the Engel form of the Cauchy–Schwarz inequality): for any sequence of $a$’s and any sequence of positive $b$’s,

$$\sum_j \frac{a_j^2}{b_j} \ge \frac{\big(\sum_j a_j\big)^2}{\sum_j b_j}.$$

If we apply this to each inner sum $\sum_j$, we get exactly $P[h_i]^2$. This is because:
$$\sum_j P[h_i \wedge d_j] = P[h_i]$$
and
$$\sum_j P[d_j] = 1.$$
For each fixed i, set:
$$a_j = a_{i,j} = P[h_i \wedge d_j]$$
and
$$b_j = P[d_j].$$
To conclude, for each fixed i we have
$$\sum_j \frac{P[h_i \wedge d_j]^2}{P[d_j]} \ge \frac{\big(\sum_j P[h_i \wedge d_j]\big)^2}{\sum_j P[d_j]} = P[h_i]^2$$
and hence
$$\sum_i \sum_j \frac{P[h_i \wedge d_j]^2}{P[d_j]} \ge \sum_i P[h_i]^2,$$

which is exactly $E[P(H \mid D)] \ge E[P(H)]$.
Equality:
Here we explain why the only equality case is when H and D are independent.
Titu’s lemma is an equality iff the two vectors $(a_1, \dots, a_n)$ and $(b_1, \dots, b_n)$ are parallel, that is, iff there exists a constant $\lambda$ such that $a_i = \lambda b_i$ for all i. If we translate this equality condition over to our application of Titu’s lemma above, we see that our proof preserves equality if and only if there exist constants $\lambda_1, \dots, \lambda_n$ such that $P[h_i \wedge d_j] = \lambda_i \cdot P[d_j]$ for all i and j. (We applied Titu once for each value of i, so we need a $\lambda_i$ for each inequality to be an equality. But these $\lambda_i$’s can be different.)
Now if we sum over j we get
$$\sum_j P[h_i \wedge d_j] = \lambda_i \cdot \sum_j P[d_j]$$
and so
$$P[h_i] = \lambda_i.$$
Plugging this back in, we see that equality is true iff
$$P[h_i \wedge d_j] = P[h_i] \cdot P[d_j]$$
which holds for all i and j if and only if H and D are independent, i.e. have zero mutual information.
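As a final sanity check, here is the equality case numerically (again my own toy numbers, not from the post): when D is constructed to be independent of H, the two sides of the PPI agree exactly.

```python
# Equality case of the PPI: D independent of H, so observing D predicts no improvement.
prior = {"h1": 0.3, "h2": 0.7}
p_d = {"d1": 0.4, "d2": 0.6}

# Independence: P[h ∧ d] = P[h] * P[d].
joint = {(h, d): prior[h] * p_d[d] for h in prior for d in p_d}

before = sum(p ** 2 for p in prior.values())                       # E[P(H)]
after = sum(p_hd ** 2 / p_d[d] for (h, d), p_hd in joint.items())  # E[P(H|D)]

print(before, after)  # both 0.58 (up to float rounding): zero mutual information, equality
```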