Suppose a new scientific hypothesis, such as general relativity, explains a well-known observation, such as the perihelion precession of Mercury, better than any existing theory. Intuitively, this is a point in favor of the new theory. However, the probability of the well-known observation was already 100%. How can a previously-known statement provide new support for the hypothesis, as if we are re-updating on evidence we already updated on long ago? This is known as the problem of old evidence, and it is usually leveled as a charge against Bayesian epistemology.
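To state the puzzle in symbols: if the evidence E is already known with certainty, then P(E) = 1, and also P(E|H) = 1 for any hypothesis H with positive probability, so conditioning on E cannot move the probability of H at all:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)} = \frac{1 \cdot P(H)}{1} = P(H).$$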
Bayesian Solutions vs Scientific Method
It is typical for a Bayesian analysis to resolve the problem by pretending that all hypotheses were around “from the very beginning”, so that every hypothesis is judged on all of the evidence. The perihelion precession of Mercury is very difficult to explain under Newton’s theory of gravitation, and therefore quite improbable given it; but it fits quite well with Einstein’s theory of gravitation. Therefore, Newton gets “ruled out” by the evidence, and Einstein wins.
A drawback of this approach is that it allows scientists to formulate a hypothesis in light of the evidence, and then use that very same evidence in their favor. Imagine a physicist competing with Einstein, Dr. Bad, who publishes a “theory of gravity” which is just a list of all the observations we have made about the orbits of celestial bodies. Dr. Bad has “cheated” by providing the correct answers without any deep explanation; but “deep explanation” is not an objectively verifiable quality of a hypothesis, so it should not factor into the calculation of scientific merit, if we are to use simple update rules like Bayes’ Law. Dr. Bad’s theory will predict the evidence as well as or better than Einstein’s. So the new picture is that Newton’s theory gets eliminated by the evidence, but Einstein’s and Dr. Bad’s theories remain as contenders.
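To make the worry concrete, here is a minimal sketch, with made-up numbers and hypothetical function names (nothing here is real orbital mechanics), of why a likelihood-only comparison cannot tell a structured law apart from a lookup table built out of the very data being “predicted”:

```python
import math

# Pretend these are historical observations of a planet's position (in radians).
observations = [0.12, 0.47, 0.93, 1.38, 1.85]

def law_prediction(t):
    """A toy 'law' with real structure: position grows linearly in time."""
    return 0.45 * t + 0.05

def lookup_table_prediction(t, table=observations):
    """Dr. Bad's 'theory': simply replay the recorded observations."""
    return table[t]

def log_likelihood(predict, noise_sd=0.05):
    """Gaussian log-likelihood of the old observations under a prediction rule."""
    total = 0.0
    for t, obs in enumerate(observations):
        err = obs - predict(t)
        total += -0.5 * (err / noise_sd) ** 2 - math.log(noise_sd * math.sqrt(2 * math.pi))
    return total

print("structured law:", log_likelihood(law_prediction))
print("lookup table:  ", log_likelihood(lookup_table_prediction))
# The lookup table fits the old data at least as well as the law, so a
# likelihood comparison alone gives Dr. Bad full credit for "predicting" it.
```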
The scientific method emphasizes predictions made in advance precisely to avoid this type of cheating. To test Einstein’s hypothesis, Sir Arthur Eddington measured the deflection of starlight by the Sun during the 1919 solar eclipse, a prediction Einstein had made in advance of the observation. A test like this would have ruled out Dr. Bad’s theory of gravity, since (unless Dr. Bad possessed a time machine) there would be no way for Dr. Bad to know what to predict.
Simplicity Priors
Proponents of simplicity-based priors will instead say that the problem with Dr. Bad’s theory can be identified by looking at its description length in contrast to Einstein’s. We can tell that Einstein didn’t cheat by gerrymandering his theory to specially predict Mercury’s orbit correctly, because the theory is mathematically succinct! There is no room to cheat; no empty closet in which to smuggle information about Mercury. Marcus Hutter argues for this resolution to the problem of old evidence in On Universal Prediction and Bayesian Confirmation.
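For concreteness: a simplicity prior of the kind Hutter has in mind assigns each hypothesis a prior probability that shrinks exponentially with its description length,

$$P(H) \propto 2^{-\ell(H)},$$

where ℓ(H) is the length, in bits, of the shortest program (or formal description) implementing H.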
In contrast, ruling out all evidence except “advance predictions” may seem like a crude and epistemically inefficient solution, one which will get you to the truth more slowly, or perhaps not at all.
Unfortunately, simplicity priors do not appear to be the end of the story here. They can only help us to avoid cheating in specific contexts. Consider the problem of evaluating how much to trust a variety of forecasters, based on their success at predictions so far; for example, on Manifold Markets or similar platforms. If such platforms gave people points for predicting old evidence, newcomers could simply claim they would have predicted all the old evidence perfectly, gaining more points than the users who were around at the time and actually risked points on those predictions.
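A minimal sketch of the incentive problem, using a logarithmic scoring rule (an illustration only, not Manifold’s actual scoring mechanism): a forecaster who predicted in advance had to spread probability over outcomes that were still uncertain, while a newcomer scored on already-resolved events can claim certainty at no risk.

```python
import math

def log_score(prob_assigned_to_actual_outcome):
    """Logarithmic scoring rule: reward is the log of the probability
    the forecaster gave to the outcome that actually occurred."""
    return math.log(prob_assigned_to_actual_outcome)

# An honest forecaster, predicting in advance, had to accept real uncertainty:
honest_score = log_score(0.7)    # gave 70% to the outcome that later occurred

# A newcomer "predicting" an already-resolved event can safely claim certainty,
# collecting the maximum possible score (0) without ever risking anything:
newcomer_score = log_score(1.0)

print(honest_score, newcomer_score)
```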
Humans are similar to hypotheses in that they can generate probabilistic predictions of events. However, we cannot judge humans on their “descriptive complexity”, so the simplicity-prior solution to the problem of old evidence is not available here.
Logical Uncertainty
Even in cases where we can measure simplicity precisely, as in Solomonoff Induction, is it really a perfect correction for the problem of old evidence? The idea becomes implausible in cases of logical uncertainty.
Simplicity priors seem like a very plausible solution to the problem of old evidence in the case of empirical uncertainty. If I try to “cheat” by baking in some known information into my hypothesis, without having real explanatory insight, then the description length of my hypothesis will be expanded by the number of bits I sneak in. This will penalize the prior probability by exactly the amount I stand to benefit by predicting those bits! In other words, the penalty is balanced so that it does not matter whether I try to “cheat” or not.
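The balance can be made explicit. Suppose my hypothesis has description length ℓ bits, and I “cheat” by hard-coding n bits of known data which the honest version of the hypothesis would have treated as n independent 50/50 guesses. The posterior weight comes out the same either way:

$$\underbrace{2^{-(\ell+n)}}_{\text{prior (cheating)}} \cdot \underbrace{1}_{\text{likelihood}} \;=\; \underbrace{2^{-\ell}}_{\text{prior (honest)}} \cdot \underbrace{2^{-n}}_{\text{likelihood}}.$$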
The same argument does not hold if I am predicting mathematical facts rather than empirical facts, however. Mathematicians are often in a situation where they already know how to calculate a sequence of numbers, but they are looking for some deeper understanding, such as a closed-form expression for the sequence, or a statistical model of it (e.g., the prime number theorem describes the statistical distribution of the primes). It is common to compute long stretches of a sequence in order to check conjectures against more of it, and in doing so, to treat the computed numbers as evidence for the conjecture.
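As a small illustration of this practice (the particular cutoffs are arbitrary), one can sieve the primes up to some bound and compare the prime counting function against the prime number theorem’s estimate n/ln(n); computing further out is naturally read as gathering more evidence for the asymptotic claim, even though every one of these numbers was determined in advance.

```python
import math

def primes_up_to(limit):
    """Sieve of Eratosthenes: return all primes <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]

# Compare pi(n) with the prime number theorem's estimate n / ln(n).
for n in (10**3, 10**4, 10**5):
    pi_n = len(primes_up_to(n))
    estimate = n / math.log(n)
    print(f"n={n:>6}  pi(n)={pi_n:>5}  n/ln(n)={estimate:8.1f}  ratio={pi_n / estimate:.3f}")
```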
If I claimed to have some way to predict the prime numbers, but it turned out that my method actually had one of the standard ways to calculate prime numbers hidden within the source code, I would be accused of “cheating” in much the same way that a scientific hypothesis about gravity would be “cheating” if its source code included big tables of the observed orbits of celestial bodies. However, since mathematical sequences are produced from simple definitions, this “cheating” will not be registered by a simplicity prior.
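A toy version of this kind of “cheat”: the predictor below has a tiny description length, so a simplicity prior sees nothing suspicious about it, yet all it does is hide a standard primality test (trial division) behind a grand-sounding, entirely hypothetical name.

```python
def deep_insight_into_the_primes(n: int) -> bool:
    """Presented as a novel theory of the primes; actually just trial division."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

# "Predicts" the sequence perfectly, for as far as anyone cares to check:
print([n for n in range(2, 50) if deep_insight_into_the_primes(n)])
```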
Paradox of Ignorance
Paul Christiano presents the “paradox of ignorance” where a weaker, less informed agent appears to outperform a more powerful, more informed agent in certain situations. This seems to contradict the intuitive desideratum that more information should always lead to better performance.
The example given is of two agents, one powerful and one limited, trying to determine the truth of a universal statement ∀x:ϕ(x) for some Δ0 formula ϕ. The limited agent treats each new value of ϕ(x) as a surprise and evidence about the generalization ∀x:ϕ(x). So it can query the environment about some simple inputs x and get a reasonable view of the universal generalization.
In contrast, the more powerful agent may be able to deduce ϕ(x) directly for simple x. Because it assigns these statements prior probability 1, they don’t act as evidence at all about the universal generalization ∀x:ϕ(x). So the powerful agent must consult the environment about more complex examples and pay a higher cost to form reasonable beliefs about the generalization.
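Here is a toy numerical model of the two agents (the prior, the likelihood model for the case where the generalization fails, and all specific numbers are illustrative assumptions, not anything from Christiano’s argument):

```python
def posterior_after_instances(k, prior=0.5, p_instance_if_false=0.9):
    """P(generalization | first k simple instances verified), under a toy model
    where, if the generalization is false, each simple instance still happens
    to hold with probability p_instance_if_false, independently."""
    likelihood_true = 1.0                      # the generalization entails every instance
    likelihood_false = p_instance_if_false ** k
    return (prior * likelihood_true) / (
        prior * likelihood_true + (1 - prior) * likelihood_false
    )

# Limited agent: each verified instance is a mild surprise, so belief climbs.
for k in (0, 5, 20, 50):
    print(f"limited agent after {k:2d} instances: {posterior_after_instances(k):.3f}")

# Powerful agent: it already assigns probability 1 to each simple instance,
# so verifying them is no surprise at all; its belief stays at the prior
# until it pays to check genuinely uncertain, more complex instances.
print("powerful agent after the same checks: 0.500")
```

The limited agent’s confidence in the generalization climbs as it checks simple instances; the powerful agent, having already deduced those instances, is stuck at its prior until it spends resources on harder cases.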
This can be seen as a variant of the problem of old evidence in which the “old evidence” is embedded into the prior rather than modeled as observations. It is as if everyone had simply always known about the orbit of Mercury, rather than having studied it through a telescope at some point.
This causes trouble for the typical Bayesian solution to the problem, where we imagine that all hypotheses were around “at the very beginning” so that all hypotheses can gain/lose probability based on how well they predict. In Paul’s version, since the information is encoded into the prior, there is no opportunity to “predict” it at all.
This poses a problem for a picture where a “better” prior is one which “knows more”.
It also poses a problem for the view that “one man’s prior is another’s posterior”. On that view, the question “which beliefs are this agent’s prior?” only has an answer relative to a particular update being performed; there is no ultimate prior (what philosophers call an ur-prior), only priors and posteriors relative to specific updates. But then the Bayesian seems to lose the right to answer the problem of old evidence by imagining that all hypotheses were present from the very beginning, since there is no objective beginning.
Adding Hypotheses
Another perspective on the problem of old evidence is to think of it as a question of how to add new hypotheses over time. The typical Bayesian solution of modeling all hypotheses as present from the beginning can be seen as dodging the question, rather than providing a solution. Suppose we wish to model the rationality of a thinking being who can only consider finitely many hypotheses at a time, but who may formulate new hypotheses (and discard old ones) over time. Should there not be rationality principles governing this activity?