Priors as Mathematical Objects
Followup to: “Inductive Bias”
What exactly is a “prior”, as a mathematical object? Suppose you’re looking at an urn filled with red and white balls. When you draw the very first ball, you haven’t yet had a chance to gather much evidence, so you start out with a rather vague and fuzzy expectation of what might happen—you might say “fifty/fifty, even odds” for the chance of getting a red or white ball. But you’re ready to revise that estimate for future balls as soon as you’ve drawn a few samples. So then this initial probability estimate, 0.5, is not repeat not a “prior”.
An introduction to Bayes’s Rule for confused students might refer to the population frequency of breast cancer as the “prior probability of breast cancer”, and the revised probability after a positive mammography as the “posterior probability”. But in the scriptures of Deep Bayesianism, such as Probability Theory: The Logic of Science, one finds a quite different concept—that of prior information, which includes e.g. our beliefs about the sensitivity and specificity of mammography exams. Our belief about the population frequency of breast cancer is only one small element of our prior information.
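To make the prior/posterior distinction concrete (with purely illustrative numbers of my own, not figures from this post): suppose the population frequency of breast cancer is 1%, the exam detects cancer 80% of the time when it is present, and false-alarms 10% of the time when it is absent. Note that the sensitivity and false-alarm rate are part of the prior information too; the base rate is only one number among them. A sketch in Python:

p_cancer = 0.01               # prior probability: population frequency (illustrative)
p_positive_if_cancer = 0.80   # sensitivity of the exam (illustrative)
p_positive_if_healthy = 0.10  # false-positive rate (illustrative)

# Bayes's Rule: P(cancer | positive) = P(positive | cancer) P(cancer) / P(positive)
p_positive = (p_positive_if_cancer * p_cancer
              + p_positive_if_healthy * (1 - p_cancer))
posterior = p_positive_if_cancer * p_cancer / p_positive
print(posterior)  # roughly 0.075: the posterior probability after a positive exam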
In my earlier post on inductive bias, I discussed three possible beliefs we might have about an urn of red and white balls, which will be sampled without replacement:
Case 1: The urn contains 5 red balls and 5 white balls;
Case 2: A random number was generated between 0 and 1, and each ball was selected to be red (or white) at this probability;
Case 3: A monkey threw balls into the urn, each with a 50% chance of being red or white.
In each case, if you ask me—before I draw any balls—to estimate my marginal probability that the fourth ball drawn will be red, I will respond “50%”. And yet, once I begin observing balls drawn from the urn, I reason from the evidence in three different ways:
Case 1: Each red ball drawn makes it less likely that future balls will be red, because I believe there are fewer red balls left in the urn.
Case 2: Each red ball drawn makes it more plausible that future balls will be red, because I will reason that the random number was probably higher, and that the urn is hence more likely to contain mostly red balls.
Case 3: Observing a red or white ball has no effect on my future estimates, because each ball was independently selected to be red or white at a fixed, known probability.
Suppose I write a Python program to reproduce my reasoning in each of these scenarios. The program will take in a record of balls observed so far, and output an estimate of the probability that the next ball drawn will be red. It turns out that the only necessary information is the count of red balls seen and white balls seen, which we will respectively call R and W. So each program accepts inputs R and W, and outputs the probability that the next ball drawn is red:
Case 1: return (5 - R)/(10 - R - W) # Number of red balls remaining / total balls remaining
Case 2: return (R + 1)/(R + W + 2) # Laplace’s Law of Succession
Case 3: return 0.5
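Written out as runnable Python (the function names are mine; the post specifies only the one-line bodies above), the three programs might look like this:

def predict_case_1(R, W):
    # 5 red and 5 white balls, drawn without replacement:
    # red balls remaining divided by total balls remaining.
    return (5 - R) / (10 - R - W)

def predict_case_2(R, W):
    # Red fraction drawn uniformly from [0, 1]:
    # Laplace's Law of Succession.
    return (R + 1) / (R + W + 2)

def predict_case_3(R, W):
    # Each ball independently red with probability 1/2.
    return 0.5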
These programs are correct so far as they go. But unfortunately, probability theory does not operate on Python programs. Probability theory is an algebra of uncertainty, a calculus of credibility, and Python programs are not allowed in the formulas. It is like trying to add 3 to a toaster oven.
To use these programs in the probability calculus, we must figure out how to convert a Python program into a more convenient mathematical object—say, a probability distribution.
Suppose I want to know the combined probability that the sequence observed will be RWWRR, according to program 2 above. Program 2 does not have a direct faculty for returning the joint or combined probability of a sequence, but it is easy to extract anyway. First, I ask what probability program 2 assigns to observing R, given that no balls have been observed. Program 2 replies “1/2”. Then I ask the probability that the next ball is R, given that one red ball has been observed; program 2 replies “2/3”. The second ball is actually white, to which program 2 therefore assigns probability 1 - 2/3 = 1/3, so the joint probability so far is 1/2 * 1/3 = 1/6. Next I ask for the probability that the third ball is red, given that the previous observations are RW; this is summarized as “one red and one white ball”, and the answer is 1/2. The third ball is white, so the joint probability for RWW is 1/12. For the fourth ball, given the previous observation RWW, the probability of redness is 2/5, and the joint probability goes to 1/30. We can write this as p(RWWR|RWW) = 2/5, which means that if the sequence so far is RWW, the probability assigned by program 2 to the sequence continuing with R and forming RWWR equals 2/5. And then p(RWWRR|RWWR) = 1/2, and the combined probability is 1/60.
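Here is a small sketch of that chaining procedure (the helper name is my own; it reuses predict_case_2 from above):

def sequence_probability(sequence, predict):
    # Multiply together the probability the program assigns to each
    # observation, given the counts of balls seen before it.
    p, R, W = 1.0, 0, 0
    for ball in sequence:
        p_red = predict(R, W)
        p *= p_red if ball == "R" else 1 - p_red
        if ball == "R":
            R += 1
        else:
            W += 1
    return p

print(sequence_probability("RWWRR", predict_case_2))  # 1/60, about 0.0167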
We can do this with every possible sequence of ten balls, and end up with a table of 1024 entries. This table of 1024 entries constitutes a probability distribution over sequences of observations of length 10, and it says everything the Python program had to say (about 10 or fewer observations, anyway). Suppose I have only this probability table, and I want to know the probability that the third ball is red, given that the first two balls drawn were white. I need only sum over the probability of all entries beginning with WWR, and divide by the probability of all entries beginning with WW.
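As a sketch (reusing the helper above), building the table and answering that query takes only a few lines:

from itertools import product

# The 1024-entry table: probability of every length-10 sequence under program 2.
table = {}
for seq in product("RW", repeat=10):
    key = "".join(seq)
    table[key] = sequence_probability(key, predict_case_2)

# P(third ball red | first two balls white): sum the entries beginning with WWR,
# divide by the sum of the entries beginning with WW.
p_wwr = sum(p for seq, p in table.items() if seq.startswith("WWR"))
p_ww = sum(p for seq, p in table.items() if seq.startswith("WW"))
print(p_wwr / p_ww)  # 0.25, matching Laplace's Law: (R + 1)/(R + W + 2) = 1/4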
We have thus transformed a program that computes the probability of future events given past experiences, into a probability distribution over sequences of observations.
You wouldn’t want to do this in real life, because the Python program is ever so much more compact than a table with 1024 entries. The point is not that we can turn an efficient and compact computer program into a bigger and less efficient giant lookup table; the point is that we can view an inductive learner as a mathematical object, a distribution over sequences, which readily fits into standard probability calculus. We can take a computer program that reasons from experience and think about it using probability theory.
Why might this be convenient? Say that I’m not sure which of these three scenarios best describes the urn—I think it’s about equally likely that each of the three cases holds true. How should I reason from my actual observations of the urn? If you think about the problem from the perspective of constructing a computer program that imitates my inferences, it looks complicated—we have to juggle the relative probabilities of each hypothesis, and also the probabilities within each hypothesis. If you think about it from the perspective of probability theory, the obvious thing to do is to add up all three distributions with weightings of 1/3 apiece, yielding a new distribution (which is in fact correct). Then the task is just to turn this new distribution into a computer program, which turns out not to be difficult.
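A sketch of how that conversion might go (my own construction, consistent with the reasoning above): keep a weight on each hypothesis, multiply each weight by the probability that hypothesis assigned to the balls seen so far, and predict with the weighted average.

predictors = [predict_case_1, predict_case_2, predict_case_3]

def predict_mixture(balls_seen):
    # Weight each hypothesis by 1/3 times the probability it assigned
    # to the observations so far (an unnormalized posterior).
    weights = [sequence_probability(balls_seen, f) / 3 for f in predictors]
    total = sum(weights)
    R, W = balls_seen.count("R"), balls_seen.count("W")
    # Weighted average of the three next-ball predictions.
    return sum((w / total) * f(R, W) for w, f in zip(weights, predictors))

print(predict_mixture(""))      # 0.5: all three hypotheses agree before any data
print(predict_mixture("RRRR"))  # roughly 0.7: Case 2 now carries most of the weight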
So that is what a prior really is—a mathematical object that represents all of your starting information plus the way you learn from experience.