First things first, keep in mind the thing we’re actually interested in: P[model | data]. The Bayes factor summarizes the contribution of the data to that number, in a convenient and reusable way, but the Bayes factor itself is not what we really want to know.
In the example, one model is “unbiased coin”, the other is “some unknown bias”. The prior over the bias is our distribution on θ before seeing the data. Thus the name “prior”. It’s P[θ|model], not P[θ|model,data]. Before seeing the data, we have no reason at all to think that 4⁄5 is special. Thus, a uniform prior. This all comes directly from applying the rules of probability:

$$P[\text{data}\mid\text{model}_2]=\int_\theta P[\text{data}\mid\theta]\,p[\theta\mid\text{model}_2]\,d\theta$$
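Since that prior is uniform, p[θ|model2] = 1 on [0, 1], so the marginal likelihood is just the likelihood averaged over every possible bias; that’s why no explicit prior factor shows up inside the integral in the worked calculation below:

$$P[\text{data}\mid\text{model}_2]=\int_0^1 P[\text{data}\mid\theta]\,d\theta$$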
If you stick strictly to the rules of probability itself, you will never find any reason to maximize anything. There is no built-in maximization anywhere in probability. There are lots of integrals (or sums, in the discrete case), and we usually approximate those integrals by maximizing things—but those are approximations, always. In particular, if you try to compute P[model | data], you will not find any maximization of anything—unless you introduce it to approximate an integral.
In fact, let’s go ahead and write out the whole example in one place, just this once:
$$\frac{P[\text{model}_1\mid\text{data}]}{P[\text{model}_2\mid\text{data}]}=\frac{P[\text{data}\mid\text{model}_1]}{P[\text{data}\mid\text{model}_2]}\cdot\frac{P[\text{model}_1]}{P[\text{model}_2]}$$

$$\frac{P[\text{data}\mid\text{model}_1]}{P[\text{data}\mid\text{model}_2]}=\frac{P[\text{data}\mid\theta=\tfrac{1}{2}]}{\int_\theta P[\text{data}\mid\theta]\,p[\theta\mid\text{model}_2]\,d\theta}=\frac{\binom{20}{4}\left(\tfrac{1}{2}\right)^{16}\left(\tfrac{1}{2}\right)^{4}}{\int_\theta\binom{20}{4}\,\theta^{16}(1-\theta)^{4}\,d\theta}=\frac{0.0046}{0.048}$$
… and there we have it. The relative probability of each model, given the data, calculated from first principles, is $\frac{0.0046}{0.048}\cdot\frac{P[\text{model}_1]}{P[\text{model}_2]}$. No approximation (except maybe in the choice of the models themselves), no maximization.
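As a quick check on the arithmetic: the integral has a closed form, $\int_0^1\theta^{16}(1-\theta)^4\,d\theta=\frac{16!\,4!}{21!}$, so $P[\text{data}\mid\text{model}_2]=\binom{20}{4}\cdot\frac{16!\,4!}{21!}=\frac{1}{21}\approx 0.048$. A few lines of Python (purely illustrative, not part of the original argument) reproduce both numbers:

```python
from math import comb

# P[data | model 1]: probability of 16 heads and 4 tails from a fair coin.
p_data_m1 = comb(20, 4) * 0.5**16 * 0.5**4  # ~0.0046

# P[data | model 2]: the same likelihood averaged over a uniform prior on theta.
# A crude Riemann sum stands in for the integral; it converges to 1/21.
N = 100_000
p_data_m2 = comb(20, 4) * sum(
    t**16 * (1 - t)**4 for t in ((i + 0.5) / N for i in range(N))
) / N  # ~0.048

# Bayes factor ~0.097: the data favor the biased-coin model by roughly 10:1.
print(p_data_m1, p_data_m2, p_data_m1 / p_data_m2)
```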
Continuum of Models
We could imagine, rather than two models, a whole spectrum of models—one for each θ value. In that case, it’s true that θ = 4⁄5 would have a Bayes factor of 47. However, θ = 4⁄5 would also have a prior probability of zero: there are infinitely many possible θ values, and we don’t have any reason to favor 4⁄5 a priori. To get nonzero P[model|data], we’d have to consider a small window of width dθ around 4⁄5, and then $P[\text{model}_{\theta\pm d\theta/2}\mid\text{data}]\approx\frac{P[\text{data}\mid\theta=4/5]\,p[\theta=4/5]}{P[\text{data}]}\,d\theta$. Again, this all just comes directly from the rules of probability—and you’ll notice that the prior p[θ]dθ has re-entered the picture, exactly the same as before, except now we’re thinking of it as a prior density on the models parameterized by θ.
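For concreteness, that 47 is just the likelihood ratio of the same data under θ = 4⁄5 versus the unbiased coin’s θ = 1⁄2 (the binomial coefficients cancel):

$$\frac{P[\text{data}\mid\theta=\tfrac{4}{5}]}{P[\text{data}\mid\theta=\tfrac{1}{2}]}=\frac{\left(\tfrac{4}{5}\right)^{16}\left(\tfrac{1}{5}\right)^{4}}{\left(\tfrac{1}{2}\right)^{20}}\approx 47$$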
What’s really interesting about this model is how we handle the unbiased coin hypothesis. θ = 4⁄5 exactly has prior 0, because there’s a whole continuum of θ-values and no reason to favor one of them a priori. But there is reason to favor θ=1/2 a priori: that’s the unbiased coin hypothesis. If we think there’s an 80% chance that the coin is perfectly unbiased, then θ=1/2 needs to have prior probability 0.8. That implies that our prior involves a delta function 0.8δ(θ−1/2), plus whatever prior distribution we have on all the other θ-values. (In practice, we probably think even “unbiased” coins have some tiny bias, so we’d have some very sharp peak in the prior around 1⁄2 rather than a true delta function.)
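Written out, and taking (purely for illustration) the remaining 20% of prior mass to be spread uniformly over [0, 1], that prior and the resulting posterior probability of the unbiased hypothesis look like:

$$p[\theta]=0.8\,\delta(\theta-\tfrac{1}{2})+0.2\cdot 1_{[0,1]}(\theta)$$

$$P[\theta=\tfrac{1}{2}\mid\text{data}]=\frac{0.8\,P[\text{data}\mid\theta=\tfrac{1}{2}]}{0.8\,P[\text{data}\mid\theta=\tfrac{1}{2}]+0.2\int_0^1 P[\text{data}\mid\theta]\,d\theta}\approx\frac{0.8\cdot 0.0046}{0.8\cdot 0.0046+0.2\cdot 0.048}\approx 0.28$$

So even with an 80% prior on “perfectly unbiased”, the example’s data pull that probability down to roughly 28%.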
Why Not Maximize Likelihood?
The practical reason for integrating over a prior distribution, rather than just maximizing likelihood, is overfit. If we stick strictly to the rules of probability, without any approximation, then we will not ever overfit. Overfit comes from approximations, in particular maximum likelihood.
This example would take a lot longer to write up, but… consider the canonical overfit problem: you have a dozen x-y points, and try fitting polynomials of order 0, 1, 2, 3, and 4. The 4th-order polynomial has highest maximum likelihood, but it’s obviously overfitting. Thing is, the 4th-order model only has high likelihood for some very specific coefficients—most possible coefficients for a 4th-order polynomial have low likelihood. So, when we integrate over all possible coefficient values, the 4th-order polynomial doesn’t look so good.
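To make that concrete, here is a rough sketch of the comparison, under assumptions that are not in the original text: a dozen points generated from a noisy line, Gaussian noise with known standard deviation, and independent standard-normal priors on the polynomial coefficients (which makes the marginal likelihood a closed-form Gaussian). Maximum likelihood can only improve as the order goes up, while the marginal likelihood typically peaks near the low order that actually generated the data:

```python
import numpy as np

rng = np.random.default_rng(0)

# A dozen x-y points from a noisy line, so the "right" order is 1.
n, sigma = 12, 0.3
x = np.linspace(-1, 1, n)
y = 0.5 + 1.2 * x + rng.normal(0, sigma, n)

for order in range(5):
    # Design matrix with columns 1, x, x^2, ..., x^order.
    Phi = np.vander(x, order + 1, increasing=True)

    # Maximum likelihood = least squares: more coefficients never fit worse.
    coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    resid = y - Phi @ coeffs
    max_loglik = -0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * resid @ resid / sigma**2

    # Marginal likelihood: integrate the likelihood over a N(0, 1) prior on each
    # coefficient. With Gaussian noise that integral has a closed form:
    # y ~ N(0, sigma^2 I + Phi Phi^T).
    C = sigma**2 * np.eye(n) + Phi @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    marg_loglik = -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

    print(f"order {order}:  max log-likelihood {max_loglik:7.2f}   "
          f"log marginal likelihood {marg_loglik:7.2f}")
```

The prior over the coefficients is what does the work here: a 4th-order polynomial only fits well in a tiny corner of its coefficient space, so averaging the likelihood over the whole prior drags its marginal likelihood down.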
That coefficient-averaging story is an intuitive, ad-hoc explanation, but more generally, we know that overfit must stem from somehow deviating from the exact rules of probability. Why? Because the rules of probability tell us exactly what we’re able to figure out, given our information. If a probabilistic calculation is exact, and includes all of our information, then we should not be able to predict any way in which it’s wrong—unless we use some additional information. (Note that this is a Bayesian principle; not everyone agrees with it, but oddly enough no counterexamples have stood the test of time.) In the case of overfit, we know that certain conclusions are usually wrong, without using any additional information—so that predictable wrongness must have come from some approximation.
Learn More?
If you want to learn more, check out this book (or the free pdf hosted by the author’s former university), especially chapters 1, 2, and 20.