At least within Bayesian probability, there is a single unique unambiguously-correct answer to “how should we penalize for complexity?”: the Bayes factor. The Bayes factor is hard to compute in general, which is why there’s a whole slew of other numbers which approximate it in various ways.
Here’s how Bayes factors work. Want to know whether model 1 or model 2 is more consistent with the data? Then compute P[model1|data] and P[model2|data]. Using Bayes’ rule:
$$P[\text{model}_i \mid \text{data}] = \frac{1}{Z}\, P[\text{data} \mid \text{model}_i]\, P[\text{model}_i]$$
where Z is the normalizer. If we’re just comparing two models, then we can get rid of that annoying Z by computing odds for the two models:

$$\frac{P[\text{model}_1 \mid \text{data}]}{P[\text{model}_2 \mid \text{data}]} = \frac{P[\text{data} \mid \text{model}_1]}{P[\text{data} \mid \text{model}_2]} \cdot \frac{P[\text{model}_1]}{P[\text{model}_2]}$$

In English: the posterior relative odds of the two models equal the prior odds times the ratio of likelihoods. That likelihood ratio $\frac{P[\text{data} \mid \text{model}_1]}{P[\text{data} \mid \text{model}_2]}$ is the Bayes factor: it directly describes the update in the relative odds of the two models, due to the data.
Example
20 coin flips yield 16 heads and 4 tails. Is the coin biased?
Here we have two models:
Model 1: coin unbiased
Model 2: coin has some unknown probability θ of coming up heads (we’ll use a uniform prior on θ for simplicity; this is not a good idea in general)
The second model has one free parameter (the bias) which we can use to fit the data, but it’s more complex and prone to over-fitting. When we integrate over that free parameter, it will fit the data poorly over most of the parameter space—thus the “penalty” associated with free parameters in general.
In this example, the integral is exactly tractable (it’s a dirichlet-multinomial model), and we get:

$$P[\text{data} \mid \text{model}_1] = \binom{20}{16}\left(\tfrac{1}{2}\right)^{16}\left(\tfrac{1}{2}\right)^{4} \approx 0.0046$$

$$P[\text{data} \mid \text{model}_2] = \int_0^1 \binom{20}{16}\,\theta^{16}(1-\theta)^{4}\,d\theta = \tfrac{1}{21} \approx 0.048$$

So the Bayes factor is 0.048/0.0046 ≈ 10, in favor of a biased coin. In practice, I’d say unbiased coins are at least 10x more likely than biased coins in day-to-day life a priori, so we might still think the coin is unbiased. But if we were genuinely unsure to start with, then this would be pretty decent evidence in favor.
Approximation
Why is this hard to compute in general? Well, if the model has an m-dimensional vector of free parameters θ, then $P[\text{data}\mid\text{model}] = \int_\theta P[\text{data}\mid\text{model},\theta]\,dP[\theta\mid\text{model}]$, which is an integral in m dimensions. That’s not just NP-complete, it’s #P-complete.
There are various ways to approximate that integral. The two most common are:
MCMC (see e.g. here for application to Bayes factors): main limitation is that MCMC gets finicky in high dimensions.
Laplace approximation around the maximum-likelihood point. Aside from the usual limitations of using a single maximum-likelihood point, main limitation is that we need the determinant of the Hessian, which takes somewhere between O(k) and O(k^3) to compute for most k-free-parameter models (depending on how clever we are with the linear algebra).
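For concreteness, here’s a small Python sketch (my own, not part of the original comment) of the Laplace approximation for the coin example above, where θ is the only free parameter (so k = 1), compared against the exact answer of 1/21:

```python
import numpy as np
from math import comb

# Coin example from above: 16 heads, 4 tails, uniform prior on theta, k = 1.
n_heads, n_tails = 16, 4
n = n_heads + n_tails

def log_likelihood(theta):
    """ln P[data | theta] for the binomial model."""
    return np.log(comb(n, n_heads)) + n_heads * np.log(theta) + n_tails * np.log(1 - theta)

# Exact marginal likelihood under the uniform prior (beta-binomial): 1/(n+1) = 1/21.
exact = 1.0 / (n + 1)

# Laplace: P[data|theta*] p[theta*] (2 pi)^(k/2) det(-H)^(-1/2), with H the second
# derivative of the log-likelihood at the maximum-likelihood point theta*.
theta_star = n_heads / n                                      # ML estimate = 0.8
hessian = -n_heads / theta_star**2 - n_tails / (1 - theta_star)**2
prior_density = 1.0                                           # uniform prior on [0, 1]
laplace = (np.exp(log_likelihood(theta_star)) * prior_density
           * np.sqrt(2 * np.pi) * (-hessian) ** -0.5)

print(f"exact:   {exact:.4f}")    # 0.0476
print(f"laplace: {laplace:.4f}")  # ~0.0489
```

Even with only 20 data points and k = 1, the approximation lands within a few percent of the exact integral.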
BIC
The BIC can be derived directly from the Laplace approximation. Laplace says:

$$\int_\theta P[\text{data}\mid\theta]\,p[\theta]\,d\theta \;\approx\; P[\text{data}\mid\theta^*]\,p[\theta^*]\,(2\pi)^{k/2}\det\left(-\frac{\partial^2 \ln P[\text{data}\mid\theta]}{\partial\theta^2}\right)^{-1/2}$$

where $\theta^*$ is the maximum likelihood point and k is the dimension of θ.
Notice that $\ln\!\big(P[\text{data}\mid\theta^*]\,p[\theta^*]\big)$ is the usual “score” of a model; the rest is what we can think of as a complexity penalty, and that’s what the BIC will approximate. The main assumption is that $\frac{\partial^2 \ln P[\text{data}\mid\theta]}{\partial\theta^2}$ scales with the number of data points N; this is trivial if the data points are iid (given θ), since in that case $\ln P[\text{data}\mid\theta]$ is a sum over data points.
Now for the trick: that’s the only part of the Laplace approximation’s “complexity penalty” which scales with the number of data points N. It scales in proportion to $N^{-k/2}$, since each of k dimensions of the determinant contributes a factor N. Take the log of the whole thing, and all the rest will be constants which are dominated by the $\frac{k}{2}\ln N$ term as N goes to infinity. And that’s the BIC: log of the MAP probability, minus $\frac{k}{2}\ln N$.
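Spelling that out (a short derivation sketch of my own, not from the original comment; $\bar{H}$ is notation I’m introducing for the per-data-point average Hessian), take the log of the Laplace formula above:

$$\ln \int_\theta P[\text{data}\mid\theta]\,p[\theta]\,d\theta \;\approx\; \ln\!\big(P[\text{data}\mid\theta^*]\,p[\theta^*]\big) + \frac{k}{2}\ln 2\pi - \frac{1}{2}\ln\det\!\left(-\frac{\partial^2 \ln P[\text{data}\mid\theta]}{\partial\theta^2}\right)$$

If the Hessian scales with N, i.e. $\frac{\partial^2 \ln P[\text{data}\mid\theta]}{\partial\theta^2} = N\bar{H}$, then $\det(-N\bar{H}) = N^k\det(-\bar{H})$, so the last term becomes $-\frac{k}{2}\ln N - \frac{1}{2}\ln\det(-\bar{H})$. Everything other than $\ln\!\big(P[\text{data}\mid\theta^*]\,p[\theta^*]\big) - \frac{k}{2}\ln N$ stays bounded as N grows, and what remains is the BIC.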
Finally, the important part: the main benefit of knowing all this is being able to predict when the BIC will and will not work well. If it’s an approximation of Laplace, which is itself an approximation of the Bayes factor, then BIC will work well exactly when those approximations are accurate. In particular, BIC fails when the terms we ignored (i.e. $\ln\!\left((2\pi)^{k/2}\det\!\left(-\frac{1}{N}\frac{\partial^2 \ln P[\text{data}\mid\theta]}{\partial\theta^2}\right)^{-1/2}\right)$) are as large as or even larger than $\ln\!\big(P[\text{data}\mid\theta^*]\,p[\theta^*]\big) - \frac{k}{2}\ln N$, i.e. the BIC itself.
Very helpful comment. But why the uniform prior, and not the bias that has the highest likelihood given the data? theta = 4⁄5 would give you a Bayes factor of 47, right?
First things first, keep in mind the thing we’re actually interested in: P[model | data]. The Bayes factor summarizes the contribution of the data to that number, in a convenient and reusable way, but the Bayes factor itself is not what we really want to know.
In the example, one model is “unbiased coin”, the other is “some unknown bias”. The prior over the bias is our distribution on θ before seeing the data. Thus the name “prior”. It’s P[θ|model], not P[θ|model,data]. Before seeing the data, we have no reason at all to think that 4⁄5 is special. Thus, a uniform prior. This all comes directly from applying the rules of probability:

$$P[\text{data}\mid\text{model}_2] = \int_\theta P[\text{data}\mid\theta]\,p[\theta\mid\text{model}_2]\,d\theta$$
If you stick strictly to the rules of probability itself, you will never find any reason to maximize anything. There is no built-in maximization anywhere in probability. There are lots of integrals (or sums, in the discrete case), and we usually approximate those integrals by maximizing things—but those are approximations, always. In particular, if you try to compute P[model | data], you will not find any maximization of anything—unless you introduce it to approximate an integral.
In fact, let’s go ahead and write out the whole example in one place, just this once:

$$\frac{P[\text{model}_1\mid\text{data}]}{P[\text{model}_2\mid\text{data}]} = \frac{P[\text{data}\mid\text{model}_1]}{P[\text{data}\mid\text{model}_2]}\cdot\frac{P[\text{model}_1]}{P[\text{model}_2]}$$

$$\frac{P[\text{data}\mid\text{model}_1]}{P[\text{data}\mid\text{model}_2]} = \frac{P[\text{data}\mid\theta=\tfrac12]}{\int_\theta P[\text{data}\mid\theta]\,p[\theta\mid\text{model}_2]\,d\theta} = \frac{\binom{20}{16}\left(\tfrac12\right)^{16}\left(\tfrac12\right)^{4}}{\int_\theta \binom{20}{16}\,\theta^{16}(1-\theta)^{4}\,d\theta} = \frac{0.0046}{0.048}$$

… and there we have it. The relative probability of each model, given the data, calculated from first principles, is $\frac{0.0046}{0.048}\cdot\frac{P[\text{model}_1]}{P[\text{model}_2]}$. No approximation (except maybe in the choice of the models themselves), no maximization.
Continuum of Models
We could imagine, rather than two models, a whole spectrum of models—one for each θ value. In that case, it’s true that θ = 4⁄5 would have a Bayes factor of 47. However, θ = 4⁄5 would also have a prior probability of zero: there are infinitely many possible θ values, and we don’t have any reason to favor 4⁄5 a priori. To get nonzero P[model|data], we’d have to consider a small window of width dθ around 4⁄5, and then $P[\text{model}_{\theta\pm d\theta/2}\mid\text{data}] \approx P[\text{data}\mid\theta=4/5]\,p[\theta=4/5]\,d\theta$. Again, this all just comes directly from the rules of probability—and you’ll notice that the prior p[θ]dθ has re-entered the picture, exactly the same as before, except now we’re thinking of it as a prior density on the models parameterized by θ.
What’s really interesting about this model is how we handle the unbiased coin hypothesis. θ = 4⁄5 exactly has prior 0, because there’s a whole continuum of θ-values and no reason to favor one of them a priori. But there is reason to favor θ=1/2 a priori: that’s the unbiased coin hypothesis. If we think there’s an 80% chance that the coin is perfectly unbiased, then θ=1/2 needs to have prior probability 0.8. That implies that our prior involves a delta function 0.8δ(θ−1/2), plus whatever prior distribution we have on all the other θ-values. (In practice, we probably think even “unbiased” coins have some tiny bias, so we’d have some very sharp peak in the prior around 1⁄2 rather than a true delta function.)
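To put numbers on that (a quick calculation of my own, assuming the 80/20 split above with a uniform density on the remaining θ values): the prior would be $p[\theta] = 0.8\,\delta(\theta - 1/2) + 0.2$, so plugging in the values from the coin example,

$$P[\text{data}] = 0.8 \times 0.0046 + 0.2 \times 0.048 \approx 0.013, \qquad P[\theta = 1/2 \mid \text{data}] = \frac{0.8 \times 0.0046}{0.013} \approx 0.28,$$

i.e. those 16 heads out of 20 would take us from 80% confidence in a fair coin down to roughly 28%.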
Why Not Maximize Likelihood?
The practical reason for integrating over a prior distribution, rather than just maximizing likelihood, is overfit. If we stick strictly to the rules of probability, without any approximation, then we will not ever overfit. Overfit comes from approximations, in particular maximum likelihood.
This example would take a lot longer to write up, but… consider the canonical overfit problem: you have a dozen x-y points, and try fitting polynomials of order 0, 1, 2, 3, and 4. The 4th-order polynomial has highest maximum likelihood, but it’s obviously overfitting. Thing is, the 4th-order model only has high likelihood for some very specific coefficients—most possible coefficients for a 4th-order polynomial have low likelihood. So, when we integrate over all possible coefficient values, the 4th-order polynomial doesn’t look so good.
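Here’s a rough Python sketch of that experiment (my own illustration, not from the original comment; the data, noise level, and the Gaussian prior on the coefficients are all made-up choices, picked so that the integral over coefficients has a closed form):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical data: a dozen x-y points from a noisy straight line, so anything
# above order 1 is pure overfitting.
x = np.linspace(-1, 1, 12)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)
sigma, tau = 0.3, 2.0   # assumed known noise level; prior std on each coefficient

for order in range(5):
    Phi = np.vander(x, order + 1)   # design matrix: columns x^order, ..., x^0

    # Maximum likelihood: ordinary least squares for this order.
    w_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    resid = y - Phi @ w_ml
    max_loglik = (-0.5 * np.sum((resid / sigma) ** 2)
                  - x.size * np.log(sigma * np.sqrt(2 * np.pi)))

    # Marginal likelihood P[data|model]: integrate the coefficients out against a
    # N(0, tau^2 I) prior. For Gaussian prior + Gaussian noise this is exact:
    # y ~ N(0, sigma^2 I + tau^2 Phi Phi^T).
    cov = sigma**2 * np.eye(x.size) + tau**2 * Phi @ Phi.T
    log_marginal = multivariate_normal(mean=np.zeros(x.size), cov=cov).logpdf(y)

    print(f"order {order}: max log-likelihood {max_loglik:7.2f}, "
          f"log P[data|model] {log_marginal:7.2f}")
```

The maximum log-likelihood only improves as the order goes up, but the marginal likelihood typically peaks near the order that actually generated the data and falls off as extra coefficients are added.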
That’s an intuitive ad-hoc explanation, but more generally, we know that overfit must stem from somehow deviating from the exact rules of probability. Why? Because the rules of probability tell us exactly what we’re able to figure out, given our information. If a probabilistic calculation is exact, and includes all of our information, then we should not be able to predict any way in which it’s wrong—unless we use some additional information. (Note that this is a Bayesian principle; not everyone agrees with it, but oddly enough no counterexamples have stood the test of time.) In the case of overfit, we know that certain conclusions are usually wrong, without using any additional information—so it must have come from some approximation.
Learn More?
If you want to learn more, check out this book (or the free pdf hosted by the author’s former university), especially chapters 1, 2, and 20.
Thanks for this. I’m trying to get an intuition on how this works.
My mental picture is to imagine the likelihood function with respect to theta for the more complex model. The simpler model is the equivalent of a rectangular function with height equal to its likelihood and width 1.
The relative areas under the graphs reflect the likelihoods of the models. So picturing the relative maximum likelihoods, and how sharp the peak is for the more complex model, gives an impression of the Bayes factor.
Does that work? Or is there a better mental model?
It’s kind of tricky to picture it in terms of areas, because the two models usually don’t live in the same space—one is an integral over more dimensions than the other. You can sometimes shoehorn it into the same space, e.g. by imagining that the likelihood for one model is constant with respect to some of the θ’s—it sounds like that’s what you’re trying to do here. That’s not wrong, and you can think of it that way if you want, but personally I find it a bit awkward.
Here’s a similar (but hopefully more natural) intuition:
Pick a model
Generate a θ value at random from that model’s prior
Compute the likelihood of the data given that θ value
Take the average likelihood from this process, and you have P[data|model]
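In code, that recipe applied to the coin example from earlier looks something like this (a minimal sketch of my own, not from the original comment; the sample count is arbitrary):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n_heads, n_tails = 16, 4

def likelihood(theta):
    """P[data | theta] for 16 heads and 4 tails in 20 flips."""
    return comb(20, 16) * theta**n_heads * (1 - theta)**n_tails

# Model 2: draw theta from the uniform prior, compute the likelihood, average.
thetas = rng.uniform(0, 1, size=200_000)
p_data_model2 = np.mean(likelihood(thetas))   # converges to 1/21 ~ 0.048

# Model 1 has no free parameter, so the "average" is just the likelihood at 1/2.
p_data_model1 = likelihood(0.5)               # ~ 0.0046

print(f"Bayes factor (model 2 vs model 1): {p_data_model2 / p_data_model1:.1f}")  # ~ 10
```

The average over the prior is exactly the integral $\int_\theta P[\text{data}\mid\theta]\,p[\theta]\,d\theta$ from before, just estimated by sampling.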
Compare this to the traditional maximum-likelihood approach: max likelihood says “what’s the highest likelihood which I could possibly assign to this data?”, whereas the Bayes factor approach says “what’s the average likelihood I’d assign to this data, given my prior knowledge?”. It’s analogous to the distinction between a maximax decision rule (“take the decision with the best available best-case outcome”) and a Bayesian decision rule (“take the decision with the best available average outcome”).
That said, the above intuition is still somewhat ad-hoc. Personally, I intuit Bayes factors with the same intuitions I use for any other probability calculations. It’s just a natural application of the usual probability ideas to the question of “what’s the right model, given my background knowledge and all this data?”. In particular, the intuitions involved are:
Bayes’ Rule
Gears-level understanding, usually manifesting as causal models. In this case, the causal model is θ → data, with the individual data points independent conditional on θ. I then reflexively think “if I knew the values of all the nodes in this causal model, then I could easily compute the probability of the whole thing”—i.e. I can easily compute P[data,θ|model].
Summing over unknown values: I need something like P[data|model] (in which I don’t know theta), and it would be easy to compute if I knew theta. So, I break it into a sum over theta: $P[\text{data}\mid\text{model}] = \sum_\theta P[\text{data},\theta\mid\text{model}]$. Intuitively, I’m considering all the possible ways the world could be, while still consistent with my model.
… and of course these are all standard probability-related reflexes with multiple possible intuitions underneath them.
I keep meaning to write a post on this...