From Laplace to BIC
The previous post outlined Laplace approximation, one of the most common tools used to approximate hairy probability integrals. In this post, we’ll use Laplace approximation to derive the Bayesian Information Criterion (BIC), a popular complexity penalty method for comparing models with more free parameters to models with fewer free parameters.
The BIC is pretty simple:
Start with the maximum log likelihood $\ln P[\text{data}|\theta_{\max}]$.
Subtract $\frac{k}{2}\ln N$ to penalize for complexity, where $N$ is the number of data points and $k$ is the number of free parameters (i.e. the dimension of $\theta$).
Thus: $\ln P[\text{data}|\theta_{\max}] - \frac{k}{2}\ln N$. Using this magic number, we compare any two models we like.
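Concretely, the whole recipe fits in a few lines of code. Here's a minimal sketch in Python, using a hypothetical Gaussian toy model with unknown mean and scale (not from the post itself); note that it follows this post's convention, where higher scores are better, rather than the more common $k\ln N - 2\ln\hat{L}$ form.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical toy model: a Gaussian with unknown mean and scale, fit to N points.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_log_likelihood(params, x):
    mu, log_sigma = params              # unconstrained parameterization
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

fit = minimize(neg_log_likelihood, x0=np.zeros(2), args=(data,))
max_log_lik = -fit.fun                  # ln P[data | theta_max]

k = 2                                   # free parameters: mu and sigma
N = len(data)
bic_score = max_log_lik - (k / 2) * np.log(N)   # higher is better under this convention
print(f"BIC score: {bic_score:.2f}")
```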
Let’s derive that.
BIC Derivation
As usual, we'll start from $P[\text{data}|\text{model}]$. (Caution: don't forget that what we really care about is $P[\text{model}|\text{data}]$; we can jump to $P[\text{data}|\text{model}]$ only as long as our priors are close enough to be swamped by the evidence.) This time, we'll assume that we have $N$ independent data points $x_i$, all with the same unobserved parameters—e.g. $N$ die rolls with the same unobserved biases. In that case, we have
$$P[\text{data}|\text{model}] = \int_\theta P[\text{data}|\theta]\, dP[\theta] = \int_\theta \prod_{i=1}^N P[x_i|\theta]\, p[\theta]\, d\theta$$
Next, apply Laplace approximation and take the log.
$$\ln P[\text{data}|\text{model}] \approx \sum_i \ln P[x_i|\theta_{\max}] + \ln p[\theta_{\max}] + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln\det(H)$$
where the Hessian matrix $H$ is given by
$$H = \left.\frac{d^2}{d\theta^2}\ln P[\text{data}|\theta]\right|_{\theta_{\max}} = \sum_i \left.\frac{d^2}{d\theta^2}\ln P[x_i|\theta]\right|_{\theta_{\max}}$$
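To make the formula concrete, here's a rough sketch for a hypothetical coin-flip model ($N$ flips with unknown bias $\theta$ and a uniform prior), chosen because its exact evidence is a Beta function, so we can check the Laplace approximation against it. One caveat on signs: the Hessian at a maximum is negative (semi)definite, so the determinant term is evaluated as $\det(-H)$.

```python
import numpy as np
from scipy.special import gammaln

# Hypothetical toy model: N coin flips with unknown bias theta and a uniform
# prior p[theta] = 1 on [0, 1].  The exact evidence is a Beta function, so we
# can check the Laplace formula against it.
n1, n0 = 60, 140                        # observed heads / tails
N = n1 + n0
k = 1                                   # one free parameter (theta)

theta_max = n1 / N                      # maximum-likelihood bias
log_lik_max = n1 * np.log(theta_max) + n0 * np.log(1 - theta_max)
log_prior_max = 0.0                     # ln p[theta_max] = ln 1 for the uniform prior

# Second derivative of the log likelihood at theta_max (a 1x1 Hessian).
# It is negative at a maximum, so the determinant term uses det(-H).
H = -(n1 / theta_max**2 + n0 / (1 - theta_max)**2)

laplace = (log_lik_max + log_prior_max
           + (k / 2) * np.log(2 * np.pi)
           - 0.5 * np.log(-H))

# Exact log evidence: ln B(n1+1, n0+1) = ln Gamma(n1+1) + ln Gamma(n0+1) - ln Gamma(N+2)
exact = gammaln(n1 + 1) + gammaln(n0 + 1) - gammaln(N + 2)
print(f"Laplace: {laplace:.3f}   exact: {exact:.3f}")
```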
Now for the main trick: how does each term scale as the number of data points N increases?
The max log likelihood $\sum_i \ln P[x_i|\theta_{\max}]$ is a sum over data points, so it should scale roughly proportionally to $N$.
The log prior density $\ln p[\theta_{\max}]$ and the $\frac{k}{2}\ln(2\pi)$ term are constant with respect to $N$.
H is another sum over data points, so it should also scale roughly proportionally to N.
Let's go ahead and write $H$ as $N \cdot \left(\frac{1}{N}H\right)$, to pull out the $N$-dependence. Then, since $\det(cA) = c^k \det(A)$ for a $k \times k$ matrix $A$:
$$\ln\det\left(N \cdot \tfrac{1}{N}H\right) = \ln\det\left(\tfrac{1}{N}H\right) + k\ln N$$
so we can rewrite our Laplace approximation as
$$\ln P[\text{data}|\text{model}] \approx \sum_i \ln P[x_i|\theta_{\max}] + \ln p[\theta_{\max}] + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln\det\left(\tfrac{1}{N}H\right) - \frac{k}{2}\ln N = \ln P[\text{data}|\theta_{\max}] - \frac{k}{2}\ln N + O(1)$$
where the $O(1)$ collects all the terms which are roughly constant with respect to $N$. The first two terms of that final expression are the BIC.
In other words, the BIC is just the Laplace approximation, but ignoring all the terms which don’t scale up as the number of data points increases.
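As a sanity check on that claim, here's a sketch using the same hypothetical coin-flip model as above: evaluate the full Laplace expression and the BIC at increasing $N$ (with the fraction of heads held fixed) and watch the gap between them stay roughly constant while the $\frac{k}{2}\ln N$ penalty keeps growing.

```python
import numpy as np

# Same hypothetical coin-flip model as above: uniform prior, k = 1 parameter,
# so ln p[theta_max] = 0 and is omitted.  Scale up N with the fraction of
# heads held at 0.3 and compare the full Laplace expression with the BIC,
# which drops the N-independent terms.
k = 1
for N in [10, 100, 1_000, 10_000]:
    n1 = int(0.3 * N)
    n0 = N - n1
    theta_max = n1 / N
    log_lik_max = n1 * np.log(theta_max) + n0 * np.log(1 - theta_max)
    H = -(n1 / theta_max**2 + n0 / (1 - theta_max)**2)

    laplace = log_lik_max + (k / 2) * np.log(2 * np.pi) - 0.5 * np.log(-H)
    bic = log_lik_max - (k / 2) * np.log(N)

    # The gap should hover around a constant (the O(1) terms), while the
    # (k/2) ln N penalty keeps growing.
    print(f"N={N:>6}  Laplace={laplace:10.2f}  BIC={bic:10.2f}  gap={laplace - bic:6.2f}")
```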
When Does BIC Work?
What conditions need to hold for BIC to work? Let’s go back through the derivation and list out the key assumptions behind our approximations.
First, in order to jump from $P[\text{model}|\text{data}]$ to $P[\text{data}|\text{model}]$, our models should have roughly similar prior probabilities $P[\text{model}]$, i.e. within a few orders of magnitude of each other.
Second, in order for any point approximation to work, the posterior parameter distribution needs to be pointy and unimodal—most of the posterior probability mass must be near θmax. In other words, we need enough data to get a precise estimate of the unobserved parameters.
Third, we must have $N$ large enough that $\frac{k}{2}\ln N$ (the smallest term we're keeping) is much larger than the $O(1)$ terms we're throwing away.
That last condition is the big one. BIC is a large-$N$ approximation, so $N$ needs to be large for it to work. How large? That depends on how big $\ln\det\left(\frac{1}{N}H\right)$ is—$N$ needs to be exponentially larger than that. We'll see an example in the next post.
The next post will talk more about the relative advantages of BIC, Laplace, and exact calculation for comparing models. We'll see a concrete example of when BIC works and when it fails.