Thanks for this. I’m trying to get an intuition on how this works.
My mental picture is to imagine the likelihood function, as a function of θ, for the more complex model. The simpler model is then the equivalent of a square (rectangular) function whose height is its likelihood and whose width is 1.
The relative areas under the two graphs then reflect the likelihoods of the models, so picturing the relative maximum likelihoods, and how sharp the peak of the more complex model is, gives an impression of the Bayes factor.
Does that work? Or is there a better mental model?
It’s kind of tricky to picture it in terms of areas, because the two models usually don’t live in the same space—one is an integral over more dimensions than the other. You can sometimes shoehorn it into the same space, e.g. by imagining that the likelihood for one model is constant with respect to some of the θ’s—it sounds like that’s what you’re trying to do here. That’s not wrong, and you can think of it that way if you want, but personally I find it a bit awkward.
Here’s a similar (but hopefully more natural) intuition; a quick code sketch of the recipe follows the list:
Pick a model
Generate a θ value at random from that model’s prior
Compute the likelihood of the data given that θ value
Take the average likelihood from this process, and you have P[data|model]
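Here's a minimal sketch of that recipe in code. Everything concrete in it is my own illustrative assumption rather than anything from the post: made-up data of 8 heads in 10 coin flips, a "fair coin" model as the simple model, and a "coin with unknown bias and a Uniform(0, 1) prior" model as the complex one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up illustrative data: 8 heads out of 10 flips.
heads, flips = 8, 10

def likelihood(p):
    # Binomial likelihood of the data given coin bias p.
    # (The binomial coefficient is omitted; it's the same for both models, so it cancels in the Bayes factor.)
    return p ** heads * (1 - p) ** (flips - heads)

# Simple model: the coin is fair. No free parameter, so nothing to average over.
p_data_given_simple = likelihood(0.5)

# Complex model: unknown bias with a Uniform(0, 1) prior.
# Step 1: generate θ values at random from the model's prior.
theta = rng.uniform(0.0, 1.0, size=100_000)
# Steps 2-3: compute the likelihood of the data for each sampled θ, then average.
# The average is a Monte Carlo estimate of P[data|model].
p_data_given_complex = likelihood(theta).mean()

print("P[data|simple]  ≈", p_data_given_simple)
print("P[data|complex] ≈", p_data_given_complex)
print("Bayes factor (complex : simple) ≈", p_data_given_complex / p_data_given_simple)
```

(With these made-up numbers the estimate comes out to a Bayes factor of roughly 2 in favor of the unknown-bias model.)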
Compare this to the traditional maximum-likelihood approach: max likelihood says “what’s the highest likelihood which I could possibly assign to this data?”, whereas the Bayes factor approach says “what’s the average likelihood I’d assign to this data, given my prior knowledge?”. It’s analogous to the distinction between a maximax decision rule (“take the decision with the best available best-case outcome”) and a Bayesian decision rule (“take the decision with the best available average outcome”).
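To make that contrast concrete with the same made-up coin example as above, the difference is literally max versus mean over θ:

```python
import numpy as np

heads, flips = 8, 10  # same made-up data as the sketch above

def likelihood(p):
    return p ** heads * (1 - p) ** (flips - heads)

p_grid = np.linspace(0.0, 1.0, 10_001)     # fine grid over the parameter
best_case = likelihood(p_grid).max()        # max-likelihood: the best θ I could possibly pick
average_case = likelihood(p_grid).mean()    # Bayes-factor style: the average θ under the uniform prior

print(best_case, average_case)  # the best case is always at least as large as the average
```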
That said, the above intuition is still somewhat ad-hoc. Personally, I intuit Bayes factors with the same intuitions I use for any other probability calculations. It’s just a natural application of the usual probability ideas to the question of “what’s the right model, given my background knowledge and all this data?”. In particular, the intuitions involved are:
Bayes’ Rule
Gears-level understanding, usually manifesting as causal models. In this case, the causal model is θ → data, with the individual data points independent conditional on θ. I then reflexively think “if I knew the values of all the nodes in this causal model, then I could easily compute the probability of the whole thing”—i.e. I can easily compute P[data,θ|model].
Summing over unknown values: I need something like P[data|model] (in which I don’t know θ), and it would be easy to compute if I knew θ. So, I break it into a sum over θ: P[data|model] = ∑_θ P[data, θ|model]. Intuitively, I’m considering all the possible ways the world could be, while still consistent with my model.
… and of course these are all standard probability-related reflexes with multiple possible intuitions underneath them.
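For concreteness, here's how those reflexes fit together in symbols, writing the data points as x_1, …, x_n and using the conditional independence from the causal model above:

P[model|data] ∝ P[model] · P[data|model]   (Bayes’ Rule over models)

P[data|model] = ∑_θ P[data, θ|model] = ∑_θ P[θ|model] ∏_i P[x_i|θ, model]   (sum over the unknown θ, factored via the causal model)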