There is a single, unique, unambiguously-correct answer to “how should we penalize for model complexity?”: calculate the probability of each model, given the data.
Wouldn’t it be more accurate to say that the penalty for model complexity resides in the prior, not the likelihood?
The second model has one free parameter (the bias) which we can use to fit the data, but it’s more complex and prone to over-fitting. When we integrate over that free parameter, it will fit the data poorly over most of the parameter space—thus the “penalty” associated with free parameters in general.
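The “automatic” penalty from integrating over the free parameter can be made concrete with a quick sketch. Below is a minimal, hypothetical coin-flip example (not from the original post): the fair-coin model assigns probability 1/2 to every flip, while the biased-coin model puts a uniform prior on the bias p and integrates it out, which for a sequence of h heads and t tails gives h!·t!/(h+t+1)! via the Beta integral.

```python
from math import factorial

def fair_marginal(heads, tails):
    # P(sequence | fair coin): every flip has probability 1/2
    return 0.5 ** (heads + tails)

def biased_marginal(heads, tails):
    # P(sequence | biased coin, uniform prior on bias p):
    # integral over [0,1] of p^h * (1-p)^t dp = h! t! / (h+t+1)!
    return factorial(heads) * factorial(tails) / factorial(heads + tails + 1)

# Even data: the zero-parameter fair model beats the one-parameter model,
# because most of the biased model's parameter space fits the data poorly.
print(fair_marginal(10, 10))    # ~9.5e-7
print(biased_marginal(10, 10))  # ~2.6e-7

# Lopsided data: the free parameter now earns its keep.
print(fair_marginal(18, 2))     # ~9.5e-7
print(biased_marginal(18, 2))   # ~2.5e-4
```

Note there is no explicit complexity term anywhere: the penalty falls out of averaging the likelihood over the whole parameter space instead of cherry-picking the best-fitting value.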
But your choice of prior was arbitrary. You chose to privilege the unbiased coin hypothesis by assigning fully half of your prior probability to the case where the coin is fair, a case your other model assigns 0 probability to!
So your real answer to how to penalize model complexity is: assign a lower prior to complex models. (Actually, in this case they’re about equally complex, but whatever.) I find this answer a bit unsatisfying, because in some cases my prior belief is that a phenomenon is going to be quite complex. Yet overfitting is still possible in those cases.
At least the way I think about it, the main role of Bayesian model testing is to compare gears-level models. A prior belief like “this phenomenon is going to be quite complex” doesn’t have any gears in it, so it doesn’t really make sense to think about in this context at all. I could sort-of replace “it’s complex” with a “totally ignorant” uniform-prior model (the trivial case of a gears-level model with no gears), but I’m not sure that captures quite the same thing.
Anyway, I recommend reading the second post on Wolf’s Dice. That should give a better intuition for why we’re privileging the unbiased coin hypothesis here. The prior is not arbitrary—I chose it because I actually do believe that most coins are (approximately) unbiased. The prior is where the (hypothesized) gears are: in this case, the hypothesis that most coins are approximately unbiased is a gear.
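To illustrate how the 50/50 prior over the two models combines with the data, here is a small self-contained sketch (a hypothetical example, not taken from the Wolf’s Dice posts): posterior odds are just prior odds times the ratio of marginal likelihoods, with the biased model’s marginal likelihood computed under a uniform prior on the bias.

```python
from math import factorial

def posterior_fair(heads, tails, prior_fair=0.5):
    # Marginal likelihood of the observed sequence under each model.
    p_fair = 0.5 ** (heads + tails)
    # Biased model: uniform prior on bias p, integrated out analytically.
    p_biased = factorial(heads) * factorial(tails) / factorial(heads + tails + 1)
    # Bayes' rule over models: P(fair | data) proportional to P(data | fair) * P(fair).
    num = p_fair * prior_fair
    return num / (num + p_biased * (1 - prior_fair))

print(posterior_fair(10, 10))  # ~0.79: even data favors the fair-coin gear
print(posterior_fair(18, 2))   # <0.01: lopsided data overcomes the 50% prior
```

The point of the 50% prior mass on the exactly-fair hypothesis is that it encodes an actual belief about coins; with enough lopsided data the biased model wins anyway, so nothing is being unfairly protected.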