I am a little confused by what x is in your statement, and by why you think we can’t compute the likelihood or the posterior predictive. In most real problems we can’t compute the posterior in closed form, but we can draw from it and thus approximate it via MCMC.
Sorry! Bad notation… What I meant was that we can’t compute the conditional posterior predictive density $p(\tilde{y} \mid \tilde{x}, D)$, where $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. We can compute $p(\tilde{y} \mid \tilde{x}, D, M)$, where $M$ is some model, approximately using MCMC by drawing samples from the parameter space of $M$, i.e. we can approximate the integral below:
$$p(\tilde{y} \mid \tilde{x}, D, M) = \int_{\theta \in \Theta} p(\tilde{y} \mid \tilde{x}, M, \theta)\, p(\theta \mid M, D)\, d\theta$$
where $\Theta$ is the parameter space of $M$. But the quantity that we are interested in is $p(\tilde{y} \mid \tilde{x}, D)$, not $p(\tilde{y} \mid \tilde{x}, D, M)$ for one specific model, i.e. we need to marginalise over the unknown model. How can we do this?
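For the model-specific density, here is a minimal sketch of the Monte Carlo approximation I have in mind, assuming a hypothetical linear regression $y \sim N(a + bx, \sigma)$ as $M$ and posterior draws already obtained by MCMC:

```python
from scipy import stats

# Hypothetical model M: y ~ N(a + b*x, sigma).
# `draws` holds S posterior samples of (a, b, sigma), e.g. from MCMC, so the
# integral is approximated by the Monte Carlo average
#     p(y_new | x_new, D, M) ≈ (1/S) * sum_s p(y_new | x_new, M, theta_s)
def posterior_predictive_density(y_new, x_new, draws):
    a, b, sigma = draws["a"], draws["b"], draws["sigma"]  # arrays of length S
    dens = stats.norm.pdf(y_new, loc=a + b * x_new, scale=sigma)
    return dens.mean()
```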
You are correct, we have to assume a model, just like we have to assume a prior. And strictly speaking the model is wrong and the prior is wrong :). But we can check how well the posterior predictive describes the data, to get a feel for how bad our model is :)
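As a minimal sketch of such a check, reusing the hypothetical regression model and `draws` from above (the test statistic is just an illustrative choice):

```python
import numpy as np

# Simulate replicated data sets from the posterior predictive and compare a
# test statistic (here: the standard deviation of y) with its observed value.
# A tail probability near 0 or 1 suggests the model misfits that aspect.
def ppc_pvalue(x, y, draws, seed=0):
    rng = np.random.default_rng(seed)
    a, b, sigma = draws["a"], draws["b"], draws["sigma"]
    t_obs = y.std()
    t_rep = np.array([rng.normal(a[s] + b[s] * x, sigma[s]).std()
                      for s in range(len(a))])
    return (t_rep >= t_obs).mean()
```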
Ignoring the practical problems of Bayesian model averaging, isn’t assuming that either M1, M2, or M3 is true better than assuming that one particular model M is true? So Bayesian model averaging is always better, right (if it is practically feasible)?
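For concreteness, by model averaging I mean computing

$$p(\tilde{y} \mid \tilde{x}, D) = \sum_{k=1}^{3} p(\tilde{y} \mid \tilde{x}, D, M_k)\, p(M_k \mid D),$$

with the posterior model probabilities $p(M_k \mid D)$ obtained from the marginal likelihoods.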
If there are 3 competing models, then ideally you can make a larger model where each submodel is realized by a specific parameter combination.

If M2 is simply M1 with an extra parameter b2, then you should give b2 a strong prior concentrated at zero, so that M1 is recovered as a special case. If M3 is M1 with one parameter transformed, then you should add a parameter that interpolates between the two parameterisations, so you can learn, for example, that an interpolation weight between 40% and 90% describes the data better than either endpoint (see the sketch below).

If it’s impossible to translate between models like this, then you can still do model averaging, but it’s a sign that you don’t understand your data.
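A minimal sketch of such a nesting, under assumed forms for the submodels (the quadratic term for M2 and the log transform for M3 are purely illustrative):

```python
import numpy as np
from scipy import stats

# Hypothetical submodels, all nested in one supermodel:
#   M1: y ~ N(a + b1*x, sigma)
#   M2: M1 plus an extra term b2*x^2   -> recovered when b2 = 0
#   M3: M1 with x log-transformed      -> recovered when lam = 1 (needs x > 0)
def log_posterior(params, x, y):
    a, b1, b2, lam, log_sigma = params
    sigma = np.exp(log_sigma)
    x_mix = (1 - lam) * x + lam * np.log(x)           # interpolates M1 <-> M3
    mu = a + b1 * x_mix + b2 * x ** 2
    log_lik = stats.norm.logpdf(y, mu, sigma).sum()
    log_prior = (stats.norm.logpdf(b2, 0.0, 0.1)      # strong prior: b2 near 0
                 + stats.beta.logpdf(lam, 2.0, 2.0))  # interpolation weight
    return log_lik + log_prior
```

Running MCMC on this log posterior then lets the data tell you how close the fit sits to each special case.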
Yes, this is usually the right approach—use a single, more complex, model that has the various models you were considering as special cases. It’s likely that the best parameters of this extended model won’t actually turn out to be one of the special cases. (But note that this approach doesn’t necessarily eliminate the need for careful consideration of the prior, since unwise priors for a single complex model can also cause problems.)
However, there are some situations where discrete models make sense. For instance, you might be analysing old Roman coins, and be unsure whether they were all minted in one mint, or in two (or three, …) different mints. There aren’t really any intermediate possibilities between one mint and two. Or you might be studying inheritance of two genes, and be considering two models in which they are either on the same chromosome or on different chromosomes.
Good points, but can’t you still solve the discrete problem with a single model and a stick-breaking prior on the number of mints?
If you’re thinking of a stick-breaking prior such as a Dirichlet process mixture model, they typically produce an infinite number of components (which would be mints, in this case), though of course only a finite number will be represented in your finite data set. But we know that the number of mints producing coins in the Roman Empire was finite. So that’s not a reasonable prior (though of course you might sometimes be able to get away with using it anyway).
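One alternative is to keep the prior on the number of mints $K$ explicitly finite. A minimal sketch, assuming the marginal likelihood $p(D \mid K)$ of each finite mixture can be computed or approximated (e.g. by bridge sampling):

```python
import numpy as np

# log_marglik[k] = log p(D | K = k+1) for K = 1..K_max mints (hypothetical
# inputs), log_prior[k] = log p(K = k+1) for a finite prior on K.
def posterior_over_K(log_marglik, log_prior):
    log_post = np.asarray(log_marglik) + np.asarray(log_prior)
    log_post -= log_post.max()      # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()        # p(K | D) over the finite grid
```

The predictive for a new coin is then the $p(K \mid D)$-weighted average of the per-$K$ predictives, i.e. exactly the discrete model averaging discussed above.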
Ahhh… that makes a lot of sense. Thank you!