I can’t say what O’Hagan had in mind, but the reasons I have to be skeptical of results involving Bayesian model averaging are that model averaging makes sense only if you’ve been very, very careful in setting up the models, and you’ve also been very, very careful in specifying the prior distributions these models use for their parameters. For some problems, being very, very careful may be beyond the capacity of human intellect.
Regarding the models: For complex problems, it may be that none of the models you have defined represent the real phenomenon well, even approximately. But the posterior model probabilities used in Bayesian model averaging assume that the true model is among those being considered.
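To be concrete about the quantities involved (the notation here is mine, not anything O’Hagan wrote): with candidate models \(M_1, \dots, M_K\), each having its own parameters \(\theta_k\), the model-averaged prediction and the model weights are

\[
p(y_{\text{new}} \mid D) \;=\; \sum_{k=1}^{K} p(y_{\text{new}} \mid D, M_k)\, p(M_k \mid D),
\qquad
p(M_k \mid D) \;\propto\; p(M_k) \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k .
\]

The integral is model \(k\)’s marginal likelihood, and these posterior model probabilities are the weights that implicitly treat one of the \(M_k\) as the true data-generating process.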
If the true model really is among the candidates (and the models have reasonable priors over their parameters), then model averaging, and its limiting case of model selection when the posterior probability of one model is close to one, is a sensible thing to do. That’s because the true model is always the best one to use, regardless of your purpose in doing inference.
However, if you’re actually using a set of models that are all grossly inadequate, then which of these terrible models is best to use (or what weights it’s best to average them with) depends on your purpose. For example, with non-linear regression models relating y to x, you might be interested in predicting y at new values of x that are negative, or in predicting y at new values of x that are positive. If you’ve got the true model, it’s good for both positive and negative x. But if all you’ve got are bad models, it may be that the one that’s best for negative x is not the same as the one that’s best for positive x. Bayesian model averaging takes no account of your purpose, and so can’t possibly do the right thing when none of the models are good.
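Here is a minimal made-up sketch of that situation (my own toy example, not anything from O’Hagan: a piecewise-linear truth and two deliberately bad models, fit by least squares as a rough stand-in for what a large-data posterior would concentrate on):

    # Hypothetical illustration: the truth is y = max(x, 0), and the two
    # candidate models, a line through the origin and a constant, are both
    # wrong.  Which one predicts better depends on whether you care about
    # negative x or positive x.
    import numpy as np

    rng = np.random.default_rng(0)

    def truth(x):
        # the real phenomenon: neither candidate model can represent it
        return np.maximum(x, 0.0)

    x_train = rng.uniform(-2.0, 2.0, 500)
    y_train = truth(x_train) + rng.normal(0.0, 0.1, x_train.size)

    # Model A: y = b*x (line through the origin); Model B: y = c (a constant)
    b = (x_train * y_train).sum() / (x_train ** 2).sum()
    c = y_train.mean()

    for label, lo, hi in [("negative x", -2.0, 0.0), ("positive x", 0.0, 2.0)]:
        x_test = np.linspace(lo, hi, 1000)
        mse_a = np.mean((b * x_test - truth(x_test)) ** 2)
        mse_b = np.mean((c - truth(x_test)) ** 2)
        winner = "A (line)" if mse_a < mse_b else "B (constant)"
        print(f"{label}: MSE A = {mse_a:.3f}, MSE B = {mse_b:.3f} -> better: {winner}")

In this setup the constant comes out better for negative x and the line better for positive x, but any model-averaging weights would be a single pair of numbers determined by overall fit to the training data, with no reference to which region you actually care about.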
Regarding priors: The problem is not priors for the models themselves (assuming there aren’t huge numbers of them), but rather priors for the parameters within each of the models. (Note that different models may have different sets of parameters, so these priors are not necessarily parallel between models.) Once you have a fairly large amount of data, it’s often the case that the exact prior for parameters that you choose isn’t crucial for inference within a model—the posterior distribution for parameters may vary little over a wide class of reasonable priors (that aren’t highly concentrated in some small region). You can often even get away with using an “improper” prior, such as a uniform distribution over the real numbers (which doesn’t actually exist, of course).
But for computing model probabilities for use in Bayesian model averaging, the priors used for the parameters of each model are absolutely crucial. Using an overly vague prior, in which probability is spread over a wide range of parameter values that mostly don’t fit the data very well, will give a lower model probability than a better-considered prior that puts less probability on parameter values that don’t fit the data well (and that weren’t really plausible even a priori). Using an improper prior for the parameters will generally result in the model probability being zero, since such a prior puts essentially zero probability on the parameter values that fit the data.
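Both points show up in a small numerical sketch (again my own toy example): a normal model for the data with unknown mean, where widening the prior on the mean barely moves the posterior within the model, but sharply lowers the marginal likelihood that model averaging would use as this model’s weight.

    # Hypothetical illustration: y_i ~ N(mu, 1) with prior mu ~ N(0, tau^2),
    # for several choices of tau.  The within-model posterior is insensitive
    # to tau, but the marginal likelihood p(data) is not.
    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma = 100, 1.0
    y = rng.normal(0.3, sigma, n)            # data actually generated with mean 0.3

    mu_grid = np.linspace(-2.0, 2.0, 4001)   # fine grid; the likelihood is ~0 outside
    dmu = mu_grid[1] - mu_grid[0]

    # log-likelihood of the whole data set at each candidate value of the mean
    loglik = (-0.5 * n * np.log(2 * np.pi * sigma**2)
              - 0.5 * ((y[:, None] - mu_grid) ** 2).sum(axis=0) / sigma**2)

    for tau in [1.0, 10.0, 1000.0]:          # prior: mu ~ N(0, tau^2)
        logprior = -0.5 * np.log(2 * np.pi * tau**2) - 0.5 * mu_grid**2 / tau**2
        logjoint = loglik + logprior
        m = logjoint.max()
        w = np.exp(logjoint - m)                    # stable exponentiation
        log_marginal = m + np.log(w.sum() * dmu)    # log p(data | this prior)
        post_mean = (mu_grid * w).sum() / w.sum()   # posterior mean of mu
        print(f"tau = {tau:6.1f}: posterior mean = {post_mean:.3f}, "
              f"log marginal likelihood = {log_marginal:.2f}")

The posterior mean is essentially identical under all three priors, while the log marginal likelihood drops by about 7 (roughly a factor of a thousand in the model’s weight) going from tau = 1 to tau = 1000, even though the data and the likelihood haven’t changed at all.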
Especially when the parameter space is high-dimensional, it can be quite difficult to fully express your prior knowledge about which parameter values are plausible. With a lot of thought, you may be able to do fairly well. But if you need to think hard, how can you tell whether you thought equally hard for each model? And just thinking equally hard isn’t really enough: you need to have actually pinned down the prior that really expresses your prior knowledge, for every one of the models. Most people doing Bayesian model averaging haven’t done that.
The purpose of having priors is to compensate for a lack of data, so that a posteriori you are at least closer to the true model, and also to speed things up, since model averaging would take longer than training a single model. Also, it’s not that the true model is within the ensemble of models, but that you know beforehand that getting the true model is rather difficult, whether from lack of data or just the sheer complexity of the true model and the number of parameters. If you have enough data, playing around with different priors wouldn’t make any meaningful difference. I think when people talk about the true model, what they really mean is how close they are to the true model. There isn’t really a way to know. Take a coin flip, for example. You only have 50-50 if your flips are perfect, your coin is perfectly uniform, there’s no wind, and so on. These details are neglected because they aren’t really important theoretically, but the true model isn’t theoretically perfect either, since it is supposed to reflect reality.