Monteith et al. (2011) (linked in the OP) is an interesting read on the subject. They discuss a puzzle: why does the theoretically optimal Bayesian method for dealing with multiple models (that is, Bayesian model averaging) tend to underperform ad-hoc methods (e.g. “bagging” and “boosting”) in empirical tests? It turns out that “Bayesian model averaging struggles in practice because it accounts for uncertainty about which model is correct but still operates under the assumption that only one of them is.” The solution is simply to modify the Bayesian model averaging process so that it integrates over combinations of models rather than over individual models. (They call this Bayesian model combination, to distinguish it from “normal” Bayesian model averaging.) In their tests, Bayesian model combination beats out bagging, boosting, and “normal” Bayesian model averaging.
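To make the distinction concrete, here is a toy sketch (my own construction, not the paper’s algorithm): BMA keeps a posterior over which single component model is correct, while BMC keeps a posterior over mixture weights across the components. The three fixed models, the coin-flip data, and the Dirichlet prior over weights below are all illustrative assumptions.

```python
# Toy sketch, not Monteith et al.'s algorithm: three fixed "models" of a coin,
# each asserting a different P(heads); the true P(heads) is not in the set.
import numpy as np

rng = np.random.default_rng(0)
model_probs = np.array([0.2, 0.5, 0.8])   # each component model's P(heads)
flips = rng.random(200) < 0.65            # data from a coin none of the models match

def log_lik(p_heads, flips):
    """Log-likelihood of the observed flips under a fixed P(heads)."""
    heads = flips.sum()
    return heads * np.log(p_heads) + (len(flips) - heads) * np.log(1.0 - p_heads)

# Bayesian model averaging: posterior over which *single* model is correct
# (uniform prior over the three models).
log_post = np.array([log_lik(p, flips) for p in model_probs])
post = np.exp(log_post - log_post.max())
post /= post.sum()
bma_pred = post @ model_probs

# Bayesian model combination: posterior over *mixture weights* w on the simplex,
# approximated here by sampling w from a uniform Dirichlet prior.
W = rng.dirichlet(np.ones(3), size=5000)
log_post_w = np.array([log_lik(w @ model_probs, flips) for w in W])
post_w = np.exp(log_post_w - log_post_w.max())
post_w /= post_w.sum()
bmc_pred = post_w @ (W @ model_probs)

print(f"BMA predictive P(heads): {bma_pred:.3f}")  # pulled toward a single model
print(f"BMC predictive P(heads): {bmc_pred:.3f}")  # can settle near the true 0.65
```

With the true coin outside the component set, the posterior over single models piles onto the nearest one, while the posterior over weights can land on mixtures that actually match the data.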
Bayesian model averaging struggles in practice because it accounts for uncertainty about which model is correct but still operates under the assumption that only one of them is.
Wait, what? That sounds significant. What does more than one model being correct mean?
Speculation before I read the paper:
I guess that’s like modelling a process as the superposition of sub-processes? That would give the model more degrees of freedom with which to fit the data. Would we expect that to do strictly better than the mutual exclusion assumption, or would it require more data to overcome the extra degrees of freedom?
If a single theory is correct, the mutex assumption will update toward it faster because it gives it a higher prior; the probability-distribution-over-averages would get there more slowly, but it still assigns a substantial prior to theories close to the true one.
On the other hand, if a combination is a better model, either because the true process is a superposition, or we are modelling something outside of our model-space, then a combination will be better able to express it. So the mutex assumption will be forced to put all weight on a bad nearby theory, effectively updating in the wrong direction, whereas the combination won’t lose as much because it contains more accurate models. I wonder if averaging over combinations will beat the mutex assumption at every step?
Also interesting to note that the mutex assumption’s model space is a subset of the combination assumption’s model space, so if you are unsure which is correct, you can just add more weight to the mutex models in the combination prior and use that.
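To make that nesting concrete (my notation, not the paper’s): write the component models as m_1, …, m_K and a combination as a weight vector w on the probability simplex; each mutex hypothesis is then just a vertex of that simplex.

```latex
% A combination is a point $w$ on the simplex $\Delta^{K-1}$:
\[
  p(y \mid w, D) \;=\; \sum_{k=1}^{K} w_k \, p(y \mid m_k, D)
\]
% "Model $k$ alone is correct" is the vertex $w = e_k$, so the mutex hypotheses
% are exactly the $K$ corners of the simplex that the combination integrates over.
% A prior such as
\[
  \pi(w) \;=\; \lambda \sum_{k=1}^{K} \tfrac{1}{K}\, \delta_{e_k}(w)
  \;+\; (1 - \lambda)\, \mathrm{Dirichlet}(w \mid \alpha)
\]
% puts mass $\lambda$ on "exactly one model is correct" and $1-\lambda$ on genuine
% mixtures, i.e. the "add more weight to the mutex models in the prior" idea.
```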
Now I’ll read the paper. Let’s see how I did.

…when the Data Generating Model (DGM) is not one of the component models in the ensemble, BMA tends to converge to the model closest to the DGM rather than to the combination closest to the DGM [9]. He also empirically noted that, in the cases he studied, when the DGM is not one of the component models of an ensemble, there usually existed a combination of models that could more closely replicate the behavior of the DGM than could any individual model on their own.
Yup. Exactly what I thought.
Versus my
if a combination is a better model, either because the true process is a superposition, or we are modelling something outside of our model-space, then a combination will be better able to express it. So the mutex assumption will be forced to put all weight on a bad nearby theory,
“What does more than one model being correct mean?”
Maybe something like string theory? The 5 lesser theories look totally different... and then turn out to transform into one another when you fiddle with the coupling constant.
Seeing the words “string” and “fiddle” on top of each other primed me to think of their literal meanings, which I wouldn’t otherwise have consciously thought of.
“Bayesian model averaging struggles in practice because it accounts for uncertainty about which model is correct but still operates under the assumption that only one of them is.”
Perhaps they should say “the assumption that exactly one model is perfectly correct”?
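That reading matches where the assumption enters the math (standard BMA bookkeeping, in my notation): the sum rule behind model averaging is only exact if the candidate models are mutually exclusive and exhaustive.

```latex
% BMA's predictive distribution:
\[
  p(y \mid D) \;=\; \sum_{k=1}^{K} p(y \mid m_k, D)\, p(m_k \mid D)
\]
% The identity treats $m_1, \dots, m_K$ as mutually exclusive and exhaustive, so
% $p(m_k \mid D)$ is the probability that $m_k$ is *the* true model -- "exactly
% one model is perfectly correct."  When none of them is, that posterior simply
% concentrates on whichever single model comes closest.
```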
It’s an interesting exercise to look for the Bayes structure in this (and other) advice.
At least I find it helpful to tie things down to the underlying theory. Otherwise I find it easy to misinterpret things.
Good article.
Yup! Practical advice is best when it’s backed by deep theories.