I suspect final forecasts that are “good enough” are often shockingly simple, and the hard part of a forecast is building/extracting a “correct enough” simplified model of reality and getting a small amount of the appropriate data that you actually need.
I think that it’s often true that good forecasts can be simple, but I also think that the gulf between “good enough” and “very good” usually contains a perverse effect, where slightly more complexity makes the model perhaps slightly better in expectation, but far worse at properly estimating variance or accounting for uncertainties outside the model. That means that, for the purpose of forecasting, you get much worse (as measured by Brier score) before you get better.
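To make the Brier-score point concrete, here is a toy sketch (the probabilities are mine, made up purely for illustration): a hedged forecast from a crude model loses far less when reality lands outside the model than a confident forecast from a more elaborate one.

```python
def brier(prob: float, outcome: int) -> float:
    """Brier score for a single binary forecast: (p - outcome)^2, lower is better."""
    return (prob - outcome) ** 2

outcome = 0  # the event the sophisticated model was confident about does not happen

hedged_simple_forecast = 0.6        # crude trend plus wide, honest uncertainty
confident_complex_forecast = 0.95   # detailed model, narrow uncertainty

print(brier(hedged_simple_forecast, outcome))      # 0.36
print(brier(confident_complex_forecast, outcome))  # 0.9025 -- much worse
```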
As a concrete example, this is seen when people forecast COVID deaths. They start with a simple linear trend, then say they don’t really think it’s linear, it’s actually exponential, so they roughly adjust their confidence and have appropriate uncertainties around a bad model. Then they get fancier and try using an SIR model that gives “the” answer, and the forecaster simulates 100 runs to create a distribution by varying R_0 within a reasonable range. That gives an uncertainty range and a very narrow resulting distribution, which the forecaster is now only willing to adjust narrowly, because their model accounts for the obvious sources of variance. Then schools are reopened, or treatment methods improve, or contact rates drop as people see case counts rise, and the model’s assumptions are invalidated in a different way than was expected.
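As a rough sketch of the kind of exercise I mean (not anyone’s actual COVID model; the R_0 range, infectious period, and IFR below are all invented illustration values), varying only R_0 across 100 runs of a minimal SIR model produces a deceptively tight spread, because everything else the model assumes is held fixed:

```python
import numpy as np

def sir_deaths(r0, days=120, pop=1_000_000, i0=100,
               infectious_period=10.0, ifr=0.006):
    """Crude discrete-time SIR run; deaths approximated as IFR x everyone ever infected."""
    beta, gamma = r0 / infectious_period, 1.0 / infectious_period
    s, i, r = pop - i0, float(i0), 0.0
    for _ in range(days):
        new_inf = beta * s * i / pop
        new_rec = gamma * i
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return ifr * (r + i)

rng = np.random.default_rng(0)
deaths = [sir_deaths(r0) for r0 in rng.uniform(2.2, 2.8, size=100)]
print(np.percentile(deaths, [5, 50, 95]))  # a tight band, since only R_0 varies
```

The spread here only reflects uncertainty inside the model; school reopenings, behaviour change, and better treatment move parameters the simulation never varies.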
I think that while consulting many models is a good reminder, the hard part is choosing which model(s) to use in the end. I think your ensemble of models can often do much better than an unweighted average of all the models you’ve considered, since some models are a) much less applicable, b) much more brittle, c) much less intuitively plausible, or d) much too strongly correlated with the other models you have.
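A toy sketch of what I mean by not averaging blindly (every model name, number, and weight here is invented for illustration): down-weighting the model that no longer applies and the near-duplicate that is strongly correlated with another model shifts the ensemble forecast substantially relative to the unweighted mean.

```python
# Hypothetical point forecasts (deaths over some horizon) from four models.
forecasts = {
    "linear_trend": 1200.0,  # (a) much less applicable once growth is clearly non-linear
    "exponential":  3100.0,
    "sir_run_a":    2900.0,
    "sir_run_b":    2950.0,  # (d) nearly a duplicate of sir_run_a, so highly correlated
}

# An unweighted average implicitly treats every model as equally applicable and independent.
unweighted = sum(forecasts.values()) / len(forecasts)

# Judgment-based weights that downgrade the inapplicable and the redundant models.
weights = {"linear_trend": 0.1, "exponential": 0.45, "sir_run_a": 0.35, "sir_run_b": 0.1}
weighted = sum(weights[k] * v for k, v in forecasts.items()) / sum(weights.values())

print(f"unweighted mean: {unweighted:.0f}")  # 2538
print(f"weighted fusion: {weighted:.0f}")    # 2825
```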
As I said to Luke in a comment on his link to an excellent earlier post that discusses this, I think there is far more to be said about how to do model fusion, and I agree with the point in his paper that ensembles which simply average models are better than single models, but still worse than actually figuring out what each model tells you.