The robust beauty of improper linear models
It should come as no surprise to people on this list that models often outperform experts. But these are generally finely calibrated models, integrating huge amounts of data, so this seems less surprising. How can the poor experts compete against that?
But sometimes the models are much simpler than that, and still perform better. For instance, the models could be linear, rather than having higher order complexities. These models can still outperform experts, because in practice, despite their beliefs that they are doing a non-linear task, expert decisions can often best be modelled as being entirely linear.
But surely the weights of the linear models are subtle and need to be set exactly? Not really. It seems that if you take a linear model, and weight each variable by +1 or −1 depending on whether it has a positive or negative impact on the result, then you will get a model that still often outperforms experts. These models with ±1 weights are a kind of improper linear model.
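To make the ±1 idea concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the four "true" weights are invented, the predictors are independent standard Gaussians (the post argues that correlated predictors help unit weights even more), and the "ideal" score uses the true weights directly instead of fitting them from data, so it is the best case for a proper model.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def simulate(n_cases=5000, seed=0):
    rng = random.Random(seed)
    # Hypothetical "true" weights: unequal in size, signs known to the expert.
    true_w = [0.9, 0.6, -0.4, 0.3]
    # Improper weights: keep only the sign of each true weight.
    unit_w = [1.0 if w > 0 else -1.0 for w in true_w]

    outcome, ideal, improper = [], [], []
    for _ in range(n_cases):
        x = [rng.gauss(0, 1) for _ in true_w]  # standardized predictors
        signal = sum(w * xi for w, xi in zip(true_w, x))
        outcome.append(signal + rng.gauss(0, 1))  # outcome = signal + noise
        ideal.append(signal)
        improper.append(sum(w * xi for w, xi in zip(unit_w, x)))
    return pearson(ideal, outcome), pearson(improper, outcome)


r_ideal, r_improper = simulate()
print(f"ideal weights: r = {r_ideal:.3f}")
print(f"+/-1 weights:  r = {r_improper:.3f}")
```

In this toy setup the unit-weighted score gives up only a small part of the correlation that the ideal weights achieve, which is the sense in which the exact weights "don't matter much" once the variables and their signs are chosen.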
What’s going on here? Well, there’s been a bit of a dodge. I’ve been talking about “taking” a linear model, with “variables”, and weighing the factors depending on a positive or negative “impact”. And to do all that, you need experts. They are the ones that know which variables are important, and know the direction (positive or negative) in which they impact the result. They don’t choose these variables by just taking random possibilities and then figuring out what the direction is. Instead, they understand the situation, to some extent, and choose important variables.
So that’s the real role of the expert here: knowing what should go into the model, what really makes the underlying dependent variable change. Selecting and coding the variable information, in the terms that are often used.
But, just as experts can be very good at that task, they are human, and humans are terrible at integrating lots of information together. So, having selected the variables, they get regularly outperformed by proper linear models. And when you add the fact that the experts have selected variables of comparable importance, and that these variables are often correlated with each other, it’s not surprising that they get outperformed by improper linear models as well.
I work in the data science industry—as a programmer, not a data scientist or statistician. From my general understanding of the field what you’re describing is a broadly accepted assumption. But I might be misled by the fact that the company I work for bases its product on this assumption, so I’m not sure if you’re just describing this thing from another angle or if there’s a different point that I’m missing here or if, in fact, many people spend too much effort trying to hand-tune models.
The data scientists I work with make predictive models in two stages. The first one is to invent (or choose) “features”, which include natural variables from the original dataset, functions acting on one or more variables, or supplementary datasets that they think are relevant. The data scientist applies their understanding of statistics as well as domain knowledge to tell the computer which things to look for and which are clearly false positives to be ignored. And the second stage is to build the actual models using mostly standard algorithms like Random Forest or XGBoost or whatnot, where the data scientist might tweak arguments but the underlying algorithm is generally given and doesn’t allow for as much user choice.
A common toy example is the Titanic dataset. This is a list of passengers on the Titanic, with variables like age, name, ticket class, etc. The task is to build a model that predicts which ones survived when the ship sank. A data scientist would mostly work on feature engineering, e.g. introducing a variable that deduces a passenger’s sex from their name, and focus less on model tuning, e.g. determining the exact weight that should be given to the feature in the model (women and children had much higher rates of survival).
In a more serious example, a data scientist might work on figuring out which generic datasets are relevant at all. Suppose you’re trying to predict where to best open a new Starbucks branch. Should you look at the locations of competing coffee shops? Noise from nearby construction? Public transit stops or parking lots? Nearby tourist attractions or campuses or who knows what else? You can’t really afford to look at everything, it would both take too long (and maybe cost too much) and risk false positives. A good domain expert is the one who generates the best hypotheses. But to actually test those hypotheses, you use standard algorithms to build predictive models, and if a simple linear model works, that’s a good thing—it shows your chosen features were really powerful predictors.
The relevant term in the forecasting literature is judgmental bootstrapping, if anyone wants to dive deeper. It is extremely practically relevant in many circumstances, such as hiring, where ad hoc linear models outperformed veteran hiring managers.
I don’t think this is why improper linear models work. If you have a large number of variables, most of which are irrelevant in the sense of being uncorrelated with the outcome, then the irrelevant variables will be randomly assigned to +1 or −1 weights and will on average cancel out, leaving the signal from the relevant variables, which do not cancel each other out.
So even without an implicit prior from an expert relevance selection effect or any explicit prior enforcing sparsity, you would still get good performance from improper linear models. (And IIRC, when you use something like ridge regression or Laplacian priors, the typical result, especially in high-dimensional settings like genomics or biology, is that most of the variables do drop out or get set to zero, so even in these ‘enriched’ datasets, most of the variables are irrelevant. What’s sauce for the goose is sauce for the gander.)
Adding in more irrelevant variables does change things quantitatively by lowering power due to increased variance and requiring more data, but I don’t see how this leads to any qualitative transition from working to not working such that it might explain why they work. That seems to have more to do with the human subjects overweighting noise and the ‘bet on sparsity’ principle.
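The quantitative effect described here can be checked with a small simulation. This is only a sketch under invented assumptions: five relevant variables with correctly signed unit weights, independent standard Gaussian predictors, unit outcome noise, and irrelevant variables that get random ±1 weights because nobody knows their direction.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def unit_weight_correlation(n_irrelevant, n_relevant=5, n_cases=4000, seed=1):
    rng = random.Random(seed)
    # Random +/-1 weights for the irrelevant variables (no expert to sign them).
    noise_signs = [rng.choice((-1.0, 1.0)) for _ in range(n_irrelevant)]
    outcome, score = [], []
    for _ in range(n_cases):
        # Relevant variables all enter with weight +1 (signs assumed known).
        relevant = sum(rng.gauss(0, 1) for _ in range(n_relevant))
        # Irrelevant variables are independent of the outcome.
        irrelevant = sum(s * rng.gauss(0, 1) for s in noise_signs)
        outcome.append(relevant + rng.gauss(0, 1))
        score.append(relevant + irrelevant)
    return pearson(score, outcome)


results = {m: unit_weight_correlation(m) for m in (0, 20, 100)}
for m, r in results.items():
    print(f"{m:3d} irrelevant variables: r = {r:.3f}")
```

Under these assumptions the irrelevant variables do not bias the score, but they do add variance, so the correlation with the outcome falls steadily as more of them are included while staying above zero. That matches the "quantitative, not qualitative" reading; it says nothing by itself about how human experts would fare on the same variables.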
If I’m not mistaken, a similar principle is at work in explaining why Random Forests / Extremely Randomized Trees empirically work so well on machine learning tasks (and why they also seem to be fairly robust to numerous irrelevant variables). They aren’t linear models in terms of the original variables, but if each tree is a new variable then the collection of trees is a linear model of equally weighted predictors.
Maybe. The explanation I’ve seen floated is that the tree methods are exploiting nearest-neighbor effects with adaptive distances; maybe that winds up being about the same thing.
This will seriously degrade the signal. Normally there are only a few key variables, so adding more random ones with similar weights will increase the amount of spurious results.
i.e. making the model worse.
I don’t think this is true. All the useful weights are set to +1 or −1 by expert assessment, and the non-useful weights are just noise. Why would more data be required?
Yes, but again, where is the qualitative difference? In what sense does this explain the performance of improper linear models versus human experts? Why does the subtle difference between a model based on an ‘enriched’ set of variables and a model based on a non-enriched-but-slightly-worse set ‘explain’ how they perform better than humans?
? I’m not sure what you’re asking for. The basic points are a) experts are bad at integrating information, b) experts are good at selecting important variables of roughly equal importance, and c) these variables are often highly correlated.
a) explains why experts are bad (as in worse than proper linear models), b) and c) explain why improper linear models might perform not too far off proper linear models (and hence be better than experts).
Nice. To make your proposed explanation more precise:
Take a random vector on the n-dim unit sphere. Project to the nearest (+1,-1)/sqrt(n) vector; what is the expected l2-distance / angle? How does it scale with n?
If this value decreases in n, then your explanation is essentially correct, or did you want to propose something else?
Start by taking a random vector x where each coordinate is unit gaussian (normalize later). The projection px just splits into positive coordinates and negative coordinates.
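We are interested in E[(x · px) / (|x| sqrt(n))], i.e. the expected cosine of the angle between x and px, where px is the vector of signs of x and has length sqrt(n).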
If the dimension is large enough, then we won’t really need to normalize; it is enough to start with 1/sqrt(n) gaussians, as we will almost surely get almost unit length. Then all components are independent.
For the angle, we then (approximately) need to compute E(sum_i |x_i| / n), where each x_i is unit Gaussian. This is asymptotically independent of n; so it appears like this explanation of improper linear models fails.
Darn, after reading your comment I mistakenly believed that this would be yet another case of “obvious from high-dimensional geometry” / random projection.
PS. In what sense are improper linear models working? l_1, l_2, or l_infinity sense?
Edit: I was being stupid, leaving the above for future ridicule. We want E(sum_i |x_i| / n)=1, not E(sum_i |x_i|/n)=0.
The folded Gaussian tells us that E[sum_i |x_i|/n] = sqrt(2/pi), for large n. The explanation still does not work, since sqrt(2/pi) ≈ 0.80 < 1, and this gives us the expected error margin of improper high-dimensional models.
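The sqrt(2/pi) figure is easy to check numerically. The sketch below draws Gaussian vectors, takes the sign vector as the "projection", and averages the cosine of the angle between the two; the function name, trial count and seed are arbitrary choices made for this illustration.

```python
import math
import random


def mean_cosine_to_sign_vector(n, trials=200, seed=2):
    """Average cosine between a random Gaussian vector and its sign vector."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(n)]
        norm_x = math.sqrt(sum(v * v for v in x))
        # x . sign(x) = sum_i |x_i|, and |sign(x)| = sqrt(n).
        total += sum(abs(v) for v in x) / (norm_x * math.sqrt(n))
    return total / trials


for n in (10, 100, 1000):
    print(n, round(mean_cosine_to_sign_vector(n), 3))
print("sqrt(2/pi) =", round(math.sqrt(2 / math.pi), 3))
```

The average cosine settles near sqrt(2/pi) as n grows instead of approaching 1, so a uniformly random weight vector stays at a roughly constant angle from its ±1 version no matter how high the dimension is.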
@Stuart: What are the typical empirical errors? Do they happen to be near sqrt(2/pi), which is close enough to 1 to be summarized as “kinda works”?
I am not sure of the point you are making. In particular, I don’t see why anyone would use those improper linear models. It’s not 1979 and we can easily fit a variety of linear models including robust ones. Under which circumstances would you prefer an improper linear model to other alternatives available?
You could use them when building a new model in a new field with experts but little data. But the point is not so much to use these models, but to note that they still outperform experts.
Seconded. Back when I studied this topic for my thesis, the conclusion was not that “improper linear models are great”, but more “experts suck”. And that’s because in cases of repeated predictions, a statistical model is at least going to be consistent, but experts will not be.
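The consistency point can be illustrated with a toy simulation. It assumes, purely for illustration, that the expert applies the same equal-weight rule as the model but adds independent case-to-case noise to each judgment; the noise level `inconsistency_sd` is an invented parameter, not an empirical estimate.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def simulate(n_cases=5000, inconsistency_sd=1.5, seed=3):
    rng = random.Random(seed)
    weights = [1.0, 1.0, 1.0]  # hypothetical rule shared by expert and model
    outcome, expert, model = [], [], []
    for _ in range(n_cases):
        x = [rng.gauss(0, 1) for _ in weights]
        rule = sum(w * xi for w, xi in zip(weights, x))
        outcome.append(rule + rng.gauss(0, 1))
        model.append(rule)  # the rule, applied identically every time
        # The expert follows the same rule but wobbles from case to case.
        expert.append(rule + rng.gauss(0, inconsistency_sd))
    return pearson(model, outcome), pearson(expert, outcome)


r_model, r_expert = simulate()
print(f"consistent model:    r = {r_model:.3f}")
print(f"inconsistent expert: r = {r_expert:.3f}")
```

Even though both use the same variables and the same weights here, the noisy judgments correlate less well with the outcome than the rule applied consistently.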
Again, why would you use this particular model class instead of other alternatives?
That statement badly needs modifiers. I would suggest “some improper linear models sometimes outperform experts”. Note that there is huge selection bias here. Also, your link is from 1979; where is that “still” coming from?
Fair qualifiers.