DanArmak comments on The robust beauty of improper linear models

DanArmak 20 May 2017 16:35 UTC
4 points

So that’s the real role of the expert here

I work in the data science industry as a programmer, not a data scientist or statistician. From my general understanding of the field, what you’re describing is a broadly accepted assumption. But I might be misled by the fact that the company I work for bases its product on this assumption. So I’m not sure whether you’re just describing the same thing from another angle, whether there’s a different point that I’m missing, or whether many people do in fact spend too much effort trying to hand-tune models.

The data scientists I work with build predictive models in two stages. The first is to invent (or choose) “features”: natural variables from the original dataset, functions of one or more variables, or supplementary datasets that they think are relevant. The data scientist applies their understanding of statistics as well as domain knowledge to tell the computer which things to look for and which are clearly false positives to be ignored. The second stage is to build the actual models using mostly standard algorithms like Random Forest or XGBoost, where the data scientist might tweak arguments, but the underlying algorithm is generally a given and doesn’t leave as much room for user choice.
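To make the split between the two stages concrete, here is a minimal sketch of stage one in plain Python. The record fields (`age`, `fare`, `party_size`) and the three features are hypothetical, chosen only to show one natural variable, one function of a single variable, and one function of two variables; none of this is taken from a real pipeline.

```python
# Stage 1 sketch: each feature is a named function of one raw record (a dict).
# Field names and feature definitions are hypothetical, for illustration only.
FEATURES = {
    "age": lambda r: r["age"],                                  # natural variable
    "is_child": lambda r: 1.0 if r["age"] < 16 else 0.0,        # function of one variable
    "fare_per_person": lambda r: r["fare"] / r["party_size"],   # function of two variables
}


def build_feature_matrix(records, features=FEATURES):
    """Return (column_names, rows), with one numeric row per record."""
    names = list(features)
    rows = [[float(features[name](record)) for name in names] for record in records]
    return names, rows


records = [
    {"age": 8, "fare": 30.0, "party_size": 3},
    {"age": 40, "fare": 80.0, "party_size": 1},
]
names, X = build_feature_matrix(records)
```

Stage two would then hand `X`, together with the known outcomes, to an off-the-shelf learner. The human judgment lives in `FEATURES`, not in the learning algorithm.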

A common toy example is the Titanic dataset. This is a list of passengers on the Titanic, with variables like age, name, ticket class, etc. The task is to build a model that predicts which ones survived when the ship sank. A data scientist would mostly work on feature engineering, e.g. introducing a variable that deduces a passenger’s sex from their name, and focus less on model tuning, e.g. determining the exact weight that should be given to the feature in the model (women and children had much higher rates of survival).
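The name-based feature mentioned above might look something like the sketch below. It assumes names are written as "Surname, Title. Given names", as in the commonly circulated version of the dataset, and the title-to-sex mapping is an illustrative assumption rather than an exhaustive list.

```python
import re

# Hypothetical mapping from honorific to sex; illustrative, not exhaustive.
TITLE_TO_SEX = {
    "Mr": "male",
    "Master": "male",
    "Mrs": "female",
    "Miss": "female",
    "Ms": "female",
}

# Assumes the "Surname, Title. Given names" layout: capture the word
# between the comma and the first period.
TITLE_PATTERN = re.compile(r",\s*([A-Za-z]+)\.")


def sex_from_name(name):
    """Return 'male', 'female', or None when the title is missing or ambiguous."""
    match = TITLE_PATTERN.search(name)
    if match is None:
        return None
    return TITLE_TO_SEX.get(match.group(1))
```

Titles such as "Dr" deliberately map to `None` here: deciding what to do with ambiguous cases is exactly the kind of judgment call that goes into feature engineering.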

In a more serious example, a data scientist might work on figuring out which generic datasets are relevant at all. Suppose you’re trying to predict where to best open a new Starbucks branch. Should you look at the locations of competing coffee shops? Noise from nearby construction? Public transit stops or parking lots? Nearby tourist attractions or campuses or who knows what else? You can’t really afford to look at everything: it would take too long (and maybe cost too much) and would risk false positives. A good domain expert is the one who generates the best hypotheses. But to actually test those hypotheses, you use standard algorithms to build predictive models, and if a simple linear model works, that’s a good thing: it shows your chosen features were really powerful predictors.
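As a rough illustration of how simple such a linear model can be, here is a sketch of a unit-weighted ("improper") linear score, the kind of model the post under discussion is about. The expert supplies only the features and the direction each one should push; every weight is +1 or -1 on standardized values. The site data and the two features (transit stops nearby, competing coffee shops nearby) are made up for the example.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column carries no signal
    return [(value - mu) / sigma for value in column]


def unit_weight_scores(rows, signs):
    """Score each row as the signed sum of its standardized features."""
    columns = [standardize(list(col)) for col in zip(*rows)]
    return [sum(s * z for s, z in zip(signs, zrow)) for zrow in zip(*columns)]


# Hypothetical candidate sites: [transit stops nearby, competing coffee shops nearby]
sites = [[5, 1], [1, 4], [3, 2]]

# Expert hypotheses: transit helps (+1), competition hurts (-1).
scores = unit_weight_scores(sites, signs=[+1, -1])
best = max(range(len(sites)), key=scores.__getitem__)
```

Nothing is fitted here. If a ranking like this holds up against real outcomes, that says more about the chosen features than about the model.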