[QUESTION]: Academic social science and machine learning
I asked this question on Facebook here, and got some interesting answers, but I thought it would be interesting to ask LessWrong and get a larger range of opinions. I’ve modified the list of options somewhat.
What explains why some classification, prediction, and regression methods are common in academic social science, while others are common in machine learning and data science?
For instance, I’ve encountered probit models in some academic social science, but not in machine learning.
Similarly, I’ve encountered support vector machines, artificial neural networks, and random forests in machine learning, but not in academic social science.
The main algorithms that I believe are common to academic social science and machine learning are the most standard regression algorithms: linear regression and logistic regression.
Possibilities that come to mind:
(0) My observation is wrong and/or the whole question is misguided.
(1) The focus in machine learning is on algorithms that can perform well on large data sets. Thus, for instance, probit models may be academically useful but don’t scale up as well as logistic regression.
(2) Academic social scientists take time to catch up with new machine learning approaches. Of the methods mentioned above, random forests and support vector machines were introduced as recently as 1995. Neural networks are older, but their practical implementation is about as recent. Moreover, the practical implementations of these algorithms in the standard statistical software packages that academics rely on are even more recent. (This relates to point (4)).
(3) Academic social scientists are focused on publishing papers, where the goal is generally to determine whether a hypothesis is true. Therefore, they rely on approaches that have clear rules for hypothesis testing and for establishing statistical significance (see also this post of mine). Many of the new machine learning approaches don’t have clearly defined procedures for significance testing. Also, the strength of machine learning approaches lies more in exploration than in testing already-formulated hypotheses (this relates to point (5)).
(4) Some of the new methods are complicated to code, and academic social scientists don’t know enough mathematics, computer science, or statistics to cope with the methods (this may change if they’re taught more about these methods in graduate school, but the relative newness of the methods is a factor here, relating to (2)).
(5) It’s hard to interpret the results of fancy machine learning tools in a manner that yields social-scientific insight. The results of a linear or logistic regression can be interpreted somewhat intuitively: the parameters (coefficients) associated with individual features describe the extent to which those features affect the output variable. Modulo issues of feature scaling, larger coefficients mean those features play a bigger role in determining the output. Pairwise and listwise R^2 values provide additional insight into how much signal and noise there is in individual features. But if you’re looking at a neural network, it’s quite hard to extract human-understandable rules from it. (The opposite direction is not too hard: it is possible to convert human-understandable rules to a decision tree, then use a neural network to approximate that and add appropriate fuzziness. But the neural networks we obtain as a result of machine learning optimization may be quite different from the ones we can interpret as humans.) To my knowledge, there haven’t been many attempts to reinterpret neural network results in human-understandable terms, though Sebastian Kwiatkowski’s comment on my Facebook post points to an example where the results of naive Bayes and SVM classifiers for hotel reviews could be translated into human-understandable terms (namely, reviews that mentioned physical aspects of the hotel, such as “small bedroom”, were more likely to be truthful than reviews that talked about the reasons for the visit or the company that sponsored the visit). But Kwiatkowski’s comment also pointed to other instances where the machine’s algorithms weren’t human-interpretable.
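To make the contrast concrete, here is a minimal sketch in Python, with entirely made-up data and feature effects (scikit-learn is just one convenient implementation): a logistic regression gives one readable coefficient per feature, while a small neural network fit to the same data spreads its “knowledge” across weight matrices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # three hypothetical features
# made-up ground truth: feature 0 helps, feature 1 hurts, feature 2 is noise
y = (0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)   # put features on one scale

logit = LogisticRegression().fit(X_scaled, y)
print("logistic coefficients:", logit.coef_)   # one readable effect size per feature

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(X_scaled, y)
print("MLP weight matrix shapes:", [w.shape for w in mlp.coefs_])
# The MLP's weights are spread across layers; there is no single number per
# feature that plays the role a regression coefficient does.
```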
What’s your personal view on my main question, and on any related issues?
(6) Another possibility (related to (2) and (4)) is that academic social scientists are primarily in the business of sharing information with other academics in their field, so they tend to rely on tools that are already standard within their field. A paper that uses complicated statistics which their colleagues don’t understand is going to have less impact than a paper that makes the same point using the field’s standard tools. It may also have trouble making it through peer review, if the peers don’t have the technical knowledge to evaluate it.
So the incentives aren’t there for an academic to stay on the cutting edge in adopting new complicated techniques. People who are in the business of building things to get results have a stronger incentive to adopt complicated new methods which offer any technical advantage.
I agree the above is true in nearly all cases. In some fields (economics), some papers try to signal value by using needlessly complicated statistics borrowed from other fields.
You should be asking why “statistics” and “machine learning” are different fields. It is a good question!
edit: To clarify, stats is a “service field” for a lot of empirical fields, so lots of them use stats methods rather than ML methods. More computer-science-aligned areas also use more ML, e.g. computational biology. There’s been a lot of cross-fertilization lately, and stats and ML are converging, but the fact that there is a department-level division is supremely weird.
Neural networks are non-linear regression.
Arguably we often can’t usefully interpret statistical models unless they correspond to causal ones! One of the historical differences between ML and stats is that the latter was always concerned with experiments and interpretability, and thus with causal matters, whereas the former was more about prediction and fancy algorithms.
It is very weird to me that ML is so little interested in causality.
In my understanding of academia, people can be very resistant to integrating ideas from “unrelated” fields. The statistical tools that any one person uses are probably more determined by the status quo than anything else. I vote for options (2) and (4) as the most likely.
Every point you made (0)-(5) is correct!
(0) There are some social scientists, especially in political science, who are focused on applying machine learning and text mining methods to political texts. This is a big movement and it’s under the heading “text as data”. Most publications use fairly simple methods, basically calibrated regressions, but a lot of thought went into choosing those and some of the people publishing are mathematically sophisticated.
Example: http://www.justingrimmer.org/
Another prominent example comes in Social Networks, where people from the CS and physics world work on the social side, and some social scientists use the methodology too.
Example: http://cs.stanford.edu/people/jure/
At the Santa Fe Institute, people from all kinds of disciplines do all kinds of stuff, but an overall theme is methods drawing from math and physics applied to the social sciences. This includes networks, statistical physics, and game theory.
Not exactly social science, but Jennifer Dunne applies network analysis to food webs: http://www.santafe.edu/about/people/profile/Jennifer%20A.%20Dunne
I am certain that cutting edge mathematics and ML are applied in pockets of econometrics too. Finance is often in economics departments and ML has thoroughly invaded that, but I admit that’s a stretch.
(1) Social science academics have only recently gained access to large datasets. Especially in survey-based fields like sociology and experimental psychology, small-data-oriented methods are definitely the focus. Large datasets include medical datasets, to the extent that they have access; various massive text repositories including academic paper databases and online datasets; and a very few surveys that have the size and depth to support fancier analyses.
This applies less to probit and more to clustering, Bayes nets, decision trees, etc.
(2) The culture is definitely conservative. I’ve talked to many people interested in the more advanced methods and they have to fight harder to get published; but the tide is changing.
(3) Absolutely. It’s very hard to figure out what coefficients represent when the data are ambiguous and many factors are highly correlated (as they are in social science), and when the model is very possibly misspecified. Clusterings with a “high score” from most methods can be completely spurious, and it takes advanced statistical knowledge to identify this. ML is good for prediction and classification, but this is very rarely the goal of social scientists (though one can imagine how it could be). SVMs and decision trees do a poor job of extracting causal relationships with any certainty.
(4) Again, the culture is conservative and many don’t have this training. A good number know their way around R, though, and newer researchers often come in with quite a bit of stats/CS knowledge. The amount of statistical knowledge in the social sciences is growing fast.
(5) Yes; this is especially true of something like neural networks.
Given that we are on LessWrong, you missed a core one: academic social science uses a bunch of frequentist statistics that perform well if your goal is to prove that your thesis has “statistical significance” but that aren’t useful for learning what’s true. Machine learning algorithms don’t give you p-values.
Do they even give you Bayesian forms of summary values like Bayes factors?
(This is actually a relevant concern for me now: my magnesium self-experiment has finished, and the results are really surprising. To check my linear model, I tried looking at what a random forest might say; it mostly agrees with the analysis… except it also places a lot of importance on a covariate which, in the linear model, is better discarded entirely (the fit improves without it). What does this mean? I dunno. There’s no statistic like a p-value I can use to interpret this.)
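For what it’s worth, a hedged sketch of the kind of comparison I mean, on synthetic data (the self-experiment data itself isn’t reproduced here, and the variable roles are invented): the forest reports an importance score per column, but nothing that plays the role of a p-value.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))             # column 0: "treatment"; column 1: a covariate (made up)
y = 0.6 * X[:, 0] + rng.normal(size=200)  # in this toy setup the outcome ignores the covariate

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
print("random forest importances:", rf.feature_importances_)  # one score per column, no significance test attached

lin = LinearRegression().fit(X, y)
print("linear coefficients:", lin.coef_)
# Neither output tells you, on its own, whether a disagreement between the two
# models about the covariate is signal or noise.
```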
You can turn any kind of analysis (which returns a scalar) into a p-value by generating a zillion fake data sets assuming the null hypothesis, analysing them all, and checking for what fraction of the fake data sets your statistic exceeds that for the real data set.
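A minimal sketch of that procedure, assuming a fully specified (“simple”) null — the N(0, 1) null, the sample size, and the statistic below are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def statistic(data):
    # any scalar-valued analysis would do; here, the absolute sample mean
    return abs(data.mean())

# "real" data (made up): 50 observations we pretend came from the experiment
real_data = rng.normal(loc=0.25, scale=1.0, size=50)
observed = statistic(real_data)

# generate many fake data sets under the fully specified null N(0, 1),
# and check what fraction of their statistics exceed the real one
n_sims = 100_000
fakes = np.array([statistic(rng.normal(loc=0.0, scale=1.0, size=50))
                  for _ in range(n_sims)])

p_value = (fakes >= observed).mean()
print("Monte Carlo p-value:", p_value)
```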
This doesn’t sound true to me. How do you know the underlying distribution of the null when it’s just something like “these variables are independent”?
If you’re working with composite hypotheses, replace “your statistic” with “the supremum of your statistic over the relevant set of hypotheses”.
If there are infinitely many hypotheses in the set then the algorithm in the grandparent doesn’t terminate :).
What I was saying was sort of vague, so I’m going to formalize here.
Data is coming from some random process X(θ,ω), where θ parameterizes the process and ω captures all the randomness. Let’s suppose that for any particular θ, living in the set Θ of parameters where the model is well-defined, it’s easy to sample from X(θ,ω). We don’t put any particular structure (in particular, cardinality assumptions) on Θ. Since we’re being frequentists here, nature’s parameter θ′ is fixed and unknown. We only get to work with the realization of the random process that actually happens, X’ = X(θ′,ω′).
We have some sort of analysis t(⋅) that returns a scalar; applying it to the random data gives us the random variables t(X(θ,ω)), which is still parameterized by θ and still easy to sample from. We pick some null hypothesis Θ0 ⊂ Θ, usually for scientific or convenience reasons.
We want some measure of how weird/surprising the value t(X’) is if θ′ were actually in Θ0. One way to do this, if we have a simple null hypothesis Θ0 = { θ0 }, is to calculate the p-value p(X’) = P(t(X(θ0,ω)) ≥ t(X’)). This can clearly be approximated using samples from t(X(θ0,ω)).
For composite null hypotheses, I guessed that using p(X’) = sup{θ0 ∈ Θ0} P(t(X(θ0,ω)) ≥ t(X’)) would work. Paraphrasing jsteinhardt, if Θ0 = { θ01, …, θ0n }, you could approximate p(X’) using samples from t(X(θ01,ω)), …, t(X(θ0n,ω)), but it’s not clear what to do when Θ0 has infinite cardinality. I see two ways forward. One is approximating p(X’) by doing the above computation over a finite subset of points in Θ0, chosen by gridding or at random. This should give an approximate lower bound on the p-value, since it might miss the θ where the observed data look unexceptional. If the approximate p-value leads you to fail to reject the null, you can believe it; if it leads you to reject the null, you might be less sure and might want to continue trying more points in Θ0. Maybe this is what jsteinhardt means by saying it “doesn’t terminate”?

The other way forward might be to use features of t and Θ0, which we do have some control over, to simplify the expression sup{θ0 ∈ Θ0} P(t(X(θ0,ω)) ≥ c). Say, if t(X(θ,ω)) is convex in θ for any ω and Θ0 is a convex bounded polytope living in some Euclidean space, then the supremum only depends on how P(t(X(θ0,ω)) ≥ c) behaves at a finite number of points.
So yeah, things are far more complicated than I claimed, as I realize now after working through it. But you can do sensible things even with a composite null.
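For concreteness, a rough sketch of the gridding approach with everything invented for illustration: data assumed to be draws from N(μ, 1), composite null Θ0 = { μ : −0.5 ≤ μ ≤ 0.5 }, and statistic t = |sample mean|. Per the caveat above, the grid maximum is only an approximate lower bound on p(X’).

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 50, 20_000

def statistic(data):
    return abs(data.mean())

real_data = rng.normal(loc=0.7, scale=1.0, size=n)   # hypothetical observed data
observed = statistic(real_data)

grid = np.linspace(-0.5, 0.5, 21)                    # finite subset of Theta_0
tail_probs = []
for mu0 in grid:
    # estimate P(t(X(mu0, w)) >= t(X')) at this grid point by simulation
    fakes = np.array([statistic(rng.normal(loc=mu0, scale=1.0, size=n))
                      for _ in range(n_sims)])
    tail_probs.append((fakes >= observed).mean())

print("approximate p-value (sup over the grid):", max(tail_probs))
```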
Yup I agree with all of that. Nice explanation!
I don’t have knowledge on random forests in particular but I did learn a little bit about machine learning in bioinformatics classes.
As far as I understand, you can train your machine learning algorithm on one set of data and then see how well it predicts the values of a different set of data. That gives you values for the sensitivity and specificity of your model, from which you can build a receiver operating characteristic (ROC) plot. You can also do things like checking whether you get a different model if you build it on a different subset of your data, which tells you whether your model is robust.
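For concreteness, a minimal sketch of that evaluation loop, with synthetic data and an arbitrary choice of classifier (nothing here is specific to bioinformatics):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                   # made-up features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# train on one split, evaluate on the held-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, scores)   # (1 - specificity) vs. sensitivity at each threshold
print("held-out AUC:", roc_auc_score(y_test, scores))
```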
The idea of p-values is to decide whether or not your model is true. In general that’s not what machine learning folks are concerned with. They know that their model is a model and not reality, and they care about the receiver operating characteristic.
You don’t know what you are talking about.
The grandchild comment suggests that he does, at least to the level of a typical user (though not a researcher or developer) of these methods.
You really should have mentioned here one of your Facebook responses that maybe the data generating processes seen in social science problems don’t look like (the output of generative versions of) ML algorithms. What’s the point of using a ML method that scales well computationally if looking at more data doesn’t bring you to the truth (consistency guarantees can go away if the truth is outside the support of your model class) or has terrible bang for the buck (even if you keep consistency, you may take an efficiency hit)?
Also, think about how well these methods work over the entire research process. Looking at probit modeling, the first thing that pops out is how its light normal tails suggest that it is sensitive to outliers. If you gave a statistician a big, messy-looking data set on an unfamiliar subject, this would probably push them toward something like logistic regression, with its reliance on a heavier-tailed distribution and better expected robustness. But if you’re the social scientist who assembled the data set, you may be sure that you’ve dealt with any data collection, data entry, measurement, etc. errors, and you may be deeply familiar with each observation. At this stage, outliers are not unknowable random noise that gets in the way of the signal but may themselves be the signal, as they have a disproportionate effect on the learned model. At the least, they are where additional scrutiny should be focused, as long as the entity doing the analysis can provide that scrutiny.
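As a quick numerical check on that tail claim, here is a sketch using SciPy’s standard normal and standard logistic distributions (no attempt is made to match their scales; in the far tail the comparison comes out the same way regardless):

```python
from scipy.stats import norm, logistic

# compare upper-tail probabilities of the two link distributions
for z in (2, 4, 6):
    print(f"z = {z}:  probit tail = {norm.sf(z):.2e},  logit tail = {logistic.sf(z):.2e}")
# At z = 6 the normal tail is ~1e-9 while the logistic tail is ~2.5e-3, so a
# single badly-fit observation incurs a far larger log-likelihood penalty
# under the probit model than under the logit model.
```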