I’ve always disliked discussing statistics and finance, even though I enjoy learning about almost everything. The sense I got was that to understand and use it at all, you’d have to constantly keep mastering it and all of its tricks—that there was no real in-between. The rules were always changing, and so were the underlying conditions.
One thing going on there for statistics is that the field greatly dislikes presenting it in any of the unifications which are available, which is something I learned only quite late myself. As often taught or discussed, statistics is treated as a bag of tricks and p-values and problem-specific algorithms. But there are paradigms one could teach.
For example, around the 1940s, led by Abraham Wald, there was a huge paradigm shift towards the decision-theoretic interpretation of statistics, where all these Fisherian gizmos can be understood, justified, and criticized as being about minimizing loss given specific loss functions; the mean is a good way to estimate your parameter (rather than the mode or median or a bazillion other univariate statistics one could invent) not because that particular function was handed down at Sinai but because it does a good job of minimizing your loss under such-and-such conditions like having a squared error loss (because bigger errors hurt you much more), and if those conditions do not hold, that is why, say, the median is better, and you can say precisely how much better and when you’d go back to the mean (as opposed to rules of thumb about standard deviations or arbitrary p-value thresholds for testing normality). Many issues in meta-science are much more transparent if you simply ask how they would affect decision-making.
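To make that framing concrete, here is a toy sketch (my own made-up numbers, in Python/NumPy, not drawn from any particular analysis): the mean and median are just the estimates that minimize empirical squared-error and absolute-error loss respectively, and throwing in a handful of wild outliers shows how the two diverge.

```python
# Toy illustration: which summary statistic "wins" depends on the loss function.
# All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(loc=5.0, scale=1.0, size=1000)              # well-behaved data
contaminated = np.concatenate([clean, rng.normal(50, 1, 30)])  # plus a few wild outliers

def empirical_loss(data, estimate, loss):
    if loss == "squared":
        return np.mean((data - estimate) ** 2)   # squared error: big misses hurt a lot
    return np.mean(np.abs(data - estimate))      # absolute error: misses hurt linearly

for name, data in [("clean", clean), ("contaminated", contaminated)]:
    for loss in ["squared", "absolute"]:
        m, med = np.mean(data), np.median(data)
        print(f"{name:13s} {loss:8s} loss: mean estimate scores {empirical_loss(data, m, loss):.3f}, "
              f"median estimate scores {empirical_loss(data, med, loss):.3f}")
```

Under squared-error loss the mean scores better, under absolute-error loss the median does, and the contaminated case makes the gap visible: “which statistic?” is really “which loss?”.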
Similarly, Bayesianism means you can just ‘turn the crank’ on many problems: define a model, your priors, and turn the MCMC crank, without all the fancy problem-specific derivations and special-cases. Instead of all these mysterious distributions and formulas and tests and likelihoods dropping out of the sky, you understand that you are just setting up equations (or even just writing a program) which reflect how you think something works in a sufficiently formalized way that you can run data through it and see how the prior updates into the posterior. The distributions & likelihoods then do not drop out of the sky but are pragmatic choices: what particular bits of mathematics are implemented in your MCMC library, and which match up well with how you think the problem works, without being too confusing or hard to work with or computationally-inefficient?
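As a sketch of what “turning the crank” looks like at its most minimal (a toy model with made-up data, using a hand-rolled Metropolis sampler rather than any particular library): you write down a log-prior and a log-likelihood reflecting how you think the data was generated, and the sampler does the rest.

```python
# Minimal Metropolis sampler for a toy model: data ~ Normal(mu, 1),
# prior mu ~ Normal(0, 10). Illustrative only; a real analysis would use
# an MCMC library (Stan, PyMC, etc.) and a model worth caring about.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=50)     # pretend this is your experiment

def log_posterior(mu):
    log_prior = -0.5 * (mu / 10.0) ** 2            # Normal(0, 10) prior, up to a constant
    log_lik = -0.5 * np.sum((data - mu) ** 2)      # Normal(mu, 1) likelihood, up to a constant
    return log_prior + log_lik

samples, mu = [], 0.0
for _ in range(20_000):
    proposal = mu + rng.normal(scale=0.5)          # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal                              # accept; otherwise keep the current value
    samples.append(mu)

posterior = np.array(samples[5_000:])              # discard burn-in
print(f"posterior mean ~{posterior.mean():.2f}, 95% interval "
      f"({np.quantile(posterior, 0.025):.2f}, {np.quantile(posterior, 0.975):.2f})")
```

Swap in a different prior or likelihood and the same crank still turns; that is the whole point.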
And causal modeling is another good example: there is an endless zoo of biases and problems in fields like epidemiology which look like a mess of special cases you just have to memorize, but they all reduce to pretty straightforward and obvious issues if you draw out a causal DAG of how you think things might work.
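Collider bias is a good example of something that sounds exotic until the arrows are on paper: two independent causes both feed into a third variable, and conditioning on that third variable manufactures a correlation between them. A toy simulation (variable names and cutoff made up for illustration):

```python
# Toy collider simulation: "talent" and "looks" are independent by construction,
# both cause "admission", and looking only at the admitted induces a spurious
# negative correlation between them.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
talent = rng.normal(size=n)
looks = rng.normal(size=n)                 # independent of talent
admitted = (talent + looks) > 1.5          # the collider: both arrows point into it

print("corr(talent, looks), everyone:      ", round(np.corrcoef(talent, looks)[0, 1], 3))
print("corr(talent, looks), admitted only: ", round(np.corrcoef(talent[admitted], looks[admitted])[0, 1], 3))
```

The first correlation is roughly zero and the second is strongly negative, purely because of the selection, which is obvious once the DAG is drawn and baffling otherwise.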
Much of the ‘experience’ that statisticians or analysts rely on when they apply the bag of tricks is actually a hidden theory learned from experience & osmosis, used to reach the correct results while ostensibly using the bag of tricks: the analyst knows he ought to use a median here because he has a vaguely defined loss in mind for the downstream experiment, and he knows the data sometimes throws outliers which screwed up experiments in the past so the mean is a bad choice and he ought to use ‘robust statistics’; or he knows from experience that most of the variables are irrelevant so it’d be good to get shrinkage by sleight of hand by picking a lasso regression instead of a regular regression and if anyone asks, talk vaguely about ‘regularization’; or he has a particular causal model of how enrollment in a group is a collider so he knows to ask about “Simpson’s paradox”. Thus, in the hands of an expert, the bag of tricks works out, even as the neophyte is mystified and wonders how the expert knew to pull this or that trick out of, seemingly, their nether regions.
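The “shrinkage by sleight of hand” move, for instance, can be seen in a couple of lines (a made-up dataset and a hand-picked penalty, using scikit-learn): when only a few of many predictors actually matter, the lasso’s L1 penalty zeroes most coefficients out, while ordinary least squares happily fits noise to all of them.

```python
# Sketch of lasso-as-shrinkage: only 3 of 50 predictors matter, and the L1
# penalty zeroes out most of the rest where plain OLS assigns them noise.
# Data and penalty are made up for illustration; cross-validate alpha in practice.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(3)
n, p = 100, 50
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -1.5, 1.0]                 # only the first 3 predictors matter
y = X @ true_beta + rng.normal(scale=1.0, size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("nonzero OLS coefficients:  ", int(np.sum(np.abs(ols.coef_) > 1e-6)))   # typically all 50
print("nonzero lasso coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-6))) # typically only a handful
```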
Teachers don’t like this because they don’t want to defend the philosophies of things like Bayesianism, often aren’t trained in them in the first place, and because teaching them is simultaneously too easy (the concepts are universal, straightforward, and can be one-liners) and too hard (reducing them to practice and actually computing anything—it’s easy to write down Bayes’s formula, not so easy to actually compute a real posterior, much less maximize over a decision tree).
There are a lot of criticisms that can be made of each paradigm, of course, and none of them is universally assented to, to say the least—but I think it would generally be better to teach people in those principled approaches, and then later critique them, than to teach people without any principles at all.
Thanks Gwern! I was wondering if you had any pointers as to where beginners should start in terms of understanding statistics paradigmatically? I’ve not come across statistics explained this way before, and I am quite interested to learn more.
Thanks for the explanation—that all makes sense. I guess what I was getting at is that as you said, it can be done in a completely sensible way by people who know what they’re doing, but it tends to become split up in awkward ways.