The two examples you give (Bayesian statistics and calculus) are very good ones; I would definitely recommend becoming familiar with these. I am not sure how much is covered by the ‘calculus’ label, but I would recommend trying to understand on a gut level what a differential equation means (this is simpler than it might sound; solving them, on the other hand, is hard and often tedious). I believe vector calculus (linear algebra) and its combination with differential equations (linear ODEs of dimension at least two) are also covered by ‘calculus’? Again, the ability to solve them isn’t that important in most fields (in my limited experience), but grasping what exactly is happening is very valuable.
If you are wholly unfamiliar with statistics then I would also advise looking into frequentist statistics after having studied Bayesian statistics—frequentist tools provide very accurate and easily computable approximations to Bayesian inference, and being able to recognise/use these is useful in most sciences (from social science all the way to theoretical physics).
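To make the ‘easily computable approximations’ point concrete, here is a minimal sketch (my own toy example, not from the comment above) of the standard textbook case where the two camps coincide: for a normal mean with known variance and a flat prior, the frequentist confidence interval and the Bayesian credible interval are numerically the same.

```python
import numpy as np
from scipy import stats

# Toy data: 50 measurements from a normal distribution with known sigma.
rng = np.random.default_rng(0)
sigma = 1.5
data = rng.normal(loc=2.0, scale=sigma, size=50)

n, xbar = len(data), data.mean()
se = sigma / np.sqrt(n)
z = stats.norm.ppf(0.975)

# Frequentist 95% confidence interval for the mean.
freq_ci = (xbar - z * se, xbar + z * se)

# Bayesian 95% credible interval under a flat prior on the mean:
# the posterior is Normal(xbar, se^2), so the two intervals coincide here.
posterior = stats.norm(loc=xbar, scale=se)
bayes_ci = (posterior.ppf(0.025), posterior.ppf(0.975))

print(freq_ci)
print(bayes_ci)
```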
I would advise looking into frequentist statistics before studying Bayesian statistics. Inference done under Bayesian statistics is curiously silent about anything besides the posterior probability, including whether the model makes sense for the data, whether the knowledge gained about the model is likely to match reality, etc. Frequentist concepts like consistency, coverage probability, ancillarity, model checking, etc., don’t just apply to frequentist estimation; they can be used to assess and justify Bayesian procedures.
If anything, Bayesian statistics should just be treated as a factory that churns out estimation procedures. By a corollary of the complete class theorem, this is also the only way you can get good estimation procedures.
ETA: Can I get comments in addition to (or instead of) downvotes here? This is a topic I don’t want to be mistaken about, so please tell me if I’m getting something wrong. Or rather, tell me if my comment is coming across as “boo Bayes”, which would call for punishment.
I would advise looking into frequentist statistics before studying Bayesian statistics.
Actually, if you have the necessary math background, it will probably be useful to start by looking at why and how the frequentists and the Bayesians differ.

Some good starting points, in addition to Bayes, are Fisher information and Neyman-Pearson hypothesis testing. This paper by Gelman and Shalizi could be interesting as well.
Thanks for pointing out the Gelman and Shalizi paper. Just skimmed it so far, but it looks like it really captures the zeitgeist of what reasonably thoughtful statisticians think of the framework they’re in the business of developing and using.
Plus, their final footnote, describing their misgivings about elevating Bayesianism beyond a tool in the hypothetico-deductive toolbox, is great:
Ghosh and Ramamoorthi (2003, p. 112) see a similar attitude as discouraging inquiries into consistency: ‘the prior and the posterior given by Bayes theorem [sic] are imperatives arising out of axioms of rational behavior – and since we are already rational why worry about one more’ criterion, namely convergence to the truth?
I’m afraid I don’t understand. (Theoretical) Bayesian statistics is the study of probability flows under minimal assumptions—any quantity that behaves like we want a probability to behave can be described by Bayesian statistics. Therefore learning this general framework is useful when later looking at applications and most notably approximations. For what reasons do you suggest studying the approximation algorithms before studying the underlying framework?
Also, you mention ‘Bayesian procedures’; I would like to clarify that I wasn’t referring to any particular Bayesian algorithm but to the complete study of (uncomputable) ideal Bayesian statistics.
(Theoretical) Bayesian statistics is the study of probability flows under minimal assumptions—any quantity that behaves like we want a probability to behave can be described by Bayesian statistics.
But nobody, least of all Bayesian statistical practitioners, does this. They encounter data, get familiar with it, pick/invent a model, pick/invent a prior, run (possibly approximate) inference of the model against the data, verify if inference is doing something reasonable, and jump back to an earlier step and change something if it doesn’t. After however long this takes (if they don’t give up), they might make some decision based on the (possibly approximate) posterior distribution they end up with. This decision might involve taking some actions in the wider world and/or writing a paper.
This is essentially the same workflow a frequentist statistician would use, and it’s only reasonable that a lot of the ideas that work in one of these settings would be useful, if not obvious or well-motivated, in the other.

I know that philosophical underpinnings and underlying frameworks matter, but to quote from a recent review article by Reid and Cox (2014):
A healthy interplay between theory and application is crucial for statistics, as no doubt for other fields. This is particularly the case when by theory we mean foundations of statistical analysis, rather than the theoretical analysis of specific statistical methods. The very word foundations may, however, be a little misleading in that it suggests a solid base on which a large structure rests for its entire security. But foundations in the present context equally depend on and must be tested and revised in the light of experience and assessed by relevance to the very wide variety of contexts in which statistical considerations arise. It would be misleading to draw too close a parallel with the notion of a structure that would collapse if its foundations were destroyed.
But nobody, least of all Bayesian statistical practitioners, does this.
Well, obviously. The same goes for physicists: nobody (other than some highly specialised teams working at particle accelerators) uses the Standard Model to compute the predictions of their models. Or for computer science—most computer scientists don’t write code at the binary level, or explicitly give commands to individual transistors. Or chemists—just how many reaction equations do you think are checked by solving the quantum mechanics? But just because the underlying theory doesn’t give as good a result-vs-time trade-off as some simplified model does not mean that the underlying theory can be ignored altogether (note that in my examples above the respective researchers do study the fundamentals, but then hardly ever need to apply them). By studying the underlying (often mathematically elegant) theory first, one can later look at the messy real-world examples through the lens of this theory, and see how the tricks used in practice mostly make use of, but often partly disagree with, the overarching theory. This is why studying theoretical Bayesian statistics is a good investment of time—after this, all other parts of statistics become more accessible and intuitive, as the specific methods can be fitted into the overarching theory.
Of course, if you actually want to apply statistical methods to a real-world problem, I think the frequentist toolbox is one of the best options available (in terms of results vs. effort). But it becomes easier to understand these algorithms (which assumptions they make where, where they use shortcuts or substitutions for the sake of computation, and exactly where, how and why they might fail, etc.) if you become familiar with the minimal consistent framework for statistics, which to the best of my knowledge is Bayesian statistics.
Have you seen the series of blog posts by Robins and Wasserman that starts here? In problems like the one discussed there (such as the high-dimensional ones that are commonly seen these days), Bayesian procedures, and more broadly any procedures that satisfy the likelihood principle, just don’t work. The procedures that do work, according to frequentist criteria, do not arise from the likelihood so it’s hard to see how they could be approximations to a Bayesian solution.
You can also see this situation in the (frequentist) classic Theory of Point Estimation by Lehmann and Casella. The text has four central chapters: “Unbiasedness”, “Equivariance”, “Average Risk Optimality”, and “Minimaxity and Admissibility”. Each of these introduces a principle for the design of estimators and then shows where this principle leads. “Average Risk Optimality” leads to Bayesian inference, but also to Bayes-Lite methods like empirical Bayes. But each of the other three chapters leads to its own theory, with its own collection of methods that are optimal under that theory. Bayesian statistics is an important and substantial part of the story told in that book, but it’s not the whole story. Said differently, Bayesian statistics may be a framework for Bayesian procedures and a useful way of analyzing non-Bayesian statistics, but it is not the framework for all of statistics.
That’s an interesting example, thanks for linking it. I read it carefully, and also some of the Robins/Ritov CODA paper:

http://www.biostat.harvard.edu/robins/coda.pdf

and I think I get it. The example is phrased in the language of sampling/missing data, but for those in the audience familiar w/ Pearl, we can rephrase it as a causal inference problem. After all, causal inference is just another type of missing data problem.
We have a treatment A (a drug), and an outcome Y (death). Doctors assign A to some patients, but not others, based on their baseline covariates C. Then some patients die. The resulting data is an observational study, and we want to infer from it the effect of drug on survival, which we can obtain from p(Y | do(A=yes)).
We know in this case that p(Y | do(A=yes)) = sum{C} p(Y | A=yes,C) p(C) (this is just what “adjusting for confounders” means).
If we then had a parametric model for E[Y | A=yes,C], we could just fit that model and average (this is “likelihood based inference.”) Larry and Jamie are worried about the (admittedly adversarial) situation where maybe the relationship between Y and A and C is really complicated, and any specific parametric model we might conceivably use will be wrong, while non-parametric methods may have issues due to the curse of dimensionality in moderate samples. But of course the way we specified the problem, we know p(A | C) exactly, because doctors told us the rule by which they assign treatments.
Something like the Horvitz/Thompson estimator which uses this (correct) model only, or other estimators which address issues with the H/T estimator by also using the conditional model for Y, may have better behavior in such settings. But importantly, these methods are exploiting a part of the model we technically do not need (p(A | C) does not appear in the above “adjustment for confounders” expression anywhere), because in this particular setting it happens to be specified exactly, while the parts of the models we do technically need for likelihood based inference to work are really complicated and hard to get right at moderate samples.
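A small simulation sketch of the contrast being described, under an invented data-generating process (the covariates, treatment rule, and outcome model below are mine, chosen only to make the mechanics visible; the actual Robins-Ritov construction needs a much higher-dimensional covariate for the failure to really bite). The plug-in estimator fits a possibly misspecified outcome model with sklearn, while the Horvitz-Thompson estimator leans only on the exactly known p(A | C).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d = 2000, 5
C = rng.binomial(1, 0.5, size=(n, d))            # baseline covariates

# Treatment assignment rule p(A=1 | C): known exactly, because the doctors told us.
def propensity(C):
    return 0.1 + 0.8 * C[:, 0] * C[:, 1]
A = rng.binomial(1, propensity(C))

# A deliberately awkward outcome model, so that a simple parametric fit is misspecified.
def p_outcome(A, C):
    return 1 / (1 + np.exp(-(0.5 * A + np.sin(3 * C.sum(axis=1)) - 0.2)))
Y = rng.binomial(1, p_outcome(A, C))

# The target E[Y | do(A=1)], computed directly from the (normally unknown) truth.
truth = p_outcome(np.ones(n), C).mean()

# 1) Likelihood-based plug-in: fit a (misspecified) model for E[Y | A=1, C] on the
#    treated patients, then average its predictions over the covariate distribution.
model = LogisticRegression().fit(C[A == 1], Y[A == 1])
plug_in = model.predict_proba(C)[:, 1].mean()

# 2) Horvitz-Thompson / inverse-probability weighting: uses only the known p(A=1 | C).
horvitz_thompson = np.mean(A * Y / propensity(C))

print(f"truth={truth:.3f}  plug-in={plug_in:.3f}  Horvitz-Thompson={horvitz_thompson:.3f}")
```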
But these kinds of estimators are not Bayesian. Of course arguably this entire setting is one Bayesians don’t worry about (but maybe they should? These settings do come up).
The CODA paper apparently stimulated some subsequent Bayesian activity, e.g.:

http://www.is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/techreport2007_6326[0%5D.pdf

So, things are working as intended :).
You’re welcome for the link, and it’s more than repaid by your causal inference restatement of the Robins-Ritov problem.
Of course arguably this entire setting is one Bayesians don’t worry about (but maybe they should? These settings do come up).
Yeah, I think this is the heart of the confusion. When you encounter a problem, you can turn the Bayesian crank and it will always do the Right thing, but it won’t always do the right thing. What I find disconcerting (as a Bayesian drifting towards frequentism) is that it’s not obvious how to assess the adequacy of a Bayesian analysis from within the Bayesian framework. In principle, you can do this mindlessly by marginalizing over all the model classes that might apply, maybe? But in practice, a single model class usually gets picked by non-Bayesian criteria like “does the posterior depend on the data in the right way?” or “does the posterior capture the ‘true model’ from simulated data?”. Or a Bayesian may (rightly or wrongly) decide that a Bayesian analysis is not appropriate in that setting.
As I’ve mentioned several times above, Bayesian statistics is not just a set of estimators to be used on problems; it is the minimal framework for probability that satisfies Cox’s theorem. This means that any algorithm that isn’t even approximately Bayesian will spit out something other than (an approximation of) the posterior probability. In other words, in order to get any sort of answer that can reasonably be used for further computation there has to be a Bayesian explanation, otherwise what your algorithm is doing just doesn’t have anything to do with statistics. This does not mean that the only useful algorithms are those crafted by trying to compute the likelihood ratio, nor does it mean that there is always a simple algorithm that would be classified as a ‘Bayesian algorithm’. It merely means that to do probability you have to do Bayes, and then maybe some more.
Did you actually read and understand the linked example? The entire point of it is that unless you basically craft your prior to mirror the frequentist behavior, your posterior will center on the truth super slowly. And the setting is not very artificial: exposure/outcome relationships w/ baseline covariates often are complicated, and we often do know randomization probabilities in trials.
Why would I want to approximate your posterior if it has this shitty behavior?
Can you elaborate on this? I don’t think that’s how most people understand Bayesian statistics.

I will give it a shot (I recall reading a well-written explanation elsewhere on LW, and I don’t expect to be as clear as what I read there).
In any estimation or prediction setting we are interested in making accurate probabilistic claims about the behaviour of our system of study. In particular we would like to give a description of how the system will behave in the future (for example: ‘this drug cures patients 30% of the time’). This is captured by the posterior probability distribution.
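As a minimal illustration of what that posterior looks like for the drug example (the Beta prior and the 30-out-of-100 data are my own invented choices, purely for illustration):

```python
from scipy import stats

# Invented data: 30 of 100 patients were cured.
cured, n = 30, 100

# With a uniform Beta(1, 1) prior on the cure rate, the posterior is Beta(31, 71).
posterior = stats.beta(1 + cured, 1 + n - cured)

print(posterior.mean())            # about 0.30, the 'cures 30% of the time' claim
print(posterior.interval(0.95))    # 95% credible interval for the cure rate
print(1 - posterior.cdf(0.25))     # e.g. probability that the cure rate exceeds 25%
```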
Now if we have any algorithm whatsoever that makes statements about our situation, and reliably gives the right answers (for example: most frequentist algorithms), then that algorithm must do something to convert data into predictions, i.e. it must spit out something that can be rewritten as an (ideal) posterior distribution (which captures all our knowledge and therefore predictions).
Now Bayes’ Law gives a formula for this posterior distribution—having observed data D, the posterior probability of any hypothesis A is P(A|D) = P(D|A)P(A)/P(D) (here P(D) is often written as sum(P(D|B)P(B)), where B runs over a complete set of mutually exclusive hypotheses). In order to give any accurate results whatsoever, an algorithm has to produce results that can be interpreted as the quantity above, P(A|D). This is starting to look Bayesian.
But now one can raise the (valid) point that even though we might want to produce the distribution above, we don’t have to explicitly use the equation above to determine it—and indeed this is what gives birth to the wide range of statistical algorithms (this, along with the fact that the formula above is impossible to compute—there are simply too many hypotheses—so we have to make approximations anyway, and different approximations lead to different algorithms). There are two important observations about this point:
1) There are many, many theoretically possible algorithms (by which I mean computer programs, not just programs limited to statistics). If you were to create an algorithm at random it would most likely not produce (approximations of) posterior distributions in a reliable and efficient way. So to produce any statistical algorithm at all you have to somehow make use of the mathematics of probability, and a good starting point is Bayes’ Theorem. Unless you know or can come up with a different formula for the l.h.s. (or at least a good new approximation) you have no choice but to approximate the r.h.s. (note that most of the time P(D|B) is known, so several of the terms on the r.h.s. are already known). This is part of the reason why many statistical algorithms can easily be interpreted and understood with Bayes’ Theorem—quite a few of them are derived from it. For example, for the classical testing of a hypothesis we assume that we initially have a set of hypotheses that all have exactly the same probability, and furthermore assume that some hypothesis neatly fits the data (P(D|B) ~ 1 for some B). Then if for our favourite hypothesis H0 we find a small p-value P(D|H0), i.e. the model did not predict the data, we find a posterior of P(H0|D) = P(D|H0) / sum(P(D|B)), which is also very small (the denominator is order 1, the numerator is small). Viewed from a Bayesian framework it is clear that it will therefore often suffice to compute P(D|H) rather than go for P(H|D), unless the data really is inexplicable by all of our hypotheses (P(D|B) is small for every B we consider) or our hypotheses do not have equal (or almost equal) initial probability. For example, the theory of general relativity is supported by quite a bit of data, so if a single experiment were to disagree with the theory, even with an intimidatingly low p-value, the posterior probability of the theory being right might still be large. By considering our algorithm with Bayes’ Law in mind at all times we not only understand why the algorithm (p-values) works, but also exactly in which cases it breaks down, and are even presented with a way to improve the algorithm in those cases. (A small numerical sketch of this p-value reasoning appears below, after point 2.)
2) The more important reason that the formula above matters is that it is true, even if you decide to use a radically different way of summarising your information to make predictions. If an ideal Bayesian reasoner shows up, computes the posterior probability, and then proceeds to make predictions, then most of the time your algorithm and the reasoner are going to agree (since most of the time you both make valid predictions). So there has to be some part of Bayes’ Law that reflects what your algorithm does: if your algorithm works reliably, and Bayes’ Law gives a mathematical formula for what you should predict, then somehow Bayes’ Law can describe your algorithm. That is to say, if your algorithm predicts ‘hypothesis H is likely/true, now that we have seen data D’, then there must be some part of the formula P(H|D) that makes it large, otherwise the two of you would not agree. This is the most valuable insight into Bayesian statistics that I know: if anybody computes anything at all that acts like a prediction/probability and reliably gets correct results, then the terms in Bayes’ Theorem must happen to line up just right to give similar predictions, so we can describe the algorithm (and maybe even improve it) by making precise which terms act how. By looking at exactly what predictions a statistical algorithm will make under which conditions, we can find out how the unknown terms on the r.h.s. of Bayes’ Theorem should act to produce similar results (the predictions from our non-Bayesian algorithm were correct most of the time, so there has to be some behaviour of these terms that lets us mimic the results). This explains the overt and hidden assumptions, and thereby the limitations, of the model, provides understanding and more insight (I hope my p-value example above illustrates this—I certainly learned more about why people weren’t concerned about 5-sigma disagreements with GR by studying Bayes), and sometimes even suggests improvements. This is what I was trying to say earlier, when I said things like “look at the messy real-world examples through the lens of [Bayesian statistics] and see how the tricks that are used in practice are mostly making use of but often partly disagree with the overarching theory” (I was lazy and tried to use fewer words; clearly this backfired).
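Here is the promised numerical sketch of the p-value reasoning in point 1), with invented hypotheses and likelihoods, treating the p-value loosely as P(D|H0) the way the comment does:

```python
# Equal priors, one hypothesis that fits the data comfortably (P(D|B) ~ 1 for some B).
likelihoods = {"H0": 0.01, "H1": 0.70, "H2": 0.20}   # invented values of P(D | B)
prior = 1 / len(likelihoods)                         # all hypotheses equally probable a priori

evidence = sum(p * prior for p in likelihoods.values())               # P(D)
posterior = {h: p * prior / evidence for h, p in likelihoods.items()} # P(B | D)

print(posterior["H0"])   # about 0.011: small numerator, order-1 denominator, small posterior

# The general-relativity caveat: if H0 starts out strongly favoured, the same small
# P(D | H0) no longer forces a small posterior.
priors = {"H0": 0.98, "H1": 0.01, "H2": 0.01}
evidence = sum(likelihoods[h] * priors[h] for h in likelihoods)
print(likelihoods["H0"] * priors["H0"] / evidence)   # about 0.52
```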
I would also like to remark that so far in this post there has been no mention whatsoever of computational efficiency, and it should not be surprising that actual Bayesian algorithms are therefore often slower (or even uncomputable) compared to other options (like I said above, there are many possible algorithms; there is no reason why the correct one and the fast one should be the same). Therefore for real-world applications a non-Bayesian algorithm is often your best bet. But to study and understand probabilities and general algorithms the Bayesian approach is superior—all because Bayes’ Theorem is a Theorem.
Lastly, I would like to remark that people (including me, above) often write as if Bayesian statistics and frequentist statistics are in active conflict, which is mostly but not fully correct. Frequentist tests (and other tests; I tend to forget that there are more than two schools of statistics) are often superior in practice, and should most definitely be used in practice. But given an algorithm, some peculiar black box that gives the right answers, or given a philosophical conundrum about probabilities, the correct thing to do is write down Bayes’ Theorem and figure out how the unknowns should act to give the results we are faced with, i.e. figure out how an ideal Bayesian reasoner could come to the same conclusion. And it really is a shame that this doesn’t always lead to quick algorithms in practice.
I hope this has cleared up some misunderstandings.
Thanks for the write-up. I read it as you arguing that most any prediction can be interpreted in the Bayesian framework, which, I think, is a weaker claim.
However, there are issues with treating it as the only right way, for it leaves a number of important questions unanswered. For example, how do you pick the prior? How do you assemble your set of possible outcomes (= hypotheses)? What happens if your forecast influences the result?
I also think that being uncomputable is a bigger deal than you make it out to be.
I think that the claim that any prediction can be interpreted in this minimal and consistent framework without exceptions whatsoever is a rather strong claim; I don’t think I want to claim much more than that (although I do want to add that if we have such a unique framework that is both minimal and complete when it comes to making predictions, then that seems like a very natural choice for Statistics with a capital S).
I don’t think we’re going to agree about the importance of computability without more context. I agree that every time I try to build myself a nice Bayesian algorithm I run into the problem of uncomputability, but personally I consider Bayesian statistics to be more of a method of evaluating algorithms than a method for creating them (although Bayesian statistics is by no means limited to this!).
As for your other questions: it is important to note that your issues are issues with Bayesian statistics as much as they are issues with any other form of prediction-making. To pick a frequentist algorithm is to pick a prior with a set of hypotheses, i.e. to make Bayes’ Theorem computable and provide the unknowns on the r.h.s. above (as mentioned earlier you can in theory extract the prior and set of hypotheses from an algorithm by considering which outcome your algorithm would give when it saw a certain set of data, and then inverting Bayes’ Theorem to find the unknowns. At least, I think this is possible (it has worked so far)). And indeed picking the prior and set of hypotheses is not an easy task—this is precisely what leads to different competing algorithms in the field of statistics.
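A brute-force sketch of that inversion idea, with an invented decision rule and a deliberately tiny hypothesis set (two coin biases); the point is only that a non-Bayesian threshold rule can be reproduced exactly by MAP decisions under some prior, which the code finds by searching:

```python
import numpy as np
from scipy import stats

# A non-Bayesian black-box rule: after 10 flips, declare the coin 'biased'
# whenever 7 or more heads are observed (an invented threshold rule).
def rule(heads):
    return "biased" if heads >= 7 else "fair"

# Candidate hypotheses: the coin is fair (p = 0.5) or biased (p = 0.8).
k = np.arange(11)
lik_fair = stats.binom.pmf(k, 10, 0.5)
lik_biased = stats.binom.pmf(k, 10, 0.8)

# Search for priors P(biased) whose MAP decision matches the rule on every possible outcome.
matching = []
for prior_biased in np.linspace(0.01, 0.99, 99):
    map_decision = [
        "biased" if lik_biased[i] * prior_biased > lik_fair[i] * (1 - prior_biased) else "fair"
        for i in k
    ]
    if all(map_decision[i] == rule(i) for i in k):
        matching.append(round(prior_biased, 2))

print(matching)   # roughly 0.37 through 0.69: a whole range of priors reproduces the rule
```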
To pick a frequentist algorithm is to pick a prior with a set of hypotheses, i.e. to make Bayes’ Theorem computable and provide the unknowns on the r.h.s. above (as mentioned earlier you can in theory extract the prior and set of hypotheses from an algorithm by considering which outcome your algorithm would give when it saw a certain set of data, and then inverting Bayes’ Theorem to find the unknowns.
Okay, this is the last thing I’ll say here until/unless you engage with the Robins and Wasserman post that IlyaShpitser and I have been suggesting you look at. You can indeed pick a prior and hypotheses (and I guess a way to go from posterior to point estimation, e.g., MAP, posterior mean, etc.) so that your Bayesian procedure does the same thing as your non-Bayesian procedure for any realization of the data. The problem is that in the Robins-Ritov example, your prior may need to depend on the data to do this! Mechanically, this is no problem; philosophically, you’re updating on the data twice and it’s hard to argue that doing this is unproblematic. In other situations, you may need to do other unsavory things with your prior. If the non-Bayesian procedure that works well looks like a Bayesian procedure that makes insane assumptions, why should we look to Bayesianism as a foundation for statistics?
(I may be willing to bite the bullet of poor frequentist performance in some cases for philosophical purity, but I damn well want to make sure I understand what I’m giving up. It is supremely dishonest to pretend there’s no trade-off present in this situation. And a Bayes-first education doesn’t even give you the concepts to see what you gain and what you lose by being a Bayesian.)
the claim that any prediction can be interpreted in this minimal and consistent framework without exceptions whatsoever is a rather strong claim
The Bayes Rule by itself is not a framework. It’s just a particular statistical operation, useful no doubt, but hardly rising to the level of a framework.
The claim that you can interpret any prediction as forecasting a particular probability distribution has nothing to do with Bayes. For example, let’s say that an analyst predicts the average growth in the GDP of China for the next five years to be 5%. If we dig and poke we can re-express this as a forecast of something like a normal distribution centered at 5% and with some width which corresponds to the expected error—so there is your forecast probability distribution. But is there a particular prior here? Any specific pieces of evidence on which the analyst updated the prior? Um, not really.
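For what it’s worth, the ‘dig and poke’ re-expression is easy to make literal (the 1.5 percentage-point error width below is invented purely for illustration):

```python
from scipy import stats

# The analyst's point forecast, re-expressed as a normal distribution:
# 5% expected growth with an assumed 1.5 percentage-point standard error.
forecast = stats.norm(loc=5.0, scale=1.5)

print(forecast.interval(0.95))   # roughly 2.1% to 7.9%
print(1 - forecast.cdf(7.0))     # implied probability that growth exceeds 7% (about 0.09)
```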