As I’ve mentioned several times above, Bayesian statistics is not just a set of estimators to be applied to problems; it is the minimal framework for probability that satisfies Cox’s theorem. This means that any algorithm that isn’t even approximately Bayesian will spit out something other than (an approximation of) the posterior probability. In other words, to get any sort of answer that can reasonably be used for further computation there has to be a Bayesian explanation; otherwise what your algorithm is doing just doesn’t have anything to do with statistics. This does not mean that the only useful algorithms are those crafted by trying to compute the likelihood ratio, nor does it mean that there is always a simple algorithm that would be classified as a ‘Bayesian algorithm’. It merely means that to do probability you have to do Bayes, and then maybe some more.
Did you actually read and understand the linked example? The entire point of it is that unless you basically craft your prior to mirror the frequentist behavior, your posterior will center on the truth super slowly. And the setting is not very artificial: exposure/outcome relationships with baseline covariates are often complicated, and we often do know the randomization probabilities in trials.
Why would I want to approximate your posterior if it has this shitty behavior?
Can you elaborate on this? I don’t think that’s how most people understand Bayesian statistics.

I will give it a shot (I recall reading a well-written explanation elsewhere on LW, and I don’t expect to be as clear as what I read there).
In any estimation or prediction setting we are interested in making accurate probabilistic claims about the behaviour of our system of study. In particular we would like to give a description of how the system will behave in the future (for example: ‘this drug cures patients 30% of the time’). This is captured by the posterior probability distribution.
Now if we have any algorithm whatsoever that makes statements about our situation and reliably gives the right answers (for example: most frequentist algorithms), then that algorithm must do something to convert data into predictions, i.e. it must spit out something that can be rewritten as an (ideal) posterior distribution (which captures all our knowledge and therefore our predictions).
Now Bayes’ Law gives a formula for this posterior distribution: having observed data D, the posterior probability of any hypothesis A is P(A|D) = P(D|A)P(A)/P(D), where P(D) is often written as sum(P(D|B)P(B)) and B runs over a complete set of mutually exclusive hypotheses. In order to give any accurate results whatsoever, an algorithm has to produce results that can be interpreted as the quantity above, P(A|D). This is starting to look Bayesian.
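To make the formula concrete, here is a minimal sketch in Python of the discrete form of Bayes’ Theorem as written above; the three coin-bias hypotheses, the uniform prior, and the data are invented purely for illustration.

```python
from math import comb

def posterior(priors, likelihoods):
    """Discrete Bayes' Theorem: P(A|D) = P(D|A)P(A) / sum_B P(D|B)P(B)."""
    evidence = sum(priors[h] * likelihoods[h] for h in priors)  # this is P(D)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}

# Three mutually exclusive hypotheses about a coin's bias, equally probable a priori,
# and the observed data D = "8 heads in 10 flips".
biases = {"fair": 0.5, "biased up": 0.8, "biased down": 0.2}
priors = {h: 1 / 3 for h in biases}
likelihoods = {h: comb(10, 8) * p**8 * (1 - p)**2 for h, p in biases.items()}  # P(D|h)

print(posterior(priors, likelihoods))
# The posterior concentrates on "biased up", the hypothesis that best explains D.
```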
But now one can raise the (valid) point that even though we might want to produce the distribution above, we don’t have to explicitly use the equation above to determine it, and indeed this is what gives birth to the wide range of statistical algorithms (this, along with the fact that the formula above is usually impossible to compute exactly: there are simply too many hypotheses, so we have to make approximations anyway, and different approximations lead to different algorithms). There are two important observations about this point:
1) There are many, many theoretically possible algorithms (by which I mean computer programs, not just those used in statistics). If you were to create an algorithm at random, it would most likely not produce (approximations of) posterior distributions in a reliable and efficient way. So to produce any statistical algorithm at all you have to somehow make use of the mathematics of probability, and a good starting point is Bayes’ Theorem. Unless you know, or can come up with, a different formula for the l.h.s. (or at least a good new approximation), you have no choice but to approximate the r.h.s. (note that most of the time P(D|B) is known, so several of the terms on the r.h.s. are already known). This is part of the reason why many statistical algorithms can easily be interpreted and understood with Bayes’ Theorem: quite a few of them are derived from it. For example, in classical hypothesis testing we implicitly assume that we start with a set of hypotheses that all have exactly the same prior probability, and furthermore that some hypothesis fits the data well (P(D|B) ~ 1 for some B). Then if for our favourite hypothesis H0 we find a small p-value P(D|H0), i.e. the model did not predict the data, we find a posterior of P(H0|D) = P(D|H0) / sum(P(D|B)), which is also very small (the denominator is of order 1, the numerator is small). Viewed from a Bayesian framework it is clear that it will therefore often suffice to compute P(D|H) rather than go for P(H|D), unless the data really is inexplicable under every hypothesis we consider (P(D|B) is small for all B), or the hypotheses do not start with (almost) equal prior probability. For example, the theory of general relativity is supported by quite a bit of data, so if a single experiment were to disagree with the theory, even with an intimidatingly low p-value, the posterior probability of the theory being right might still be large (see the sketch after this list). By keeping Bayes’ Law in mind at all times we not only understand why the algorithm (p-values) works, but also exactly in which cases it breaks down, and we are even presented with a way to improve the algorithm in those cases.
2) The more important reason the formula above matters is that it is true, even if you decide to use a radically different way of summarising your information to make predictions. If an ideal Bayesian reasoner shows up, computes the posterior probability, and then proceeds to make predictions, then most of the time your algorithm and the reasoner are going to agree (since most of the time you both make valid predictions). So there has to be some part of Bayes’ Law that reflects what your algorithm does: if your algorithm works reliably, and Bayes’ Law gives a mathematical formula for what you should predict, then, whether you intended it or not, Bayes’ Law can describe your algorithm (that is to say, if your algorithm predicts ‘Hypothesis H is likely/true, now that we have seen data D’, then there must be some part of the formula for P(H|D) that makes it large, otherwise the two of you would not agree!). This is the most valuable insight into Bayesian statistics that I know of: if anybody computes anything at all that acts like a prediction or probability and reliably gets correct results, then the terms in Bayes’ Theorem must happen to line up just right to give similar predictions, so we can describe the algorithm (and maybe even improve it) by making precise which terms act how. By looking at exactly which predictions a statistical algorithm makes under which conditions, we can work out how the unknown terms on the r.h.s. of Bayes’ Theorem would have to behave to produce similar results (the predictions from our non-Bayesian algorithm were correct most of the time, so there has to be some behaviour of these terms that lets us mimic the results). This exposes the overt and hidden assumptions, and thereby the limitations, of the model; it provides understanding and more insight (I hope my example of p-values above illustrates this; I certainly learned more about why people weren’t concerned about 5-sigma disagreements with GR by studying Bayes); and it sometimes even suggests improvements. This is what I was trying to say earlier (I was lazy and tried to use fewer words; clearly that backfired) when I said things like “look at the messy real-world examples through the lens of [Bayesian statistics] and see how the tricks that are used in practice are mostly making use of, but often partly disagree with, the overarching theory”.
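As a rough numerical illustration of point 1) above (all numbers below are invented for the sake of the example, not drawn from any actual experiment), here is a sketch contrasting the two situations: hypotheses that start out equally probable, where a small P(D|H0) by itself sinks H0, versus a GR-like hypothesis whose prior is large enough to survive a single damning experiment.

```python
def posterior_h(prior_h, lik_h, prior_alt, lik_alt):
    """P(H|D) when H competes with a single alternative hypothesis."""
    return prior_h * lik_h / (prior_h * lik_h + prior_alt * lik_alt)

# Case 1: two hypotheses that are equally probable a priori, where H0 explains the
# data badly (a small, p-value-like P(D|H0)) and the alternative explains it well.
print(posterior_h(prior_h=0.5, lik_h=0.01, prior_alt=0.5, lik_alt=1.0))
# ~0.01: the posterior of H0 is roughly the small likelihood itself, which is why
# rejecting on a small P(D|H0) works under the equal-prior assumption.

# Case 2: a GR-like hypothesis with an overwhelming prior (earned from all the data
# it already explains) against a fringe rival, when one new experiment disagrees
# with it even at an impressively small likelihood.
print(posterior_h(prior_h=0.999999, lik_h=1e-4, prior_alt=1e-6, lik_alt=1.0))
# ~0.99: the posterior for the GR-like hypothesis stays large despite the small p-value.
```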
I would also like to remark that so far this post has made no mention whatsoever of computational efficiency, so it should not be surprising that actual Bayesian algorithms are often slower (or even uncomputable) compared to other options (like I said above, there are many possible algorithms; there is no reason why the correct one and the fast one should be the same). Therefore for real-world applications a non-Bayesian algorithm is often your best bet. But to study and understand probabilities and algorithms in general, the Bayesian approach is superior, all because Bayes’ Theorem is a theorem.
Lastly I would like to remark that people (including me, above) often write as if Bayesian statistics and frequentist statistics are in active conflict, which is mostly but not fully correct. Frequentist tests (and other tests; I tend to forget that there are more than two schools of statistics) are often superior in practice, and should most definitely be used in practice. But given an algorithm, some peculiar black box that gives the right answers, or given a philosophical conundrum about probabilities, the correct thing to do is to write down Bayes’ Theorem and figure out how the unknowns would have to act to give the results we are faced with, i.e. figure out how an ideal Bayesian reasoner could come to the same conclusion. And it really is a shame that this doesn’t always lead to quick algorithms in practice.
I hope this has cleared up some misunderstandings.
Thanks for the write-up. I read it as you arguing that almost any prediction can be interpreted in the Bayesian framework, which I think is a weaker claim.
However, there are issues with treating it as the only right way, because it leaves a number of important questions unanswered. For example, how do you pick the prior? How do you assemble your set of possible outcomes (= hypotheses)? What happens if your forecast influences the result?
I also think that being uncomputable is a bigger deal than you make it out to be.
I think that the claim that any prediction can be interpreted in this minimal and consistent framework without exceptions whatsoever is a rather strong claim; I don’t think I want to claim much more than that (although I do want to add that if we have such a unique framework, one that is both minimal and complete when it comes to making predictions, then it seems like a very natural choice for Statistics with a capital S).
I don’t think we’re going to agree about the importance of computability without more context. I agree that every time I try to build myself a nice Bayesian algorithm I run into the problem of uncomputability, but personally I consider Bayesian statistics to be more of a method of evaluating algorithms than a method for creating them (although Bayesian statistics is by no means limited to this!).
As for your other questions: it is important to note that these issues apply to Bayesian statistics just as much as they apply to any other form of prediction-making. To pick a frequentist algorithm is to pick a prior and a set of hypotheses, i.e. to make Bayes’ Theorem computable and provide the unknowns on the r.h.s. above (as mentioned earlier, you can in theory extract the prior and the set of hypotheses from an algorithm by considering which outcome the algorithm would give when it saw a certain set of data, and then inverting Bayes’ Theorem to find the unknowns; at least, I think this is possible, and it has worked so far). A toy version of this inversion is sketched below. And indeed picking the prior and the set of hypotheses is not an easy task; this is precisely what leads to the different competing algorithms in the field of statistics.
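To make that inversion concrete, here is a toy sketch (the two coin-bias hypotheses, the sample size, and the rejection threshold are all invented for illustration): given a frequentist rule of the form ‘decide H1 when at least t heads are seen in n flips’, we can solve for prior odds under which a Bayesian decision rule reproduces it exactly.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(k heads in n flips | bias p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lr(k, n, p0, p1):
    """Likelihood ratio P(k|H1)/P(k|H0); increasing in k when p1 > p0."""
    return binom_pmf(k, n, p1) / binom_pmf(k, n, p0)

# Toy setup: H0 says the coin has bias 0.5, H1 says bias 0.7, and the frequentist
# rule under study is "decide H1 whenever k >= 14 heads are seen in n = 20 flips".
n, threshold = 20, 14
p0, p1 = 0.5, 0.7

# Because the likelihood ratio is monotone in k, any prior odds P(H0)/P(H1) strictly
# between lr(threshold - 1) and lr(threshold) makes the Bayesian rule
# "decide H1 iff the posterior odds of H1 exceed 1" agree with the frequentist rule.
prior_odds_h0 = (lr(threshold - 1, n, p0, p1) + lr(threshold, n, p0, p1)) / 2
implied_prior_h0 = prior_odds_h0 / (1 + prior_odds_h0)

for k in range(n + 1):
    posterior_odds_h1 = lr(k, n, p0, p1) / prior_odds_h0
    assert (posterior_odds_h1 > 1) == (k >= threshold)

print(f"Implied prior P(H0) ~ {implied_prior_h0:.3f} reproduces the frequentist rule.")
```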
To pick a frequentist algorithm is to pick a prior and a set of hypotheses, i.e. to make Bayes’ Theorem computable and provide the unknowns on the r.h.s. above (as mentioned earlier, you can in theory extract the prior and the set of hypotheses from an algorithm by considering which outcome the algorithm would give when it saw a certain set of data, and then inverting Bayes’ Theorem to find the unknowns).
Okay, this is the last thing I’ll say here until/unless you engage with the Robins and Wasserman post that IlyaShpitser and I have been suggesting you look at. You can indeed pick a prior and hypotheses (and, I guess, a way to go from posterior to point estimate, e.g., MAP, posterior mean, etc.) so that your Bayesian procedure does the same thing as your non-Bayesian procedure for any realization of the data. The problem is that in the Robins-Ritov example, your prior may need to depend on the data to do this! Mechanically, this is no problem; philosophically, you’re updating on the data twice, and it’s hard to argue that doing so is unproblematic. In other situations, you may need to do other unsavory things with your prior. If the non-Bayesian procedure that works well looks like a Bayesian procedure that makes insane assumptions, why should we look to Bayesianism as a foundation for statistics?
(I may be willing to bite the bullet of poor frequentist performance in some cases for philosophical purity, but I damn well want to make sure I understand what I’m giving up. It is supremely dishonest to pretend there’s no trade-off present in this situation. And a Bayes-first education doesn’t even give you the concepts to see what you gain and what you lose by being a Bayesian.)
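For readers who have not followed the link, here is a rough, self-contained sketch of the kind of setting the Robins-Ritov example involves, as I understand it (the stratum count, sample size, and distributions below are invented, and only the frequentist side is simulated): with far more covariate strata than samples and known observation probabilities, the simple Horvitz-Thompson estimator lands near the truth after a modest number of samples, even though almost every stratum is never seen.

```python
import random

random.seed(0)

# Toy Robins-Ritov-style setup (all sizes and distributions invented for illustration):
# X is a covariate with B possible values ("strata"), theta[x] = P(Y=1 | X=x) is unknown,
# and each subject's outcome Y is observed only with a *known* probability pi[x].
B = 10**5                                          # far more strata than samples
theta = [random.random() for _ in range(B)]        # unknown P(Y=1 | X=x)
pi = [random.uniform(0.1, 0.9) for _ in range(B)]  # known observation probabilities

true_psi = sum(theta) / B                          # target: E[theta(X)] with X uniform

n = 10**4                                          # sample size, much smaller than B
ht_terms = []
for _ in range(n):
    x = random.randrange(B)                        # draw a covariate value
    y = 1 if random.random() < theta[x] else 0     # latent outcome
    r = 1 if random.random() < pi[x] else 0        # is the outcome observed?
    ht_terms.append(r * y / pi[x])                 # Horvitz-Thompson term

ht_estimate = sum(ht_terms) / n
print(f"true psi    = {true_psi:.3f}")
print(f"HT estimate = {ht_estimate:.3f}")          # close to the truth despite n << B
```

The point of the example, as I read the comments above, is that a Bayesian who puts priors over all the per-stratum parameters only matches this behaviour if the prior is chosen, rather unnaturally, to exploit the known observation probabilities, possibly in a data-dependent way.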
the claim that any prediction can be interpreted in this minimal and consistent framework without exceptions whatsoever is a rather strong claim
Bayes’ Rule by itself is not a framework. It’s just a particular statistical operation, useful no doubt, but hardly rising to the level of a framework.
The claim that you can interpret any prediction as forecasting a particular probability distribution has nothing to do with Bayes. For example, let’s say an analyst predicts the average growth of China’s GDP over the next five years to be 5%. If we dig and poke, we can re-express this as a forecast of something like a normal distribution centered at 5%, with a width that corresponds to the expected error, and there is your forecast probability distribution. But is there a particular prior here? Any specific pieces of evidence on which the analyst updated that prior? Um, not really.
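To spell out the ‘dig and poke’ step in code (the 5% centre comes from the example above; the two-percentage-point error width is an invented assumption): a point forecast plus an error bar already behaves like a full probability distribution, with no prior or explicit updating in sight.

```python
from statistics import NormalDist

# The analyst's point forecast from the example above, read as a distribution:
# centred at 5% growth, with an assumed (made-up) standard error of 2 percentage points.
forecast = NormalDist(mu=5.0, sigma=2.0)

# Once it is a distribution, it supports further probabilistic statements:
print(f"P(growth > 6%)       = {1 - forecast.cdf(6.0):.2f}")
print(f"P(growth < 0%)       = {forecast.cdf(0.0):.3f}")
print(f"central 80% interval = ({forecast.inv_cdf(0.1):.1f}%, {forecast.inv_cdf(0.9):.1f}%)")
```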