I am suspicious of the framing of the question, which doesn’t make clear which of several things it is talking about. Here’s Jacob Steinhardt’s post “Beyond Bayesians and Frequentists,” on some of the options.
In the notation of that post, I’d say I am interested mostly in the argument over “Whether a Bayesian or frequentist algorithm is better suited to solving a particular problem”, generalized over a wide range of problems. And the sort of frequentism I have in mind seems to be “frequentist guarantee”—the process of taking data and making inferences from it on some quantity of interest, and the importance to be given to guarantees on the process.
How wide a range did you have in mind? It’s certainly not the case that Bayesian methods are universally better than frequentist ones.
Examples where frequentist methods are better?
My guess is in hugely overdetermined cases where the prior gets swamped by the likelihood, and in cases where explicitly representing uncertainty is utterly intractable (as in numerical methods), but I’d like to hear it from someone who knows what they are talking about.
Also, if it’s not “Bayesian”, is there a term for the statistical methodology that is always best in all situations (in the spirit of “rationalists should win”)? It seems to me that given that Bayesianism is correct in the ideal sense, the “best” method will always just be the best approximation of the Bayesian answer (where “best” includes factors like computational simplicity).
Well, the most commonly used statistical methods are probably:
Logistic regression
Support vector machines
Principal component analysis
All of these are frequentist: logistic regression is quite explicitly computing the maximum-likelihood-estimate of a parameter vector, SVMs are minimizing a surrogate to generalization error, and PCA is a bit weird but is basically just trying to find a low-rank approximation to the data.
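To make the maximum-likelihood reading of logistic regression concrete, here is a minimal sketch in Python with NumPy. It is an illustration, not anything from the thread: the synthetic data, the function names `fit_logistic_mle` and `log_likelihood`, the step size, and the iteration count are all assumptions. It fits the parameter vector by plain gradient ascent on the log-likelihood, with no prior anywhere.

```python
import numpy as np

def fit_logistic_mle(X, y, lr=0.1, steps=2000):
    """Gradient ascent on the mean log-likelihood of logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))  # P(y = 1 | x, theta)
        grad = X.T @ (y - p) / len(y)         # gradient of the mean log-likelihood
        theta += lr * grad
    return theta

def log_likelihood(X, y, theta):
    """log P(data | theta) = sum_i [ y_i * z_i - log(1 + exp(z_i)) ], z = X theta."""
    z = X @ theta
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

# Synthetic data (hypothetical): two features, known generating parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_theta = np.array([1.5, -2.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

theta_hat = fit_logistic_mle(X, y)
print("estimate:", theta_hat)
print("log-likelihood at estimate:", log_likelihood(X, y, theta_hat))
```

The only quantity the procedure ever looks at is P(data|model); the guarantee one would cite for it is about how the estimate behaves over repeated draws of the data.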
ETA: And to answer your other question, I think that would just be called “the best method”; why would we need another name? No one is going to design a method that they think is strictly dominated by all other methods anyway...at least, not if you also take into account time to implement, which I think is an important consideration (at least in the limit, where with infinite time I can just hard-code everything as a special case).
ETA2: It’s also not clear to me that Bayesianism is correct in the ideal sense (or even what that means), or that it’s fruitful to think of what you’re doing as trying to approximate Bayes (at least not in all situations; I definitely agree that it can be helpful sometimes). I don’t know if I’ll be able to convince you of either of these here though, as this is a disagreement that Eliezer and I still have despite a 4-hour-long discussion (and of course this causes me to update in the direction of me being mistaken).
logistic regression is quite explicitly computing the maximum-likelihood-estimate of a parameter vector
So, it explicitly considers only P(data|model) and doesn’t work with a nontrivial distribution over P(model), and it’s widely used.
Suppose that there is a significant difference of P(model) across relevant models. Do you think in this case that maximizing P(model)*P(data|model) in order to get P(model|data) would be worse?
Well, there are a couple of issues here. First, log P(data|model) is a concave function of the parameters for logistic regression, so unless log P(model) is also concave, the maximization may not reach the global optimum.
Secondly, the proper Bayesian thing to do would be to sample from the posterior, not maximize. For instance, in logistic regression the model is given by a vector of parameters denoted by theta. Suppose that we actually believed that the prior on theta was proportional to exp(-|theta|), where |theta| is the sum of the absolute values of the coordinates of theta. Then maximizing P(model|data) in this case will tend to give you solutions where most of the entries of theta are exactly equal to 0, whereas the actual posterior places zero probability mass on such solutions.
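The sparsity claim can be checked numerically. The sketch below is an illustration under stated assumptions, not the commenter's own procedure: it maximizes the log-posterior under a Laplace prior using proximal gradient steps (soft-thresholding), a technique chosen here for convenience. The prior scale `lam` is hypothetical and is applied to the mean log-likelihood, which corresponds to a prior proportional to exp(-n * lam * |theta|), a much stronger prior than the unit-scale one in the text, picked so that the effect is easy to see. The data are synthetic, with only the first two of ten features actually mattering.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |v|_1: shrink toward 0, snapping small values to exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_logistic_map_laplace(X, y, lam=0.1, lr=0.1, steps=3000):
    """Maximize mean log-likelihood minus lam * |theta|_1 by proximal gradient."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - p) / len(y)  # gradient of the mean log-likelihood
        theta = soft_threshold(theta + lr * grad, lr * lam)
    return theta

# Synthetic data (hypothetical): 10 features, only the first two are informative.
rng = np.random.default_rng(1)
n, d = 500, 10
X = rng.normal(size=(n, d))
true_theta = np.zeros(d)
true_theta[:2] = [2.0, -2.0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

theta_map = fit_logistic_map_laplace(X, y)
num_exact_zeros = int(np.sum(theta_map == 0.0))
print("MAP estimate:", theta_map)
print("entries exactly equal to 0:", num_exact_zeros)
```

Because the Laplace posterior has a continuous density, any sample drawn from it would have all entries nonzero with probability 1, so the maximizer is not a typical point of the distribution it summarizes.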
On the second point: fair enough, though even under Bayes it’s sometimes reasonable to want a single answer, since you only get to actually do one thing.
If you have that prior and you maximize P(model|data) on solutions with a zero probability mass on either P(data|model) or P(model), you’re screwing up multiplication.
Well, the point is that in a continuous space, the solution that maximizes P(model|data) will have entries that are exactly zero with positive probability, but the posterior probability that an entry is exactly zero is 0.
How? If any of the probabilities that the posterior probability factors into are zero, the product is also zero. Or do you just mean that since data are unlimited precision in a continuous space, no answer can ever have a positive probability because it’s infinitely unlikely?
Can you explain in what sense PCA is frequentist? I’m not sure it even deserves to be called a statistical method except insofar as it happens to be useful in statistics.
Yeah, calling PCA frequentist may be a bit of a stretch (although it’s certainly not Bayesian). I think ICA (independent component analysis) could legitimately be called frequentist though, as it solves the blind source separation problem under certain independence assumptions (I don’t know that much about either of these though, so I could be wrong).
It’s also not clear to me that Bayesianism is correct in the ideal sense (or even what that means)
Interesting. Do you accept that by Cox’s theorems, probability theory is the normative theory of epistemology? Do you accept that a “Bayesian” method based on explicitly approximating ideal probability theory will always give a more accurate answer? Do you accept that each of the examples above works because and to the extent that it (nonexplicitly) approximates the correct probability-theory answer (the Bayes-structure argument)?
(As for how they do, we can put them in Bayesian terms to see. Maximum likelihood methods assume a flat improper prior and report the mode of the resulting probability distribution. We can immediately see that building in the prior disallows aggregation of different information sources. Only reporting the mode hides the confidence interval and goes way off in the presence of skew. Also, we can’t apply safety factors sensibly (they involve utility calculation, which involves confidence intervals at the least).)
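A small worked case shows the "flat prior plus mode" reading and the skew problem together. This is a sketch on a coin-flip model that does not appear in the thread; the function name `beta_posterior_summary` is made up for illustration, and the flat prior here is the uniform distribution on [0, 1], which happens to be proper. With k successes in n trials, the posterior is Beta(k + 1, n - k + 1), whose mode is the maximum likelihood estimate k / n and whose mean is (k + 1) / (n + 2).

```python
def beta_posterior_summary(k, n):
    """Posterior over a success probability after k successes in n > 0 trials,
    starting from a flat prior on [0, 1]. Returns (mode, mean)."""
    a, b = k + 1, n - k + 1        # posterior is Beta(a, b)
    mode = (a - 1) / (a + b - 2)   # equals k / n, the maximum likelihood estimate
    mean = a / (a + b)             # equals (k + 1) / (n + 2)
    return mode, mean

mode, mean = beta_posterior_summary(1, 10)
print("mode (MLE):", mode)
print("posterior mean:", mean)
```

With 1 success in 10 trials the posterior is skewed to the right, so the mode sits below the mean, and reporting the mode alone says nothing about how much mass lies above it.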
I don’t know much about SVMs and PCA, but Bayesian logistic regression is easy and superior to maximum likelihood for most things.
Do you accept that by Cox’s theorems, probability theory is the normative theory of epistemology?
Not Cox’s theorem, although the complete class theorem is more convincing (as are Dutch book arguments).
Do you accept that a “Bayesian” method based on explicitly approximating ideal probability theory will always give a more accurate answer?
Only in the very weak sense that by the complete class theorem there exists a Bayesian method (or a limit of Bayesian methods) that does at least as well as whatever you’re doing. So sure, if you really had infinite computational resources then you could find such a method and use it...but I think that has almost no bearing on practice. Certainly I think there are many situations where a prior is unavailable.
Do you accept that each of the examples above works because and to the extent that it (nonexplicitly) approximates the correct probability-theory answer (the Bayes-structure argument)?
Almost certainly not, although maybe we should taboo “because”. First of all, the “correct” probability-theory answer is not well-defined, because the choices of prior and likelihood are both completely unconstrained. Secondly, I think the choice of whether to be Bayesian or frequentist is not nearly as important as, e.g., the choice of likelihood function.
We can immediately see that building in the prior disallows aggregation of different information sources.
I don’t think the prior is what allows aggregation of different information sources; you can do transfer learning with vanilla logistic regression if you choose the right set of features.
Only reporting the mode hides confidence interval and goes way off in the presence of skew.
I agree with this, although “being Bayesian” is neither necessary nor sufficient to deal with it (but would probably help on average).
Bayesian logistic regression is easy and superior to maximum likelihood for most things.
What do you mean by “Bayesian logistic regression”?
Can you recommend an explanation of the complete class theorem(s)? Preferably online. I’ve been googling pretty hard and I’ve turned up almost nothing. I’d like to understand what conditions they start from (suspecting that maybe the result is not quite as strong as “Bayes Rules!”). I’ve found only one paper, which basically said “what Wald proved is extremely difficult to understand, and probably not what you wanted.”
Maybe try this one? Let me know if that helps or if you’re looking for something different.
The complete class theorem states, informally: any Pareto optimal decision rule is a Bayesian decision rule (i.e. it can be obtained by choosing some prior, observing data, and then maximizing expected utility relative to the posterior).
Roughly, the argument is that if I have a collection W of possible worlds that I could be in, and a value U(w) to taking a particular action in world w, then any Pareto optimal strategy implicitly assigns an “importance” p(w) to each world, and takes the action that maximizes the sum of p(w)*U(w). We can then show that this is equivalent to using the Bayesian decision rule with p(w) as the prior over W. The main thing needed to formalize this argument is the separating hyperplane theorem, which is what the linked paper does.
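The weighted-sum rule in that argument can be written out directly. The sketch below is only an illustration: the worlds, actions, and utility numbers are invented, the function name `bayes_action` is hypothetical, and it skips the data-observation step by treating p(w) as the weights already in hand. It picks the action that maximizes the sum of p(w) * U(action, w).

```python
def bayes_action(prior, utility):
    """Return the action maximizing sum over worlds w of prior[w] * utility[action][w].

    prior:   dict mapping world -> importance weight p(w)
    utility: dict mapping action -> dict mapping world -> U(action, w)
    """
    def expected_utility(action):
        return sum(prior[w] * utility[action][w] for w in prior)
    return max(utility, key=expected_utility)

# Hypothetical two-world, two-action problem; neither action dominates the other.
utility = {
    "umbrella":    {"rain": 1.0, "sun": 0.6},
    "no_umbrella": {"rain": 0.0, "sun": 1.0},
}

choice_even = bayes_action({"rain": 0.5, "sun": 0.5}, utility)
choice_dry = bayes_action({"rain": 0.1, "sun": 0.9}, utility)
print(choice_even, choice_dry)
```

Both actions here are Pareto optimal, and each one is the maximizer for some choice of weights, which is the direction of the theorem described above: a Pareto optimal choice can be read as Bayes-optimal for some p(w), without saying which p(w) one ought to hold.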
Does the complete class theorem thus provide what Peterson (2004) and Easwaran (unpublished) think is missing in classical axiomatic decision theory: namely, a justification for choosing a prior, observing data, and then maximizing expected utility relative to the posterior?
Well, I think there is some sense of Bayesianism as a meta-approach, without regard to specific methods, which most of us would consider healthier than the frequentist mindset.
There are surely papers showing the superiority of frequentism over Bayesianism, and papers showing the differences between various flavors of Bayesianism and various flavors of frequentism. But that’s not what I’m after right now (with the understanding that a paper can be on the “Bayesian” side and be correct).