What is the best paper explaining the superiority of Bayesianism over frequentism?
Question in title.
This is obviously subjective, but I figure there ought to be some “go-to” paper. Maybe I’ve even seen it once, but can’t find it now and I don’t know if there’s anything better.
Links to multiple papers with different focus would be welcome. For my current purpose I have a preference for one that aims low and isn’t too long.
It might help if you told us which of the thousands of varieties of Bayesianism you have in mind with your question. (I would link to I.J. Good’s letter on the 46656 Varieties of Bayesians, but the best I could come up with was the citation in Google Scholar, which does not make the actual text available.)
In terms of pure (or mostly pure) criticisms of frequentist interpretations of probability, you might look at two papers by Alan Hájek: Fifteen Arguments Against Finite Frequentism and Fifteen Arguments Against Hypothetical Frequentism.
In terms of Bayesian statistics, you might take a look at a couple of papers by Dennis Lindley: an older paper on The Present Position in Bayesian Statistics and a newer one on The Philosophy of Statistics.
Lindley gives a personalist Bayesian account. If you want “objective Bayes,” you might take a look at this paper by James Berger. (The link actually has a bunch of papers, some of them discussing Berger’s paper, which is the first in the set.)
You might also find Bradley Efron’s paper Why Isn’t Everyone a Bayesian? useful. And on that note, I’ll just say that the presupposition of your question (that Bayesianism is straightforwardly superior to frequentism in all or almost all cases) is more fraught than you might think.
Would this be I.J. Good’s letter on the 46656 Varieties of Bayesians? (I’m practicing my google-fu)
That pdf is a scan of chapters 3 and 4 of I. J. Good’s book, Good Thinking: The Foundations of Probability and Its Applications (free pdf) (Minnesota: University of Minnesota Press, 1983). Chapter 3, ’46656 varieties of Bayesians’, reprints a letter in American Statistician (December, 1971), vol. 25, pp. 62-63. This is indeed the letter which JonathanLivengood cited in his comment above.
Wow! Thanks for the Good Thinking link. Now I won’t have to scan it myself.
Yes, that’s the letter!
I am suspicious of the framing of the question, which doesn’t make clear which of several things it is talking about. Here’s Jacob Steinhardt’s post “Beyond Bayesians and Frequentists,” on some of the options.
In the notation of that post, I’d say I am interested mostly in the argument over “Whether a Bayesian or frequentist algorithm is better suited to solving a particular problem”, generalized over a wide range of problems. And the sort of frequentism I have in mind seems to be “frequentist guarantee”—the process of taking data and making inferences from it on some quantity of interest, and the importance to be given to guarantees on the process.
How wide a range did you have in mind? It’s certainly not the case that Bayesian methods are universally better than frequentist ones.
Examples where frequentist methods are better?
My guess is in hugely overdetermined cases where prior gets swamped by likelihood, and in cases where explicitly representing uncertainty is utterly intractable (like numerical methods), but I’d like to hear it from someone who knows what they are talking about.
Also, if it’s not “Bayesian”, is there a term for the statistical methodology that is always best in all situations (in the spirit of “rationalists should win”)? It seems to me that given that Bayesianism is correct in the ideal sense, the “best” method will always just be the best approximation of the Bayesian answer (where “best” includes factors like computational simplicity).
Well, the most commonly used statistical methods are probably:
Logistic regression
Support vector machines
Principal component analysis
All of these are frequentist: logistic regression is quite explicitly computing the maximum-likelihood-estimate of a parameter vector, SVMs are minimizing a surrogate to generalization error, and PCA is a bit weird but is basically just trying to find a low-rank approximation to the data.
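To make the "logistic regression is maximum likelihood" point concrete, here is a minimal sketch in plain Python. It fits a one-feature logistic regression by gradient ascent on the log-likelihood, with no prior anywhere. The toy data, the learning rate, and the helper names (`fit_mle`, `log_likelihood`) are all made up for illustration; real implementations use better optimizers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, b, xs, ys):
    """log P(data | model) for the logistic model p(y=1 | x) = sigmoid(w*x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_mle(xs, ys, lr=0.1, steps=2000):
    """Plain gradient ascent on the (concave) average log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w += lr * grad_w
        b += lr * grad_b
    return w, b

# Hypothetical, deliberately non-separable toy data.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w_hat, b_hat = fit_mle(xs, ys)
ll_start = log_likelihood(0.0, 0.0, xs, ys)
ll_fit = log_likelihood(w_hat, b_hat, xs, ys)
print("w, b:", w_hat, b_hat)
print("log-likelihood at start vs. fit:", ll_start, ll_fit)
```

The only quantity the procedure ever looks at is P(data|model); adding a log-prior term to the objective would turn it into a MAP estimate instead.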
ETA: And to answer your other question, I think that would just be called “the best method”; why would we need another name? No one is going to design a method that they think is strictly dominated by all other methods anyway...at least, not if you also take into account time to implement, which I think is an important consideration (at least in the limit, where with infinite time I can just hard-code everything as a special case).
ETA2: It’s also not clear to me that Bayesianism is correct in the ideal sense (or even what that means), or that it’s fruitful to think of what you’re doing as trying to approximate Bayes (at least not in all situations; I definitely agree that it can be helpful sometimes). I don’t know if I’ll be able to convince you of either of these here though, as this is a disagreement that Eliezer and I still have despite a 4-hour-long discussion (and of course this causes me to update in the direction of me being mistaken).
So, it explicitly considers only P(data|model) and doesn’t work with a nontrivial distribution over P(model), and it’s widely used.
Suppose that there is a significant difference of P(model) across relevant models. Do you think in this case that maximizing P(model)*P(data|model) in order to get P(model|data) would be worse?
Well, there’s a couple of issues here: first, logP(data|model) is a concave function for logistic regression, so unless logP(model) is also concave, the maximization may not reach the global optimum.
Secondly, the proper Bayesian thing to do would be to sample from the posterior, not maximize; for instance, in logistic regression the model is given by a vector of parameters denoted by theta. Suppose that we actually believed that the prior on theta was exp(-|theta|), where |theta| is the sum of the absolute values of the coordinates of theta. Then maximizing P(model|data) in this case will tend to give you solutions where most of the entries of theta are equal to 0, whereas the actual posterior places zero probability mass on such solutions.
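A small sketch of the exp(-|theta|) point, using a simpler model than logistic regression so that everything has a closed form. Assume (purely for illustration) a single parameter theta, one observation y ~ Normal(theta, 1), and a prior proportional to exp(-lam*|theta|). The maximizer of the posterior is then the soft-thresholding rule, which returns exactly 0 whenever |y| <= lam, while the posterior itself is a continuous density and so puts vanishing mass on any shrinking neighbourhood of 0. The function names and the numbers are hypothetical.

```python
import math

def map_estimate(y, lam):
    """Maximizer of -0.5*(y - theta)**2 - lam*abs(theta): soft thresholding."""
    return math.copysign(max(abs(y) - lam, 0.0), y)

def unnormalized_posterior(theta, y, lam):
    return math.exp(-0.5 * (y - theta) ** 2 - lam * abs(theta))

def integrate(f, a, b, n=20000):
    """Midpoint Riemann sum; crude but adequate for this smooth-ish integrand."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

y, lam = 0.4, 1.0
density = lambda t: unnormalized_posterior(t, y, lam)
normalizer = integrate(density, -12.0, 12.0)

theta_map = map_estimate(y, lam)
# Posterior probability that theta lies within eps of zero, for shrinking eps.
masses = [integrate(density, -eps, eps) / normalizer for eps in (0.1, 0.01, 0.001)]

print("MAP estimate:", theta_map)
print("posterior mass near zero:", masses)
```

So the optimizer lands on a point that the posterior regards as having probability zero, which is the sense in which maximizing and sampling give qualitatively different answers here.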
On the second point—fair enough, though even under Bayes it’s sometimes reasonable to want a single answer on account of you only get to actually do one thing.
If you have that prior and you maximize P(model|data) on solutions with a zero probability mass on either P(data|model) or P(model), you’re screwing up multiplication.
Well, the point is that if you have a continuous space, then the maximum a posteriori (MAP) solution will have zero entries with positive probability, but the posterior probability of a zero entry is 0.
How? If any of the probabilities that the posterior probability factors into are zero, the product is also zero. Or do you just mean that since data are unlimited precision in a continuous space, no answer can ever have a positive probability because it’s infinitely unlikely?
Can you explain in what sense PCA is frequentist? I’m not sure it even deserves to be called a statistical method except insofar as it happens to be useful in statistics.
Yeah, calling PCA frequentist may be a bit of a stretch (although it’s certainly not Bayesian). I think ICA (independent component analysis) could legitimately be called frequentist though, as it solves the blind source separation problem under certain independence assumptions (I don’t know that much about either of these though, so I could be wrong).
Interesting. Do you accept that by Cox’s theorems, probability theory is the normative theory of epistemology? Do you accept that a “Bayesian” method based on explicitly approximating ideal probability theory will always give a more accurate answer? Do you accept that each of the examples above works because and to the extent that it (nonexplicitly) approximates the correct probability-theory answer (the Bayes-structure argument)?
(As for how they do, we can put them in Bayesian terms to see. Maximum likelihood methods assume a flat improper prior and report the mode of the resulting probability distribution. We can immediately see that building in the prior disallows aggregation of different information sources. Reporting only the mode hides the confidence interval and goes way off in the presence of skew. Also, we can’t apply safety factors sensibly (they involve a utility calculation, which involves confidence intervals at the least).)
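A toy illustration of the "flat prior, report the mode" reading, using a coin rather than any of the methods above. Assume k heads in n flips and a uniform prior on the heads probability, so the posterior is Beta(k+1, n-k+1). Its mode equals the maximum likelihood estimate k/n, but with skewed data the mode can sit far from the bulk of the posterior. The function names and the numbers (0 heads in 5 flips, a threshold of 0.2) are just an example.

```python
from fractions import Fraction

def mle(k, n):
    """Maximum likelihood estimate of the heads probability."""
    return Fraction(k, n)

def posterior_mode_flat_prior(k, n):
    """Mode of Beta(k+1, n-k+1), the posterior under a uniform prior."""
    return Fraction(k, n)

def posterior_mean_flat_prior(k, n):
    """Mean of Beta(k+1, n-k+1)."""
    return Fraction(k + 1, n + 2)

def tail_prob_zero_successes(n, t):
    """P(p > t | 0 heads in n flips, uniform prior) = (1 - t)**(n + 1)."""
    return (1 - t) ** (n + 1)

k, n = 0, 5
print("MLE / posterior mode:", mle(k, n), posterior_mode_flat_prior(k, n))
print("posterior mean:", posterior_mean_flat_prior(k, n))
print("P(p > 1/5 | data):", float(tail_prob_zero_successes(n, Fraction(1, 5))))
```

Here the point estimate is exactly 0, yet the same posterior gives roughly a one-in-four chance that the heads probability exceeds 0.2, which is the kind of information a mode-only report throws away.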
I don’t know much about SVM and PCA, but Bayesian logistic regression is easy and superior to maximum likelihood for most things.
Not Cox’s theorem, although the complete class theorem is more convincing (as are Dutch book arguments).
Only in the very weak sense that by the complete class theorem there exists a Bayesian method (or a limit of Bayesian methods) that does at least as well as whatever you’re doing. So sure, if you really had infinite computational resources then you could find such a method and use it...but I think that has almost no bearing on practice. Certainly I think there are many situations where a prior is unavailable.
Almost certainly not, although maybe we should taboo “because”. First of all, the “correct” probability-theory answer is not well-defined, because the choices of prior and likelihood are both completely unconstrained. Secondly, I think the choice of whether to be Bayesian or frequentist is not nearly as important as e.g. the choice of likelihood function.
I don’t think the prior is what allows aggregation of different information sources; you can do transfer learning with vanilla logistic regression if you choose the right set of features.
I agree with this although “being Bayesian” is neither necessary nor sufficient to deal with this (but would probably help on average).
What do you mean by “Bayesian logistic regression”?
Can you recommend an explanation of the complete class theorem(s)? Preferably online. I’ve been googling pretty hard and I’ve turned up almost nothing. I’d like to understand what conditions they start from (suspecting that maybe the result is not quite as strong as “Bayes Rules!”). I’ve found only one paper, which basically said “what Wald proved is extremely difficult to understand, and probably not what you wanted.”
Thank you very much!
Maybe try this one? Let me know if that helps or if you’re looking for something different.
The complete class theorem states, informally: any Pareto optimal decision rule is a Bayesian decision rule (i.e. it can be obtained by choosing some prior, observing data, and then maximizing expected utility relative to the posterior).
Roughly, the argument is that if I have a collection W of possible worlds that I could be in, and a value U(w) to taking a particular action in world w, then any Pareto optimal strategy implicitly assigns an “importance” p(w) to each world, and takes the action that maximizes the sum of p(w)*U(w). We can then show that this is equivalent to using the Bayesian decision rule with p(w) as the prior over W. The main thing needed to formalize this argument is the separating hyperplane theorem, which is what the linked paper does.
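The importance-weight picture can be checked by brute force on a tiny example. The sketch below assumes two possible worlds and four pure actions with made-up utilities, and scans over priors p(w) to see which actions ever maximize the sum of p(w)*U(w). The numbers are chosen so that every Pareto optimal action is the best response to some prior; with finitely many pure actions that is not guaranteed in general (the theorem needs randomized rules, i.e. a convex set of achievable utility vectors), so treat this as an illustration rather than a proof.

```python
# Hypothetical utilities: action -> (U in world 0, U in world 1).
utilities = {
    "A": (1.0, 0.0),
    "B": (0.0, 1.0),
    "C": (0.7, 0.7),
    "D": (0.3, 0.3),  # worse than C in both worlds
}

def is_dominated(name):
    """True if some other action is at least as good everywhere and better somewhere."""
    u = utilities[name]
    for other, v in utilities.items():
        if other == name:
            continue
        if all(vi >= ui for vi, ui in zip(v, u)) and any(vi > ui for vi, ui in zip(v, u)):
            return True
    return False

def bayes_action(p0):
    """Action maximizing expected utility under the prior (p0, 1 - p0)."""
    return max(utilities, key=lambda a: p0 * utilities[a][0] + (1 - p0) * utilities[a][1])

pareto_optimal = {a for a in utilities if not is_dominated(a)}
bayes_winners = {bayes_action(i / 100) for i in range(101)}

print("Pareto optimal actions:", sorted(pareto_optimal))
print("Actions optimal for some prior:", sorted(bayes_winners))
```

In this example A wins when world 0 gets most of the weight, B when world 1 does, and C in between; the dominated action D is never chosen under any prior.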
Does the complete class theorem thus provide what Peterson (2004) and Easwaran (unpublished) think is missing in classical axiomatic decision theory: namely, a justification for choosing a prior, observing data, and then maximizing expected utility relative to the posterior?
Well, I think there is some sense of Bayesianism as a meta-approach, without regard to specific methods, which most of us would consider healthier than the frequentist mindset.
There are surely papers showing the superiority of frequentism over Bayesianism, and papers showing the differences between various flavors of Bayesianism and various flavors of frequentism. But that’s not what I’m after right now (with the understanding that a paper can be on the “Bayesian” side and be correct).
E.T. Jaynes’ book is a pretty good resource: here are the first 3 chapters as pdf. Also see Savage’s alternate axiomatization, which replaces the direct assumption of real numbers. The point from there is just that since the desiderata uniquely specify these rules, anything that bends the rules violates a desideratum. This leads to bad stuff, a claim which Dutch book arguments are used to support.
Obviously you can’t always do the axiomatically correct thing and enumerate every single hypothesis (out to infinite length, but weighted inverse exponentially) to see which one is best supported by the data, so in the real world you approximate. Physical impossibility is the main flaw of using the complete Bayesian process :P
For the specific case of an artificial intelligence, Bayesianism is pretty much necessary because it talks about degrees of belief and probabilistic logic where frequentism doesn’t. If you instead need a mathematical tool to give you a feeling for how significant your results are, then frequentist and Bayesian textbooks will be pretty much equivalent. Though perhaps the Bayesian textbook will have better warnings about the problems with p-value significance tests.
Note that instrumentally, there is little difference, according to Jaynes himself:
[...]
By “technical issues” I think that Jaynes means the measure theory basis of probability, not the everyday applications of statistics.
EDIT: And he’s comparing Kolmogorov’s definition of probability (which starts with a set of possible outcomes and then treats events as subsets) to his own view (IIRC he thinks you can calculate with events directly and need no underlying set). He’s not comparing Frequentism and Bayesianism.
Not directly, no. But you would be hard-pressed to call Kolmogorov a Bayesian, at least when discussing axiomatic probability theory and not complexity/minimum description length, given that he never talked about “the degree of belief”, only about axioms of probability. Yet his results match those of Jaynes (the other way around, really).
Further discussion on Kolmogorov and Bayes is in this paper posted on Luke’s old blog. For example:
For those curious, these quotes come from Probability Theory: The Logic of Science, p.xxi under the header “Foundations.”
In my opinion, this would be better suited for the Open Thread.
Would it? Maybe the question (in its current form) isn’t good, but I think there are good answers for it. Those answers should be prominently searchable.
It’s a short question, but it’s of sufficient local jargon importance to rate a discussion post IMO.
http://www.wikipedia.org/
Don’t get lost 8D