What is the best paper explaining the superiority of Bayesianism over frequentism?

Meni_Rosenfeld 1 Jan 2013 20:58 UTC

−6 points

Question in title.

This is obviously subjective, but I figure there ought to be some “go-to” paper. Maybe I’ve even seen it once, but can’t find it now and I don’t know if there’s anything better.

Links to multiple papers with different focus would be welcome. For my current purpose I have a preference for one that aims low and isn’t too long.

JonathanLivengood 2 Jan 2013 3:34 UTC
19 points
0
It might help if you told us which of the thousands of varieties of Bayesianism you have in mind with your question. (I would link to I.J. Good’s letter on the 46656 Varieties of Bayesians, but the best I could come up with was the citation in Google Scholar, which does not make the actual text available.)

In terms of pure (or mostly pure) criticisms of frequentist interpretations of probability, you might look at two papers by Alan Hajek: fifteen arguments against finite frequentism and fifteen arguments against hypothetical frequentism.

In terms of Bayesian statistics, you might take a look at a couple of papers by Dennis Lindley: an older paper on The Present Position in Bayesian Statistics and a newer one on The Philosophy of Statistics.

Lindley gives a personalist Bayesian account. If you want “objective Bayes,” you might take a look at this paper by James Berger. (The link actually has a bunch of papers, some of them discussing Berger’s paper, which is the first in the set.)

You might also find Bradley Efron’s paper Why Isn’t Everyone a Bayesian? useful. And on that note, I’ll just say that the presupposition of your question (that Bayesianism is straightforwardly superior to frequentism in all or most all cases) is more fraught than you might think.
- Axel 2 Jan 2013 14:09 UTC
  3 points
  0
  Parent
  Would this be I.J. Good’s letter on the 46656 Varieties of Bayesians? (I’m practicing my google-fu)
  - Pablo 2 Jan 2013 14:47 UTC
    6 points
    0
    Parent
    That pdf is a scan of chapters 3 and 4 of I. J. Good’s book, Good Thinking: The Foundations of Probability and Its Applications (free pdf) (Minnesota: University of Minnesota Press, 1983). Chapter 3, ’46656 varieties of Bayesians’, reprints a letter in American Statistician (December, 1971), vol. 25, pp. 62-63. This is indeed the letter which JonathanLivengood cited in his comment above.
    - JonathanLivengood 2 Jan 2013 18:32 UTC
      3 points
      0
      Parent
      Wow! Thanks for the Good Thinking link. Now I won’t have to scan it myself.
  - JonathanLivengood 2 Jan 2013 18:30 UTC
    2 points
    0
    Parent
    Yes, that’s the letter!
CarlShulman 1 Jan 2013 21:18 UTC
14 points
0
I am suspicious of the framing of the question, which doesn’t make clear which of several things it is talking about. Here’s Jacob Steinhardt’s post “Beyond Bayesians and Frequentists,” on some of the options.
- Meni_Rosenfeld 1 Jan 2013 21:56 UTC
  9 points
  0
  Parent
  In the notation of that post, I’d say I am interested mostly in the argument over “Whether a Bayesian or frequentist algorithm is better suited to solving a particular problem”, generalized over a wide range of problems. And the sort of frequentism I have in mind seems to be “frequentist guarantee”—the process of taking data and making inferences from it on some quantity of interest, and the importance to be given to guarantees on the process.
  - jsteinhardt 2 Jan 2013 3:16 UTC
    1 point
    0
    Parent
    How wide a range did you have in mind? It’s certainly not the case that Bayesian methods are universally better than frequentist ones.
    - [deleted] 2 Jan 2013 3:34 UTC
      3 points
      0
      Parent
      
      It’s certainly not the case that Bayesian methods are universally better than frequentist ones.
      
      Examples where frequentist methods are better?
      
      My guess is in hugely overdetermined cases where prior gets swamped by likelihood, and in cases where explicitly representing uncertainty is utterly intractable (like numerical methods), but I’d like to hear it from someone who knows what they are talking about.
      
      Also, if it’s not “Bayesian”, is there a term for the statistical methodology that is always best in all situations (in the spirit of “rationalists should win”)? It seems to me that given that Bayesianism is correct in the ideal sense, the “best” method will always just be the best approximation of the Bayesian answer (where “best” includes factors like computational simplicity).
      - jsteinhardt 2 Jan 2013 9:33 UTC
        1 point
        0
        Parent
        Well, the most commonly used statistical methods are probably:
        
        Logistic regression
        Support vector machines
        Principal component analysis
        
        All of these are frequentist: logistic regression is quite explicitly computing the maximum-likelihood-estimate of a parameter vector, SVMs are minimizing a surrogate to generalization error, and PCA is a bit weird but is basically just trying to find a low-rank approximation to the data.
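
A minimal sketch of the maximum-likelihood reading of logistic regression described above, assuming a small made-up data set and plain gradient ascent. The names `fit_mle` and `log_likelihood` are hypothetical, and real libraries use faster solvers; the point is only that the fit maximizes P(data|model) with no prior involved.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(theta, xs, ys):
    # log P(data | theta) for labels y in {0, 1}.
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_mle(xs, ys, steps=2000, lr=0.1):
    # Gradient ascent on the concave log-likelihood; no prior term anywhere.
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for j, xi in enumerate(x):
                grad[j] += (y - p) * xi
        theta = [t + lr * g / len(xs) for t, g in zip(theta, grad)]
    return theta

# Toy, non-separable data (an assumption): each row is (bias term, feature).
xs = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 0.5), (1.0, 1.0), (1.0, 2.0)]
ys = [0, 0, 1, 1, 0, 1]
theta_hat = fit_mle(xs, ys)
print(theta_hat, log_likelihood(theta_hat, xs, ys))
```

The data are deliberately not linearly separable; with separable data the likelihood has no finite maximizer and the parameters would grow without bound.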
        
        ETA: And to answer your other question, I think that would just be called “the best method”; why would we need another name? No one is going to design a method that they think is strictly dominated by all other methods anyway...at least, not if you also take into account time to implement, which I think is an important consideration (at least in the limit, where with infinite time I can just hard-code everything as a special case).
        
        ETA2: It’s also not clear to me that Bayesianism is correct in the ideal sense (or even what that means), or that it’s fruitful to think of what you’re doing as trying to approximate Bayes (at least not in all situations; I definitely agree that it can be helpful sometimes). I don’t know if I’ll be able to convince you of either of these here though, as this is a disagreement that Eliezer and I still have despite a 4-hour-long discussion (and of course this causes me to update in the direction of me being mistaken).
        Luke_A_Somers 2 Jan 2013 17:20 UTC
        2 points
        0
        Parent
        
        logistic regression is quite explicitly computing the maximum-likelihood-estimate of a parameter vector
        
        So, it explicitly considers only P(data|model) and doesn’t work with a nontrivial distribution over P(model), and it’s widely used.
        
        Suppose that there is a significant difference of P(model) across relevant models. Do you think in this case that maximizing P(model)*P(data|model) in order to get P(model|data) would be worse?
        jsteinhardt 2 Jan 2013 19:42 UTC
        1 point
        0
        Parent
        Well, there’s a couple of issues here: first, logP(data|model) is a concave function for logistic regression, so unless logP(model) is also concave, the maximization may not reach the global optimum.
        
        Secondly, the proper Bayesian thing to do would be to sample from the posterior, not maximize; for instance, in logistic regression the model is given by a vector of parameters denoted by theta. Suppose that we actually believed that the prior on theta was exp(-|theta|), where |theta| is the sum of the absolute values of the coordinates of theta. Then maximizing P(model|data) in this case will tend to give you solutions where most of the entries of theta are equal to 0, whereas the actual posterior places zero probability mass on such solutions.
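
A rough one-dimensional illustration of the gap between maximizing and sampling under the exp(-|theta|) style prior. To get a closed form, this sketch swaps in a Gaussian likelihood with unit variance instead of the logistic one (an assumption, as are the data, the prior scale `lam`, and all function names). Under that assumption the maximizer is a soft-thresholded sample mean and can be exactly 0, while the posterior mass near 0 shrinks with the width of the interval.

```python
import math

def soft_threshold(z, t):
    # Closed-form posterior maximizer for Gaussian noise plus Laplace prior.
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def log_post(theta, ys, lam):
    # Unnormalized log posterior: Gaussian log-likelihood + Laplace log-prior.
    return -0.5 * sum((y - theta) ** 2 for y in ys) - lam * abs(theta)

def posterior_mass(lo, hi, ys, lam, grid=20001, span=10.0):
    # Midpoint-rule estimate of P(lo <= theta <= hi | data).
    def integral(a, b):
        h = (b - a) / grid
        return h * sum(
            math.exp(log_post(a + (i + 0.5) * h, ys, lam)) for i in range(grid)
        )
    return integral(lo, hi) / integral(-span, span)

ys = [0.3, -0.2, 0.4, 0.1]   # made-up observations
lam = 2.0                    # made-up prior scale

# Maximizing the posterior: threshold is lam / n, applied to the sample mean.
map_theta = soft_threshold(sum(ys) / len(ys), lam / len(ys))

# Posterior mass in ever narrower intervals around 0.
mass_wide = posterior_mass(-0.1, 0.1, ys, lam)
mass_narrow = posterior_mass(-0.001, 0.001, ys, lam)
print(map_theta, mass_wide, mass_narrow)
```

Here the maximizer lands exactly on 0 because the sample mean is smaller than the threshold, yet the posterior is a continuous density, so the single point theta = 0 carries no probability mass.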
        Luke_A_Somers 2 Jan 2013 22:13 UTC
        0 points
        0
        Parent
        On the second point—fair enough, though even under Bayes it’s sometimes reasonable to want a single answer on account of you only get to actually do one thing.
        
        If you have that prior and you maximize P(model|data) on solutions with a zero probability mass on either P(data|model) or P(model), you’re screwing up multiplication.
        jsteinhardt 2 Jan 2013 22:46 UTC
        0 points
        0
        Parent
        Well, the point is that in a continuous space, the maximum a posteriori (MAP) solution will have zero entries with positive probability, but the posterior probability of a zero entry is 0.
        Luke_A_Somers 3 Jan 2013 15:27 UTC
        0 points
        0
        Parent
        How? If any of the probabilities that the posterior probability factors into are zero, the product is also zero. Or do you just mean that since data are unlimited precision in a continuous space, no answer can ever have a positive probability because it’s infinitely unlikely?
        Qiaochu_Yuan 2 Jan 2013 9:53 UTC
        2 points
        0
        Parent
        Can you explain in what sense PCA is frequentist? I’m not sure it even deserves to be called a statistical method except insofar as it happens to be useful in statistics.
        jsteinhardt 2 Jan 2013 19:48 UTC
        0 points
        0
        Parent
        Yeah, calling PCA frequentist may be a bit of a stretch (although it’s certainly not Bayesian). I think ICA (independent components analysis) could legitimately be called frequentist though, as it solves the blind source separation problem under certain independence assumptions (I don’t know that much about either of these though, so I could be wrong).
        [deleted] 2 Jan 2013 22:17 UTC
        0 points
        0
        Parent
        
        It’s also not clear to me that Bayesianism is correct in the ideal sense (or even what that means)
        
        Interesting. Do you accept that by Cox’s theorems, probability theory is the normative theory of epistemology? Do you accept that a “bayesian” method based on explicitly approximating ideal probability theory will always give a more accurate answer? Do you accept that each of the examples above works because and to the extent that it (nonexplicitly) approximates the correct probability-theory answer (the bayes-structure argument)?
        
        (As for how they do, we can put them in bayesian terms to see. Maximum likelihood methods assume a flat improper prior, and report the mode of the resulting probability distribution. We can immediately see that building in the prior disallows aggregation of different information sources. Only reporting the mode hides confidence interval and goes way off in the presence of skew. Also, we can’t apply safety factors sensibly (they involve utility calculation, which involves confidence intervals at the least).)
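
A small sketch of the "flat prior, report the mode" reading, assuming a coin-flip model with a uniform Beta(1, 1) prior (the data and function names are made up for illustration). With 0 heads in 3 flips the maximum-likelihood estimate coincides with the posterior mode, while the posterior mean differs because the posterior is skewed.

```python
def mle(heads, n):
    # Maximum-likelihood estimate of the heads probability.
    return heads / n

def beta_posterior(heads, n, a=1.0, b=1.0):
    # Beta(a, b) prior + binomial likelihood -> Beta(a + heads, b + tails).
    return a + heads, b + (n - heads)

def beta_mean(a, b):
    return a / (a + b)

def beta_mode(a, b):
    # Valid when a >= 1, b >= 1 and a + b > 2.
    return (a - 1.0) / (a + b - 2.0)

heads, n = 0, 3
a_post, b_post = beta_posterior(heads, n)   # flat prior: Beta(1, 1)
estimate_mle = mle(heads, n)
post_mode = beta_mode(a_post, b_post)
post_mean = beta_mean(a_post, b_post)
print(estimate_mle, post_mode, post_mean)
```

The mode says heads is impossible after three tails, whereas the posterior mean still leaves room for heads; which summary is appropriate depends on the decision being made.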
        
        I don’t know much about SVM and PCA, but bayesian logistic regression is easy and superior to max likelihood for most things.
        jsteinhardt 2 Jan 2013 23:02 UTC
        1 point
        0
        Parent
        
        Do you accept that by Cox’s theorems, probability theory is the normative theory of epistemology?
        
        Not Cox’s theorem, although the complete class theorem is more convincing (as well as dutch book arguments).
        
        Do you accept that a “bayesian” method based on explicitly approximating ideal probability theory will always give a more accurate answer?
        
        Only in the very weak sense that by the complete class theorem there exists a Bayesian method (or a limit of Bayesian methods) that does at least as well as whatever you’re doing. So sure, if you really had infinite computational resources then you could find such a method and use it...but I think that has almost no bearing on practice. Certainly I think there are many situations where a prior is unavailable.
        
        Do you accept that each of the examples above works because and to the extent that it (nonexplicitly) approximates the correct probability-theory answer (the bayes-structure argument)?
        
        Almost certainly not, although maybe we should taboo “because”. First of all, the “correct” probability-theory answer is not well-defined because the choices of prior and likelihood are both completely unconstrained. Secondly, I think the choice of whether to be Bayesian or frequentist is not nearly as important as e.g. the choice of likelihood function.
        
        We can immediately see that building in the prior disallows aggregation of different information sources.
        
        I don’t think the prior is what allows aggregation of different information sources; you can do transfer learning with vanilla logistic regression if you choose the right set of features.
        
        Only reporting the mode hides confidence interval and goes way off in the presence of skew.
        
        I agree with this although “being Bayesian” is neither necessary nor sufficient to deal with this (but would probably help on average).
        
        Bayesian logistic regression is easy and superior to max likelihood for most things.
        
        What do you mean by “Bayesian logistic regression”?
        David_Chapman 29 Aug 2013 2:23 UTC
        2 points
        0
        Parent
        Can you recommend an explanation of the complete class theorem(s)? Preferably online. I’ve been googling pretty hard and I’ve turned up almost nothing. I’d like to understand what conditions they start from (suspecting that maybe the result is not quite as strong as “Bayes Rules!”). I’ve found only one paper, which basically said “what Wald proved is extremely difficult to understand, and probably not what you wanted.”
        
        Thank you very much!
        jsteinhardt 2 Sep 2013 1:04 UTC
        5 points
        0
        Parent
        Maybe try this one? Let me know if that helps or if you’re looking for something different.
        
        The complete class theorem states, informally: any Pareto optimal decision rule is a Bayesian decision rule (i.e. it can be obtained by choosing some prior, observing data, and then maximizing expected utility relative to the posterior).
        
        Roughly, the argument is that if I have a collection W of possible worlds that I could be in, and a value U(w) to taking a particular action in world w, then any Pareto optimal strategy implicitly assigns an “importance” p(w) to each world, and takes the action that maximizes the sum of p(w)*U(w). We can then show that this is equivalent to using the Bayesian decision rule with p(w) as the prior over W. The main thing needed to formalize this argument is the separating hyperplane theorem, which is what the linked paper does.
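
A toy illustration of the informal statement, not a proof: two possible worlds, four actions with made-up utilities, and a sweep over priors p(w). All names and numbers are assumptions. In this example every Pareto optimal action maximizes the sum of p(w)*U(w) for some prior, and the dominated action is never chosen. In general the theorem needs randomized rules or limits of Bayes rules; the utilities here were picked so that pure actions suffice.

```python
# Hypothetical utilities: U[action] = (utility in world w1, utility in world w2).
U = {
    "A": (4.0, 0.0),
    "B": (0.0, 4.0),
    "C": (3.0, 3.0),
    "D": (1.0, 1.0),  # worse than C in both worlds
}

def dominated(a, utilities):
    # True if another action is at least as good everywhere and better somewhere.
    for b in utilities:
        if b == a:
            continue
        pairs = list(zip(utilities[a], utilities[b]))
        if all(ub >= ua for ua, ub in pairs) and any(ub > ua for ua, ub in pairs):
            return True
    return False

def bayes_action(p1, utilities):
    # Action maximizing expected utility under the prior (p1, 1 - p1).
    return max(
        utilities,
        key=lambda a: p1 * utilities[a][0] + (1.0 - p1) * utilities[a][1],
    )

pareto = {a for a in U if not dominated(a, U)}
chosen = {bayes_action(i / 100.0, U) for i in range(1, 100)}
print(sorted(pareto), sorted(chosen))
```

Sweeping the prior traces out the Pareto frontier: priors that weight w1 heavily pick A, priors that weight w2 heavily pick B, and intermediate priors pick C.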
        lukeprog 15 Sep 2013 1:49 UTC
        2 points
        0
        Parent
        Does the complete class theorem thus provide what Peterson (2004) and Easwaran (unpublished) think is missing in classical axiomatic decision theory: namely, a justification for choosing a prior, observing data, and then maximizing expected utility relative to the posterior?
    - Meni_Rosenfeld 2 Jan 2013 6:39 UTC
      0 points
      0
      Parent
      Well, I think there is some sense of Bayesianism as a meta-approach, without regard to specific methods, which most of us would consider healthier than the frequentist mindset.
      
      There are surely papers showing the superiority of frequentism over Bayesianism, and papers showing the differences between various flavors of Bayesianism and various flavors of frequentism. But that’s not what I’m after right now (with the understanding that a paper can be on the “Bayesian” side and be correct).
Manfred 2 Jan 2013 0:25 UTC
11 points
0
E.T. Jaynes’ book is a pretty good resource: here are the first 3 chapters as pdf. Also see Savage’s alternate axiomatization, which replaces the direct assumption of real numbers. The point from there is just that since the desiderata uniquely specify these rules, anything that bends the rules violates a desideratum. This leads to bad stuff, a claim which Dutch book arguments are used to support.

Obviously you can’t always do the axiomatically correct thing and enumerate every single hypothesis (out to infinite length, but weighted inverse exponentially) to see which one is best supported by the data, so in the real world you approximate. Physical impossibility is the main flaw of using the complete Bayesian process :P

For the specific case of an artificial intelligence, Bayesianism is pretty much necessary because it talks about degrees of belief and probabilistic logic where frequentism doesn’t. If you instead need a mathematical tool to give you a feeling for how significant your results are, then frequentist and Bayesian textbooks will be pretty much equivalent. Though perhaps the Bayesian textbook will have better warnings about the problems with p-value significance tests.
Shmi 2 Jan 2013 7:52 UTC
2 points
0
Note that instrumentally, there is little difference, according to Jaynes himself:

For example our system of probability could hardly, in style, philosophy, and purpose, be more different from that of Kolmogorov.

[...]

Yet when all is said and done we find ourselves, to our own surprise, in agreement with Kolmogorov and in disagreement with his critics, on nearly all technical issues.
- Oscar_Cunningham 2 Jan 2013 14:42 UTC
  10 points
  0
  Parent
  By “technical issues” I think that Jaynes means the measure theory basis of probability, not the everyday applications of statistics.
  
  EDIT: And he’s comparing Kolmogorov’s definition of probability (which starts with a set of possible outcomes and then treats events as subsets) to his own view (IIRC he thinks you can calculate with events directly and need no underlying set). He’s not comparing Frequentism and Bayesianism.
  - Shmi 2 Jan 2013 19:14 UTC
    2 points
    0
    Parent
    
    He’s not comparing Frequentism and Bayesianism.
    
    Not directly, no. But you would be hard-pressed to call Kolmogorov a Bayesian, at least when discussing axiomatic probability theory and not complexity/minimum description length, given that he never talked about “the degree of belief”, only about axioms of probability. Yet his results match those of Jaynes (the other way around, really).
    
    Further discussion on Kolmogorov and Bayes is in this paper posted on Luke’s old blog. For example:
    
    For an agent to be perfectly rational, her degrees of belief must obey the axioms of probability theory.
- [deleted] 2 Jan 2013 13:08 UTC
  0 points
  0
  Parent
  For those curious, these quotes come from Probability Theory: The Logic of Science, p.xxi under the header “Foundations.”
TrE 1 Jan 2013 21:08 UTC
2 points
0
In my opinion, this would be better suited for the Open Thread.
- Meni_Rosenfeld 1 Jan 2013 21:47 UTC
  6 points
  0
  Parent
  Would it? Maybe the question (in its current form) isn’t good, but I think there are good answers for it. Those answers should be prominently searchable.
- David_Gerard 2 Jan 2013 0:50 UTC
  3 points
  0
  Parent
  It’s a short question, but it’s of sufficient local jargon importance to rate a discussion post IMO.
NickCarlough 2 Jan 2013 20:08 UTC
−10 points
0
http://www.wikipedia.org/

Don’t get lost 8D