Where did mathematics begin to disagree between frequentist and Bayesian statistics, and why?
I still can’t see the relevance of Bayesian Statistics over Frequentist Statistics, and I take Less Wrong as evidence that this is a cause for clarification.
I’m looking for a historical narrative of the development of mathematics that tells me what mistake led to frequentism over Bayesianism, which is supposedly the correct view. Alternatively, you can just say “Read PT:TLOS!” if it’s that silly of a question.
The development of mathematics isn’t relevant. The question was (and continues to be) what constitutes a valid and/or useful operationalization of mathematical probability.
It was always clear that the relative frequency of events in some suitably defined “random experiment” obeys the probability axioms (even though those axioms weren’t spelled out until Kolmogorov got around to it). John Venn of Venn diagram fame was the first influential promoter of the idea that probability should be restricted to just relative frequency. I think the notion was to exclude anything unobservable. Prior to that, people had treated mathematical probability as equivalent to the colloquial notion of probability without any particular justification.
In the early part of the 20th century, frequentist statistics provided a framework that seemed to permit reasonably well-principled data analysis while excluding subjective or nonsensical prior probability distributions from the scene. The result was a bit of a grab-bag, but practising scientists didn’t have to worry about that: they just consulted with “statisticians”, the newly trained class of professionals whose job it was to know which data analysis recipe should be followed.
Meanwhile, defenders of the “inverse” probability approach (as Bayesian statistics was then known) got busy providing justifications. Bruno de Finetti provided foundations in terms of coherence, which means immunity to Dutch books. Harold Jeffreys took an axiomatic approach. L. J. Savage also took an axiomatic approach, but in contrast to Jeffreys, Savage’s approach was in terms of rational preferences and mixed rational inference and rational decision making together. Perhaps the cleanest approach (and my personal favorite) was that of R. T. Cox (warning: pdf).
Interest in Bayes was revived in the frequentist community by a result in frequentist statistical decision theory known as the complete class theorem. It showed that (subject to some weak conditions) the class of estimators with a certain desirable property called “admissibility” was exactly the class of Bayes estimators. That, plus Savage’s work on Bayesian foundations, led to a resurgence of interest in Bayesian statistics. But Bayesian statistics only really started to gain ground in the early 1990s, when improvements in computing power made practical a class of algorithms collectively called Markov chain Monte Carlo. Suddenly problems that had never been tractable before became doable: complex, high-dimensional non-linear models beyond the mathematical reach of frequentist approaches became practical to analyze by Bayesian methods.
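For readers who have not met Markov chain Monte Carlo, here is a minimal sketch of one member of that family, a random-walk Metropolis sampler. The toy problem (a coin's bias under a uniform prior), the function names, and the tuning values are all assumptions chosen for illustration; real applications of the kind described above involve far larger models.

```python
import math
import random

def log_posterior(theta, heads, flips):
    """Unnormalized log posterior of a coin's bias, assuming a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior mass outside (0, 1)
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)

def metropolis(heads, flips, n_samples=20000, step=0.1, seed=0):
    """Random-walk Metropolis sampler; step size and start point are arbitrary choices."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, flips)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, heads, flips)
        # Accept with probability min(1, posterior ratio); the proposal is symmetric.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples

if __name__ == "__main__":
    draws = metropolis(heads=7, flips=10)[2000:]  # drop an arbitrary burn-in
    print(sum(draws) / len(draws))  # compare with the exact posterior mean, 8/12
```

The sampler only ever needs the posterior up to a constant, which is why it sidesteps the normalizing integral that made such problems intractable analytically.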
Frequentist statistics are like Bayesian statistics with a default set of model-based priors provided, but hidden under a rug. The prior-hiding is bad, because it leaves broken mathematics that can’t be built upon to handle more complex cases. Unfortunately, “you can’t build on this to handle complex cases” is an extremely difficult argument to present convincingly, even when true; and by the time someone knows enough that talking about complex cases is feasible, they’re already locked in to one style or the other.
The hidden priors are also an advantage when publishing papers that aren’t about statistics, because they protect you from arguments over priors that could delay publication. Frequentist statistics also provide an agreed-upon Schelling point for positive results (“95% confidence”). While this undoubtedly helped its adoption immensely, it seems like it’s turning out to be frequentism’s downfall, since this threshold is attainable even for conclusions that are false.
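To see how the threshold can be reached by a false conclusion, here is a rough simulation sketch. It assumes a two-sided z-test with known variance and a null hypothesis that is true in every simulated experiment; the function names, sample sizes, and seed are illustrative choices, not anything from the thread.

```python
import math
import random
from statistics import NormalDist, fmean

def two_sided_p_value(sample, sigma=1.0):
    """z-test p-value for 'the mean is zero', assuming a known standard deviation."""
    z = fmean(sample) * math.sqrt(len(sample)) / sigma
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_experiments=5000, n_obs=30, alpha=0.05, seed=1):
    """Share of experiments that reach significance although the true effect is zero."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]  # the null is true here
        if two_sided_p_value(sample) < alpha:
            hits += 1
    return hits / n_experiments

if __name__ == "__main__":
    print(false_positive_rate())  # should land near alpha
```

By construction about one experiment in twenty clears the bar with no effect present, so a literature that publishes mostly the ones that clear it will contain false positives.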
This is one of the most succinctly informative comments I have seen on this site. The above two paragraphs manage to encapsulate many of the major themes of Less Wrong, all within the context of answering a specific question.
They needn’t be.
That can help in some instances, but it won’t work for everything. In particular, if the problem contains lots of parameters, some of which are of substantive interest and the rest of which are necessary for accurate modelling but are otherwise nuisances, then useful likelihood ratios don’t exist.
In what cases can “95% statistical significance” be useful while appropriately selected and specified likelihood ratios cannot be similarly useful? (Essentially I do not believe you.)
Let me clarify that I’m not defending the notion of statistical significance in data analysis—I’m merely saying that the advice to publish likelihood ratios is not a complete answer for avoiding debate over priors.
I analyzed some data using two versions of a model that had ~6000 interest parameters and ~6000 nuisance parameters. One of my goals was to determine which version was more appropriate for the problem. The strict Bayesian answer is to compare different possible models using the Bayes factor, which marginalizes over every parameter in each version of the model with respect to each version’s prior. Likelihood ratios are no help here.
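For reference, the Bayes factor described here can be written out as follows. The notation is introduced only for this note and is not from the original analysis: D is the data, M_1 and M_2 are the two versions of the model, and θ_k collects all of version k's parameters, interest and nuisance alike.

```latex
B_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)},
\qquad
p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k .
```

A likelihood ratio, by contrast, evaluates p(D | θ_k, M_k) at one chosen θ_k per model, with no integral against the prior. With thousands of parameters per model, that integral is the expensive part.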
It turned out to be a lot easier and more convincing to do a simple residual plot for each version of the model. For one version the residual plot matched the model’s assumptions about the error distribution; for the other, it didn’t. This is a kind of self-consistency check: passing it doesn’t mean that the model is adequate, but failing it definitely means the model is not adequate.
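A numeric stand-in for that kind of check might look like the sketch below. It is not the analysis described above: the data are simulated, the Student-t degrees of freedom and the 3-standard-deviation cutoff are arbitrary assumptions, and a tail count replaces the residual plot. The idea is only that residuals from a Gaussian-error model should put roughly 0.27% of their mass beyond three standard deviations.

```python
import math
import random

def tail_fraction(residuals, k=3.0):
    """Fraction of residuals more than k sample standard deviations from their mean."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / n)
    return sum(abs(r - mean) > k * sd for r in residuals) / n

def student_t(rng, df):
    """Draw a Student-t variate as a standard normal over sqrt(chi-square / df)."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)

def compare_tails(n=20000, df=5, seed=2):
    """Return (Gaussian expectation, tail share of Gaussian data, tail share of t data)."""
    rng = random.Random(seed)
    gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
    heavy = [student_t(rng, df) for _ in range(n)]
    expected = math.erfc(3.0 / math.sqrt(2.0))  # P(|Z| > 3) for a standard normal
    return expected, tail_fraction(gaussian), tail_fraction(heavy)

if __name__ == "__main__":
    print(compare_tails())
```

As with the plot, this is a one-way check: a tail share close to the Gaussian expectation does not establish the model, but a share several times larger says the Gaussian error assumption is not holding.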
(BTW, the usual jargon goes, “statistically significant at the 0.05 / 0.01 / 0.001 level.”)
Your ability to distinguish them that way means that there was a large likelihood ratio from the evidence.
A large likelihood ratio? I have two likelihood functions—at what values of the parameter arguments should I evaluate them when forming the ratio? Given that one of the versions is nested in the other at the boundary of the parameter space (Gaussian errors versus Student-t errors with degrees of freedom fit to the data), what counts as a large enough likelihood ratio to prefer the more general version of the model?
Likelihood ratios are computed at a single point in parameter space. P values are summary values computed over part of the parameter space.
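A toy contrast, for concreteness. The coin-flip numbers and function names below are made up for illustration, and the p-value is the usual textbook construction, which sums the null probability of every outcome no more probable than the one observed. The likelihood ratio, by comparison, needs exactly one parameter value from each hypothesis.

```python
import math

def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def likelihood_ratio(k, n, p_alt, p_null):
    """Ratio of the likelihoods at two specific parameter values."""
    return binom_pmf(k, n, p_alt) / binom_pmf(k, n, p_null)

def two_sided_p_value(k, n, p_null=0.5):
    """Null probability of all outcomes no more probable than the observed one."""
    observed = binom_pmf(k, n, p_null)
    cutoff = observed * (1.0 + 1e-9)  # small tolerance for floating-point ties
    return sum(
        pmf for pmf in (binom_pmf(j, n, p_null) for j in range(n + 1)) if pmf <= cutoff
    )

if __name__ == "__main__":
    # Hypothetical data: 60 heads in 100 flips.
    print(likelihood_ratio(60, 100, p_alt=0.6, p_null=0.5))
    print(two_sided_p_value(60, 100))
```

Note that the ratio changes if a different alternative value is plugged in, which is the difficulty raised earlier in the thread about where to evaluate the two likelihood functions.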
I think you’ll find this post relevant: A History of Bayes’ Theorem
Wikipedia seems to have a reasonable overview, and this overview references Aldrich’s “R. A. Fisher on Bayes and Bayes’ theorem” which gives more information on why Fisher rejected Bayesian reasoning.
Try Sharon McGrayne’s book, The Theory That Would Not Die.
I think this is probably better suited to the open thread, since it’s just a small question.
Meh.
Timing doesn’t work, by multiple decades.
Painfully ignorant comment. Fisher was worried about lots of things, not all of them having to do with Bayes vs frequentist methods.
Grah! I’m pretty sure that sort of trivia is really memetically contagious. Other comments indicate that your statement is very false, so I’m inclined to ask you to be a bit more careful, even though you used the cynicism tag.
This is what I was talking about!
OK, I’ve retracted it.