The Objective Bayesian Programme
Followup to: Bayesian Flame.
This post is a chronicle of my attempts to understand Cyan’s #2. (Bayesian Flame was an approximate parse of #1.) Warning: long, some math, lots of links, probably lots of errors. At the very least I want this to serve as a good reference for further reading.
Introduction
To the mathematical eye, many statistical problems share the following minimal structure:
A space of parameters. (Imagine a freeform blob without assuming any metric or measure.)
A space of possible outcomes. (Imagine another, similarly unstructured blob.)
Each point in the parameter space determines a probability measure on the outcome space.
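In symbols, a minimal sketch of this setup (notation mine):

$$\Theta \ \text{(parameter space)}, \qquad \mathcal{X} \ \text{(outcome space)}, \qquad \theta \mapsto P_\theta, \ \text{a probability measure on } \mathcal{X}.$$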
By itself, this kind of input is too sparse to yield solutions to statistical problems. What additional structure on the spaces should we introduce?
The answer that we all know and love
Assuming some “prior” probability measure on the parameter space yields a solution that’s unique, consistent and wonderful in all sorts of ways. This has led some people to adopt the “subjectivist” position that priors are so basic they ought not to be questioned.
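For reference, a sketch of what the prior buys you: given a prior π on the parameter space and an observed outcome x, Bayes’ theorem yields a posterior

$$\pi(d\theta \mid x) \;\propto\; p_\theta(x)\,\pi(d\theta),$$

where p_θ is a density of P_θ, and every inferential question reduces to computation with this posterior. One of the subjectivist position’s most prominent defenders was Leonard Jimmie Savage, who put forward the following argument: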
Suppose, for example, that the person is offered an even-money bet for five dollars—or, to be ultra-rigorous, for five utiles—that internal combustion engines in American automobiles will be obsolete by 1970. If there is any event to which an objectivist would refuse to attach probability, that corresponding to the obsolescence in question is one… Yet, I think I may say without presumption that you would regard the bet against obsolescence as a very sound investment.
This is a fine argument for using priors when you’re betting money, but there’s a snag: however much you are willing to bet, that doesn’t give you grounds to publish papers about the future you inferred from your intuitive prior! Any a priori information used in science needs to be justified on grounds of scientific objectivity.
(At this point Eliezer raises the suggestion that scientists ought to communicate with likelihood ratios only. That might be a brave new world to live in; too bad we’ll have to stop teaching kids that g ≈ 9.8 m/s² and give them likelihood profiles instead.)
Rather than dive deeper into the fascinating topic of “uninformative priors”, let’s go back to the surface. Take a closer look at the basic formulation above to see what other structures we can introduce instead of priors to get interesting results.
The minimax approach
In the mid-20th century a statistician named Abraham Wald made a valiant effort to step outside the problem. His decision-theoretic framework encompasses both frequentist and Bayesian inference. Roughly, it goes like this: we no longer know our prior probabilities, but we do know our utilities. More concretely, we compute a decision from the observed dataset, and later suffer a loss that depends on our decision and the true parameter value. Substituting different “spaces of decisions” and “loss functions”, we get a wide range of situations to study.
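In symbols (a sketch, notation mine): if δ is a decision rule mapping data to decisions and L is the loss function, the “risk” of δ at parameter θ is its expected loss,

$$R(\theta, \delta) \;=\; \mathbb{E}_{X \sim P_\theta}\!\left[\, L(\theta, \delta(X)) \,\right].$$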
But wait! Doesn’t the “optimal” decision depend on the prior distribution of parameters as well?
Wald’s crucial insight was that… no, not necessarily.
If we don’t know the prior and are trying to be “scientifically objective”, it makes sense to treat the problem of statistical inference as a game. The statistician chooses a decision rule, Nature chooses a true parameter value, randomness determines the payoff. Since the game is zero-sum, we can reasonably expect it to have a minimax value: there’s a decision rule that minimizes the maximum loss the statistician can suffer, whatever Nature may choose.
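In the risk notation above, a minimax decision rule δ* is any solution of

$$\delta^\ast \;\in\; \arg\min_{\delta}\, \sup_{\theta}\, R(\theta, \delta).$$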
Now, as Ken Binmore accurately noted, in real life you don’t minimax unless “your relationship with the universe has reached such a low ebb that you keep your pants up with both belt and suspenders”, so the minimax principle gives off a whiff of the paranoia that we’ve come to associate with frequentism. Haha, gotcha! Wald’s results apply to Bayesianism just as well. His “complete class theorem” proves that Bayesian-rational strategies with well-defined priors constitute precisely the class of non-dominated strategies in the game described. (If you squint the right way, this last sentence compresses the whole philosophical justification of Bayesianism.)
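A sketch of that last sentence in symbols (glossing over regularity conditions, under which one may need limits of Bayes rules): writing r(π, δ) for the Bayes risk of δ under prior π,

$$r(\pi, \delta) = \int_\Theta R(\theta, \delta)\, d\pi(\theta); \qquad \delta \ \text{non-dominated} \;\Rightarrow\; \delta \in \arg\min_{\delta'}\, r(\pi, \delta') \ \text{for some prior } \pi.$$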
The game-theoretic approach gives our Bayesian friends even more than that. The statistical game’s minimax decision rules often correspond to Bayes strategies with a certain uninformative prior, called the “least favorable prior” for that risk function. This gives you a frequentist-valid procedure that also happens to be Bayesian, which means immunity to Dutch books, negative masses and similar criticisms. In a particularly fascinating convergence, the well-known “reference prior” (the Jeffreys prior properly generalized to N dimensions) turns out to be asymptotically least favorable when optimizing the Shannon mutual information between the parameter and the sample.
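Sketched in the same notation: the least favorable prior π* is Nature’s maximin strategy, and when the game has a value (the equality below holds), the Bayes rule for π* is the statistician’s minimax rule:

$$\pi^\ast \in \arg\max_{\pi}\, \inf_{\delta}\, r(\pi, \delta), \qquad \inf_{\delta}\, \sup_{\theta}\, R(\theta, \delta) \;=\; \sup_{\pi}\, \inf_{\delta}\, r(\pi, \delta).$$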
At this point the Bayesians in the audience should be rubbing their hands. I told ya it would be fun! Our frequentist friends on the other hand have dozed off, so let’s pull another stunt to wake them up.
Confidence coverage demystified
Informally, we want to say things about the world like “I’m 90% sure that this physical constant lies within those bounds” and be actually right 90% of the time when we say such things.
...Semi-formally, we want a procedure that calculates from each sample a “confidence subset” of the parameter space, such that these subsets include the true parameter value with probability greater than or equal to 90%, while the sets themselves are as small as possible.
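In symbols (a sketch, notation mine): writing C(X) for the confidence set computed from sample X, we demand

$$P_\theta\big(\theta \in C(X)\big) \;\ge\; 0.9 \quad \text{for every } \theta,$$

while keeping C(X) as small as possible.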
(NB: this is not equivalent to deriving a “correct” posterior distribution on the parameter space. Not every method of choosing small subsets with given posterior masses will give you uniformly correct confidence coverage, and each such method corresponds to many different posterior distributions in the N-dimensional case.)
...Formally, we introduce a new structure on the parameter space—a “not-quite-measure” to determine the size of confidence sets—and then, upon receiving a sample, determine from it a 90% confidence set with the smallest possible “not-quite-measure”.
(NB: I’m calling it “not-quite-measure” because of a subtlety in the N-dimensional case. If we’re estimating just one parameter out of several, the “measure” corresponds to span in that coordinate and thus is not additive under set union, hence “not-quite”.)
Except this doesn’t work. There might be two procedures to compute confidence sets, the first of which is sometimes better and sometimes worse than the second. We have no comparison function to determine the winner, and in reality the “uniformly most accurate” procedure doesn’t always exist.
But if we replace the “size” of the confidence set with its expected size under each single parameter value, this gives us just enough information to apply the game-theoretic minimax approach. Solving the game thus gives us “minimax expected size” confidence sets, or MES, that people are actually using. Which isn’t saying much, but still.
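To make “be actually right 90% of the time” concrete, here is a minimal simulation sketch in Python (my own illustration, not the MES construction; the textbook interval for a normal mean with known σ stands in for a generic confidence procedure):

```python
# Sketch: check frequentist coverage of the standard 90% interval for the
# mean of a normal with known sigma. Whatever value Nature picks for mu,
# the interval should contain it about 90% of the time.
import random
import statistics

Z90 = 1.6449  # standard normal quantile for a two-sided 90% interval

def interval(sample, sigma):
    """90% confidence interval for the mean, sigma known."""
    m = statistics.fmean(sample)
    half = Z90 * sigma / len(sample) ** 0.5
    return m - half, m + half

def coverage(mu, sigma=1.0, n=20, trials=10_000):
    hits = 0
    for _ in range(trials):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        lo, hi = interval(sample, sigma)
        hits += lo <= mu <= hi
    return hits / trials

for mu in (-3.0, 0.0, 7.5):      # Nature's choice shouldn't matter
    print(mu, coverage(mu))      # each prints roughly 0.90
```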
More on subjectivity
The minimax principle sounds nice, but the construction of the least favorable prior distribution for any given experiment and risk function has a problem: it typically depends on the whole sample space and thus on the experiment’s stopping rule. When do we stop gathering data? What subsets of observed samples do we thus rule out? In the general case the least favorable prior depends on the number of samples we intend to draw! This blatantly violates the likelihood principle that Eliezer so eloquently defended.
But, ordinary probability theory tells us unambiguously that 90% of your conclusions will be true whatever stopping rules you choose for each of them, as long as you choose before observing any data from the experiments. (Otherwise all bets are off, like if you’d decided to pick your Bayesian prior based on the data.) But, the conclusions themselves will be different from rule to rule. But, you cannot deliberately engineer a situation where the minimax of one stopping rule reliably makes you more wrong than another one...
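A minimal sketch of the “otherwise all bets are off” clause (my own illustration): if you peek at the data and stop whenever the 90% interval happens to exclude the truth, the error rate blows well past the nominal 10%:

```python
# Sketch: optional stopping breaks coverage. True mean is 0; we peek after
# every observation and stop as soon as the 90% interval excludes 0,
# giving up after n_max observations.
import random

Z90 = 1.6449

def ever_excludes_truth(n_max=200):
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(0.0, 1.0)
        mean = total / n
        half = Z90 / n ** 0.5
        if not (mean - half <= 0.0 <= mean + half):
            return True  # we'd stop here and report a wrong interval
    return False

trials = 10_000
rate = sum(ever_excludes_truth() for _ in range(trials)) / trials
print(rate)  # substantially above the nominal 0.10 error rate
```

A stopping rule fixed before seeing any data, by contrast, leaves the 90% guarantee intact; that is the curious fact at issue here.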
Does this look more like an eternal mathematical law or an ad hoc tool? To me it looks like a mystery. Like frequentists are trying to solve a problem that Bayesians don’t even attempt to solve. The answer is somewhere out there; we can guess that something like today’s Bayesianism will be a big part of it, but not the only part.
Conclusion
When some field is afflicted with deep and persistent philosophical conflicts, this isn’t necessarily a sign that one of the sides is right and the other is just being silly. It might be a sign that some crucial unifying insight is waiting several steps ahead. Minimaxing doesn’t look to me like the beginning and end of “objective” statistics, but the right answer that we don’t know yet has got to be at least this normal.
Further reading: James Berger, The Case for Objective Bayesian Analysis.
Comments
Savage’s argument doesn’t seem to me to be an “argument for using priors” but an argument for interpreting probability theory more broadly than strict frequentists do. (Or, kinda equivalently modulo terminology, for permitting yourself to make betting decisions using tools more general than probability theory.)
Do you mean most favorable? If not, I am very confused.
“Least favorable” here is the “min” part of minimax. (The max part is doing the best you can with this least favorable prior.)
Nope, it’s least. :-) The only way to actually unconfuse yourself in this subject is to follow the math carefully. Here’s a Google Books link that might help. (The URL is long and weird, does it open on your computer?)
URL opens, but it is not especially useful as the odd numbered pages are not displayed by Google books. I suppose it would be a good exercise to try to infer the missing information, but life is short and the way is long.
More useful was the Clarke–Barron 1994 paper, which states that the Jeffreys prior asymptotically maximizes the Shannon mutual information between a sample of size N and the parameter.
That agrees with my intuition.