is there a simple explanation of the conflict between bayesianism and frequentism? I have sort of a feel for it from reading background materials, but a specific example where they yield different predictions would be awesome. has such an example already been posted before?
Eliezer’s views as expressed in Blueberry’s links touch on a key identifying characteristic of frequentism: the tendency to think of probabilities as inherent properties of objects. More concretely, a pure frequentist (a being as rare as a pure Bayesian) treats probabilities as proper only to outcomes of a repeatable random experiment. (The definition of such a thing is pretty tricky, of course.)
What does that mean for frequentist statistical inference? Well, it’s forbidden to assign probabilities to anything that is deterministic in your model of reality. So you have estimators, which are functions of the random data and thus random themselves, and you assess how good they are for your purpose by looking at their sampling distributions. You have confidence interval procedures, the endpoints of which are random variables, and you assess the sampling probability that the interval contains the true value of the parameter (and the width of the interval, to avoid pathological intervals that have nothing to do with the data). You have statistical hypothesis testing, which categorizes a simple hypothesis as “rejected” or “not rejected” based on a procedure assessed in terms of the sampling probability of an error in the categorization. You have, basically, anything you can come up with, provided you justify it in terms of its sampling properties over infinitely repeated random experiments.
Here is a more general definition of “pure frequentism” (which includes frequentists such as Reichenbach):
Consider an assertion of probability of the form “This X has probability p of being a Y.” A frequentist holds that this assertion is meaningful only if the following conditions are met:
The speaker has already specified a determinate set X of things that actually have or will exist, and this set contains “this X”.
The speaker has already specified a determinate set Y containing all things that have been or will be Ys.
The assertion is true if the proportion of elements of X that are also in Y is precisely p.
A few remarks:
The assertion would mean something different if the speaker had specified different sets X and Y, even though X and Y aren’t mentioned explicitly in the assertion.
If no such sets had been specified in the preceding discourse, the assertion by itself would be meaningless.
However, the speaker has complete freedom in what to take as the set X containing “this X”, so long as the set does contain this particular X. In particular, the other elements don’t have to be exactly like this X, or be generated by exactly the same repeatable procedure, or anything like that. There are practical constraints on X, though. For example, X should be an interesting set.
[ETA:] An important distinction between Bayesianism and Frequentism is this: Note that, according to the above, the correct probability has nothing to do with the state of knowledge of the speaker. Once the sets X and Y are determined, there is an objective fact of the matter regarding the proportion of things in X that are also in Y. The speaker is objectively right or wrong in asserting that this proportion is p, and that rightness or wrongness has nothing to do with what the speaker knows. It has only to do with the objective frequency of elements of Y among the elements of X.
I’m sorry to see such wrongheaded views of frequentism here. Frequentists also assign probabilities to events where probability is introduced entirely on the basis of limited information rather than a literal randomly generated phenomenon. If Fisher or Neyman were ever actually read by people purporting to understand frequentist/Bayesian issues, they’d have a radically different idea. Readers of this blog should take it upon themselves to check out some of the vast oversimplifications… And I’m sorry, but Reichenbach’s frequentism has very little to do with frequentist statistics. Reichenbach, a philosopher, had the idea that propositions have frequentist probabilities. So scientific hypotheses—which would not be assigned probabilities by frequentist statisticians—could have frequentist probabilities for Reichenbach, even though he didn’t think we knew enough yet to judge them. He thought that at some point we’d be able to judge, for a hypothesis of a given type, how frequently hypotheses like it turn out to be true. I think it’s a problematic idea, but my point was just to illustrate that some large issues are being misrepresented here, and people sold a wrongheaded view. Just in case anyone cares. Sorry to interrupt the conversation. (errorstatistics.com)
Do you intend to be replying to me or to Tyrrell McAllister?
Wait—Bayesians can assign probabilities to things that are deterministic? What does that mean?
What would a Bayesian do instead of a T-test?
Absolutely!
The Bayesian philosophy is that probabilities are about states of knowledge. Probability is reasoning with incomplete information, not about whether an event is “deterministic”; probabilities still make sense in a completely deterministic universe. In a poker game, there are almost surely no quantum events influencing how the deck is shuffled. Classical mechanics, which is deterministic, suffices to predict the ordering of cards. Even so, we have neither sufficient initial conditions (on all the particles in the dealer’s body and brain, and any incoming signals), nor the computational power, to calculate the ordering of the cards. In this case, we can still use probability theory to figure out probabilities of various hand combinations that we can use to guide our betting. Incorporating knowledge of what cards I’ve been dealt, and what (if any) are public, is straightforward. Incorporating players’ actions and reactions is much harder, and not really well enough defined that there is a mathematically correct answer, but clearly we should use that knowledge in determining what types of hands we think it likely for our opponents to have. If we count the shuffles as the dealer shuffles the deck, and see that he only shuffled three or four times, in principle we can (given a reasonable mathematical model of shuffling, such as the one Diaconis constructed to give the result that 7 shuffles are needed to randomize a deck) use the correlations left in the deck to give us even more clues about opponents’ likely hands.
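To make the easy part concrete, here is a minimal sketch of conditioning on the cards I can see; the hole cards and board below are an invented example, and the point is only that the probability lives entirely in what I do not know, not in the (already determined) deck order.

```python
import itertools

# Build a standard 52-card deck ("As" = ace of spades, "Td" = ten of diamonds, ...).
RANKS = "23456789TJQKA"
SUITS = "shdc"
deck = [r + s for r in RANKS for s in SUITS]

# What I can condition on: my hole cards and the visible board (invented example).
my_hole = ["As", "Kd"]
board = ["Ah", "7c", "2d"]
unseen = [c for c in deck if c not in my_hole + board]

# Probability that one opponent's two hidden cards include at least one ace,
# given only what I can see (two aces remain somewhere among the unseen cards).
hands = list(itertools.combinations(unseen, 2))
with_ace = sum(1 for hand in hands if any(card[0] == "A" for card in hand))
print(f"P(opponent holds an ace | what I can see) = {with_ace / len(hands):.3f}")
```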
In most cases we’d step back, and ask what you were trying to do, such that a T-test seemed like a good idea.
For those unaware, a t-test is a way of calculating a p-value for the null hypothesis, which measures how likely data at least as extreme as what we saw would be, given that model. If the data are even moderately compatible, Frequentists say “we can’t reject it”. If they are terribly unlikely, the Frequentists say that the null can be rejected—that it’s worth looking at another model.
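For concreteness, a minimal sketch of that recipe: a one-sample t-test of the null hypothesis “the true mean is zero”, with an invented sample and the conventional 0.05 cutoff (both are illustrative choices of mine, not anything fixed by the discussion).

```python
from scipy import stats

# Invented sample; the null hypothesis is that the true mean is 0.
sample = [0.4, 1.1, -0.2, 0.9, 0.7, 1.3, 0.5, 0.8]

t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# The frequentist decision rule described above, at a conventional 0.05 level.
if p_value < 0.05:
    print("Data this extreme would be unlikely under the null: reject it.")
else:
    print("Data are compatible enough with the null: fail to reject.")
```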
From a Bayesian perspective, this is somewhat backwards—we don’t really care how likely the data are given this model, P(D|M) -- after all, we actually got the data. We effectively want to know how useful the model is, now that we know this data. Some simple consistency requirements and scaling constraints mean that this usefulness has to act just like a probability. So let’s just call it the probability of the model, given the data: P(M|D). A small bit of algebra gives us that P(M|D) = P(D|M) * P(M)/P(D), where P(D) is the sum over all models i of P(D|M_i) P(M_i), and P(M_i) is some “prior probability” of each model—how useful we think that model would be, even without any data collected (but, importantly, with some background knowledge).
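As a minimal sketch of that bookkeeping, here is the P(M|D) = P(D|M) P(M) / P(D) calculation over a small invented model space (three coin biases with a flat prior); the models, prior, and data are illustrative assumptions, not anything from the thread.

```python
from math import prod

# Three candidate "models" of a coin, each a value of P(heads), with a flat prior.
models = {"fair": 0.5, "heads-biased": 0.8, "tails-biased": 0.2}
prior = {name: 1 / 3 for name in models}            # P(M_i)

data = [1, 1, 0, 1, 1, 1, 0, 1]                     # 1 = heads, 0 = tails

def likelihood(p_heads, flips):
    """P(D|M): probability of this exact flip sequence under a given bias."""
    return prod(p_heads if f else 1 - p_heads for f in flips)

# P(D) = sum_i P(D|M_i) P(M_i), then P(M_i|D) = P(D|M_i) P(M_i) / P(D).
joint = {name: likelihood(p, data) * prior[name] for name, p in models.items()}
p_data = sum(joint.values())
posterior = {name: j / p_data for name, j in joint.items()}

for name, post in posterior.items():
    print(f"P({name} | data) = {post:.3f}")
```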
In this framework, we don’t have absolute objective levels of confidence in our theories. All that is absolute and objective is how the data should change our confidence in various theories. We can’t just reject a theory if the data don’t match well, unless we have a better alternative theory to which we can switch. In many cases these models can be continuously indexed, such that the index corresponds to a parameter in a unified model; then this becomes parameter estimation—we get a range of theories with probability densities instead of probabilities, or equivalently, one theory with a probability density on a parameter, and getting new data mechanically turns a crank to give us a new probability density on this parameter.
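A sketch of the continuously indexed case, using the standard Beta-Bernoulli conjugate pair for a coin’s bias; the uniform Beta(1, 1) starting density and the flip sequence are illustrative assumptions.

```python
# The "index" is now the bias p itself, with a Beta(a, b) density over it.
a, b = 1.0, 1.0                    # Beta(1, 1): uniform density over p
flips = [1, 1, 0, 1, 0, 1, 1]      # 1 = heads, 0 = tails

# Each observation turns the crank: heads bumps a, tails bumps b.
for f in flips:
    if f:
        a += 1
    else:
        b += 1

print(f"posterior density over p is Beta({a:.0f}, {b:.0f}); posterior mean {a / (a + b):.3f}")
```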
There are a couple unsatisfying bits here:
First it really would be nice to say “this theory is ridiculous because it doesn’t explain the data” without any reference to any other theory. But if we know it’s the only theory in town, we don’t have a choice. If it’s not the only theory in town, then how useful it is can really only coherently be measured relative to how useful other theories are.
Second, we need to give “prior probabilities” to our various theories, and the math doesn’t give any direct justifications for what these should be. However, as long as these aren’t crazy, the incoming data will continuously update these so that the ones that seem more useful will get weighted as more useful, and the ones that aren’t will get weighted as less useful. This of course means we need reasonable spaces of theories to work over, and we’ll only pick a good model if we have a good model in this space of theories. If you eventually realize that “hey, all these models are crappy”, there is no good way of expanding the set of models you’re willing to consider, though a common way is to just “start over” with an expanded model space, and reallocated prior probabilities. You can’t just pretend that the first analysis was over some subset of this analysis, because the rescaling due to the P(D) term depends on the set of models you have. (Though you can handwave that you weren’t actually calculating P(M_i|D), but P(M_i|D, {M}), the probability of each model given the data, assuming that it was one of these models).
A sometimes useful shortcut, rather than working directly with the probabilities (and hence needing the rescaling), is to work with the likelihoods (or, more tractably, the log of them). The difference of the log-likelihoods of two different theories for some data is a reasonable measure of how much that data should affect their relative ranking. But any given likelihood by itself hasn’t much meaning—only comparison with the rest of the set tells you anything useful.
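A sketch of that shortcut, comparing two invented Gaussian “theories” on made-up data; only the difference of the log-likelihoods is reported, since either number on its own means little.

```python
import math

def log_likelihood_gaussian(data, mu, sigma):
    """Sum of log N(x | mu, sigma^2) over the data points."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

data = [0.3, -0.1, 0.4, 0.2, 0.5]

ll_a = log_likelihood_gaussian(data, mu=0.0, sigma=1.0)   # theory A
ll_b = log_likelihood_gaussian(data, mu=0.3, sigma=1.0)   # theory B

# Only the difference matters for the relative ranking of A and B.
print(f"log-likelihood difference (B - A): {ll_b - ll_a:.3f}")
```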
Very nice! I’d only replace “useful” with “plausible”. (Sure, it’s hard to define plausibility, but usefulness is not really the right concept.)
“Usefulness” certainly isn’t the orthodox Bayesian phrasing. I call myself a Bayesian because I recognize that Bayes’s Rule is the right thing to use in these situations. Whether or not the probabilities assigned to hypotheses “actually are” probabilities (whatever that means), they should obey the same mathematical rules of calculation as probabilities.
But precisely because only the manipulation rules matter, I’m not sure it is worth emphasizing that “to be a good Bayesian” you must accord these probabilities the same status as other probabilities. A hardcore Frequentist is not going to be comfortable doing that. Heck, I’m not sure I’m comfortable doing that. Data and event probabilities are things that can eventually be “resolved” to true or false, by looking after the fact. Probability as plausibility makes sense for these things.
But for hypotheses and models, I ask myself “plausibility of what? Being true?” Almost certainly, the “real” model (when that even makes sense) isn’t in our space of models. For example, a common, almost necessary, assumption is exchangeability: that the joint probability of the data is unchanged under any permutation of the observations—effectively that all data points are drawn from the same distribution. Data often don’t behave like that, instead drifting over time. Coins being tossed develop wear, cards being shuffled and dealt get bent.
I really do prefer to think of some models being more or less useful. Of course, following this path shades into decision theory: we might want to assign priors according to how “tractable” the models are, both in specification (stupid models that just specify what the data will be take lots of specification, so should have lower initial probabilities) and in computation. Models that take longer to compute data probabilities should similarly have a probability penalty, not simply because they’re implausible, but because we don’t want to use them unless the data force us to.
Whoa! That sounds dangerous! Why not keep the beliefs and costs separate and only apply this penalty at the decision theory stage?
Well, I did say it shades into decision theory...
Yes, it absolutely is dangerous, and thinking about it more I agree it should not be done this way. Probability penalties do not scale correctly with the data collected: they’re essentially just a fixed offset. A modified utility for using a particular method really is a different thing. If a method is unusable, we shouldn’t use it, and trade-offs between accuracy and manageability should be decided at that level, once we can judge the accuracy—not earlier.
EDIT: I suppose I was hoping for a valid way of justifying the fact that we throw out models that are too hard to use or analyze—they never make it into our set of hypotheses in the first place. It’s amazing how often conjugate priors “just happen” to be chosen...
Plausibility of being true given the prior information. Just as Aristotelian logic gives valid arguments (but not necessarily sound ones), Bayes’s theorem gives valid but not necessarily sound plausibility assessments.
That’s pretty much why I wanted to make the distinction between plausibility and usefulness. One of the things I like about the Cox-Jaynes approach is that it cleanly splits inference and decision-making apart.
Okay, sure we can go back to the Bayesian mantra of “all probabilities are conditional probabilities”. But our prior information effectively includes the statement that one of our models is the “true one”. And that’s never the actual case, so our arguments are never sound in this sense, because we are forced to work from prior information that isn’t true. This isn’t a huge problem, but it in some sense undermines the motivation for finding these probabilities and treating them seriously—they’re conditional probabilities being applied in a case where we know that what is being conditioned on is false. What is the grounding to our actual situation? I like to take the stance that in practice this is still useful—as an approximation procedure—sorting through models that are approximately right.
One does generally resort to non-Bayesian model checking methods. Andrew Gelman likes to include such checks under the rubric of “Bayesian data analysis”; he calls the computing of posterior probabilities and densities “Bayesian inference”, a preceding subcomponent of Bayesian data analysis. This makes for sensible statistical practice, but the underpinnings aren’t strong. One might consider it an attempt to approximate the Solomonoff prior.
Yes, in practice people resort to less motivated methods that work well.
I’d really like to see some principled answer that has the same feel as Bayesianism though. As it stands, I have no problem using Bayesian methods for parameter estimation. This is natural because we really are getting pdf(parameters | data, model). But for model selection and evaluation (i.e. non-parametric Bayes) I always feel that I need an “escape hatch” to include new models that the Bayes formalism simply doesn’t have any place for.
I feel the same way.
I am much more comfortable leaving probability as it is but using a different term for usefulness.
the tendency to think of probabilities as inherent properties of objects.
yeah, this was my intuitive reason for thinking frequentists are a little crazy.
On the other hand, it’s evidence to me that we’re talking about different types of minds. Have we identified whether this aspect of frequentism is a choice, or just the way their minds work?
I’m a frequentist, I think, and when I interrogate my intuition about whether 50% heads / 50% tails is a property of a fair coin, it returns ‘yes’. However, I understand that this property is an abstract one, and my intuition doesn’t make any different empirical predictions about the coin than a Bayesian would. Thus, what difference does it make if I find it natural to assign this property?
In other words, in what (empirically measurable!) sense could it be crazy?
http://comptop.stanford.edu/preprints/heads.pdf
Well, the immediate objection is that if you hand the coin to a skilled tosser, the frequencies of heads and tails in the tosses can be markedly different from 50%. If you put this probability in the coin, then you really aren’t modeling things in a manner that accords with results. You can, of course, talk instead about a procedure of coin-tossing, which naturally has to specify the coin as well.
Of course, that merely pushes things back a level. If you completely specify the tossing procedure (people have built coin-tossing machines), then you can repeatedly get 100%/0% splits by careful tuning. If you don’t know whether it is tuned to 100% heads or 100% tails, is it still useful to describe this situation probabilistically? A hard-core Frequentist “should” say no, as everything is deterministic. Most people are willing to allow that 50% probability is a reasonable description of the situation. To the extent that you do allow this, you are Bayesian. To the extent that you don’t, you’re missing an apparently valuable technique.
The frequentist can account for the biased toss and determinism, in various ways.
My preferred reply would be that the 50⁄50 is a property of the symmetry of the coin. (Of course, it’s a property of an idealized coin. Heck, a real coin can land balanced on its edge.) If someone tosses the coin in a way that biases the coin, she has actually broken the symmetry in some way with her initial conditions. In particular, the tosser must begin with the knowledge of which way she is holding the coin—if she doesn’t know, she can’t bias the outcome of the coin.
I understand that Bayesians don’t tend to abstract things to their idealized forms … I wonder to what extent Frequentism does this necessarily. (What is the relationship between Frequentism and Platonism?)
Oh, absolutely. The typical way is choosing some reference class of idealized experiments that could be done. Of course, the right choice of reference class is just as arbitrary as the right choice of Bayesian prior.
Whereas the Bayesian would argue that the 50⁄50 property is a symmetry in our knowledge of the coin—it holds even for a coin that you know is biased, but for which you have no evidence about which way it is biased.
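A tiny sketch of that point, with an arbitrary 80/20 bias for illustration: we know the coin is biased one way or the other but have no evidence which way, so the predictive probability of heads starts at exactly 0.5, and the symmetry breaks as soon as we see a toss.

```python
# Two possible coins, and a symmetric state of knowledge about which one we hold.
biases = {"heads-heavy": 0.8, "tails-heavy": 0.2}     # P(heads) under each possibility
belief = {"heads-heavy": 0.5, "tails-heavy": 0.5}

# Predictive P(heads) before any toss: exactly 0.5, though neither coin is fair.
p_heads = sum(belief[m] * biases[m] for m in biases)
print(f"P(heads) before any toss = {p_heads:.2f}")

# Observe one head; Bayes' rule updates the beliefs and the symmetry is gone.
unnorm = {m: belief[m] * biases[m] for m in biases}
z = sum(unnorm.values())
belief = {m: v / z for m, v in unnorm.items()}
p_heads = sum(belief[m] * biases[m] for m in biases)
print(f"P(heads) after seeing one head = {p_heads:.2f}")
```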
Well, I don’t think Bayesians are particularly reluctant to look at idealized forms, it’s just that when you can make your model more closely match the situation (without incurring horrendous calculational difficulties) there is a benefit to do so.
And of course, the question is “which idealized form?” There are many ways to idealize almost any situation, and I think talking about “the” idealized form can be misleading. Talking about a “fair coin” is already a serious abstraction and idealization, but it’s one that has, of course, proven quite useful.
That’s a very interesting question.
To quote from Gelman’s rejoinder that Phil Goetz mentioned,
In a nutshell: Bayesian statistics is about making probability statements, frequentist statistics is about evaluating probability statements.
So, speaking very loosely, Bayesianism is to science, inductive logic, and Aristotelianism as frequentism is to math, deductive logic, and Platonism. That is, Bayesianism is synthesis; frequentism is analysis.
Interesting! That makes a lot of sense to me, because I had already made connections between science and Aristotelianism, pure math and Platonism.
This and this might be the kind of thing you’re looking for.
Though the conflict really only applies in the artificial context of a math problem. Frequentism is more like a special case of Bayesianism where you’re making certain assumptions about your priors, sometimes specifically stated in the problem, for ease of calculation. For instance, in a Frequentist analysis of coin flips, you might ignore all your prior information about coins, and assume the coin is fair.
thanks, that’s what I was looking for. would it be correct to say that in the frequentist interpretation your confidence interval narrows as your trials approach infinity?
That is a highly desired property of Frequentist methods, but it’s not guaranteed by any means.
If it helps, I think this is an example of a problem where they give different answers to the same problem. From Jaynes; see http://bayes.wustl.edu/etj/articles/confidence.pdf , page 22 for the details, and please let me know if I’ve erred or misinterpreted the example.
Three identical components. You run them through a reliability test and they fail at times 12, 14, and 16 hours. You know that these components fail in a particular way: they last at least X hours, then have a lifetime that you assess as an exponential distribution with an average of 1 hour. What is the shortest 90% confidence interval / probability interval for X, the time of guaranteed safe operation?
Frequentist 90% confidence interval: 12.1 hours − 13.8 hours
Bayesian 90% probability interval: 11.2 hours − 12.0 hours
Note: the frequentist interval has the strange property that we know for sure that the 90% confidence interval does not contain X (from the data we know that X ≤ 12). The Bayesian interval seems to match our common sense better.
Heh, that’s a cheeky example. To explain why it’s cheeky, I have to briefly run through it, which I’ll do here (using Jaynes’s symbols so whoever clicked through and has pages 22-24 open can directly compare my summary with Jaynes’s exposition).
Call N the sample size and θ the minimum possible widget lifetime (what bill calls X). Jaynes first builds a frequentist confidence interval around θ by defining the unbiased estimator θ∗, which is the observations’ mean minus one. (Subtracting one accounts for the sample mean being >θ.) θ∗’s probability distribution turns out to be y^(N-1) exp(-Ny), where y = θ∗ - θ + 1. Note that y is essentially a measure of how far our estimator θ∗ is from the true θ, so Jaynes now has a pdf for that. Jaynes integrates that pdf to get y’s cdf, which he calls F(y). He then makes the 90% CI by computing [y1, y2] such that F(y2) - F(y1) = 0.9. That gives [0.1736, 1.8259]. Substituting in N and θ∗ for the sample and a little algebra (to get a CI corresponding to θ∗ rather than y) gives his θ CI of [12.1471, 13.8264].
For the Bayesian CI, Jaynes takes a constant prior, then jumps straight to the posterior being N exp(N(θ - x1)), where x1 is the smallest lifetime in the sample (12 in this case). He then comes up with the smallest interval that encompasses 90% of the posterior probability, and it turns out to be [11.23, 12].
Jaynes rightly observes that the Bayesian CI accords with common sense, and the frequentist CI does not. This comparison is what feels cheeky to me.
Why? Because Jaynes has used different estimators for the two methods [edit: I had previously written here that Jaynes implicitly used different estimators, but this is actually false; when he discusses the example subsequently (see p. 25 of the PDF) he fleshes out this point in terms of sufficient v. non-sufficient statistics.]. For the Bayesian CI, Jaynes effectively uses the minimum lifetime as his estimator for θ (by defining the likelihood to be solely a function of the smallest observation, instead of all of them), but for the frequentist CI, he explicitly uses the mean lifetime minus 1. If Jaynes-as-frequentist had happened to use the maximum likelihood estimator—which turns out to be the minimum lifetime here—instead of an arbitrary unbiased estimator he would’ve gotten precisely the same result as Jaynes-as-Bayesian.
So it seems to me that the exercise just demonstrates that Bayesianism-done-slyly outperformed frequentism-done-mindlessly. I can imagine that if I had tried to do the same exercise from scratch, I would have ended up faux-proving the reverse: that the Bayesian CI was dumber than the frequentist’s. I would’ve just picked up a boring, old-fashioned, not especially Bayesian reference book to look up the MLE, and used its sampling distribution to get my frequentist CI: that would’ve given me the common sense CI [11.23, 12]. Then I’d construct the Bayesian CI by mechanically defining the likelihood as the product of the individual observations’ likelihoods. That last step, I am pretty sure but cannot immediately prove, would give me a crappy Bayesian CI like [12.1471, 13.8264], if not that very interval.
Ultimately, at least in this case, I reckon your choice of estimator is far more important than whether you have a portrait of Bayes or Neyman on your wall.
[Edited to replace my asterisks with ∗ so I don’t mess up the formatting.]
This example really is Bayesianism-done-straightforwardly. The point is that you really don’t need to be sly to get reasonable results.
A constant prior ends up using only the likelihoods. The jump straight to the posterior is a completely mechanical calculation, just products, and normalization.
Each individual likelihood vanishes whenever θ exceeds that observation (x < θ), so the product vanishes whenever θ exceeds the smallest observation (x1 < θ). You will get out the same PDF as Jaynes. CIs can be constructed many ways from PDFs, but constructing the smallest one will give you the same one as Jaynes.
EDIT: for full effect, please do the calculation yourself.
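Taking that invitation literally, here is a rough numeric sketch of the whole example: the Bayesian interval in closed form, and the frequentist interval from the sampling distribution of the mean-minus-one estimator. The “shortest interval” search below is my own simple grid scan, so expect numbers in the same ballpark as those quoted above rather than digit-for-digit agreement.

```python
import math

data = [12.0, 14.0, 16.0]
N = len(data)
x1, xbar = min(data), sum(data) / N

# Bayesian side: with a flat prior the posterior is N*exp(N*(theta - x1)) for
# theta <= x1, so the shortest 90% region is [x1 + ln(0.1)/N, x1].
bayes_lo = x1 + math.log(0.10) / N
print("Bayesian 90% interval:  [%.2f, %.2f]" % (bayes_lo, x1))   # roughly [11.23, 12.00]

# Frequentist side: theta* = xbar - 1, and y = xbar - theta is the mean of
# N iid Exp(1) variables, whose CDF has a closed form (Gamma with integer shape).
def F(y):
    if y <= 0:
        return 0.0
    s = N * y
    return 1.0 - math.exp(-s) * sum(s ** k / math.factorial(k) for k in range(N))

# Grid-scan for the shortest [y1, y2] with F(y2) - F(y1) = 0.90.
best = None
for i in range(1, 3000):
    y1 = i * 1e-3
    target = F(y1) + 0.90
    if target >= 1.0:
        break
    lo, hi = y1, 20.0                     # F is increasing, so bisect for y2
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if F(mid) < target:
            lo = mid
        else:
            hi = mid
    y2 = 0.5 * (lo + hi)
    if best is None or (y2 - y1) < (best[1] - best[0]):
        best = (y1, y2)

y1, y2 = best
# theta = xbar - y gives the CI [xbar - y2, xbar - y1]; it lands entirely above 12,
# even though the data guarantee theta <= 12 (the "strange property" noted above).
print("Frequentist 90% CI:     [%.2f, %.2f]" % (xbar - y2, xbar - y1))
```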
I stopped reading cupholder’s comment before the last paragraph (to write my own reply) and completely missed this! D’oh!
Jaynes does go on to discuss everything you have pointed out here. He noted that confidence intervals had commonly been held not to require sufficient statistics, pointed out that some frequentist statisticians had been doubtful on that point, and remarked that if the frequentist estimator had been the sufficient statistic (the minimum lifetime) then the results would have agreed. I think the real point of the story is that he ran through the frequentist calculation for a group of people who did this sort of thing for a living and shocked them with it.
You got me: I didn’t read the what-went-wrong subsection that follows the example. (In my defence, I did start reading it, but rolled my eyes and stopped when I got to the claim that “there must be a very basic fallacy in the reasoning underlying the principle of confidence intervals”.)
I suspect I’m not the only one, though, so hopefully my explanation will catch some of the eyeballs that didn’t read Jaynes’s own post-mortem.
[Edit to add: you’re almost certainly right about the real point of the story, but I think my reply was fair given the spirit in which it was presented here, i.e. as a frequentism-v.-Bayesian thing rather than an orthodox-statisticians-are-taught-badly thing.]
Independently reproducing Jaynes’s analysis is excellent, but calling him “cheeky” for “implicitly us[ing] different estimators” is not fair given that he’s explicit on this point.
It’s a frequentism-v.-Bayesian thing to the extent that correct coverage is considered a sufficient condition for good frequentist statistical inference. This is the fallacy that you rolled your eyes at; the room full of shocked frequentists shows that it wasn’t a strawman at the time. [ETA: This isn’t quite right. The “v.-Bayesian” part comes in when correct coverage is considered a necessary condition, not a sufficient condition.]
ETA:
I suspect I’m not the only one, though, so hopefully my explanation will catch some of the eyeballs that didn’t read Jaynes’s own post-mortem.
This is a really good point, and it makes me happy that you wrote your explanation. For people for whom Jaynes’s phrasing gets in the way, your phrasing bypasses the polemics and lets them see the math behind the example.
I was wrong to say that Jaynes implicitly used different estimators for the two methods. After the example he does mention it, a fact I missed due to skipping most of the post-mortem. I’ll edit my post higher up to fix that error. (That said, at the risk of being pedantic, I did take care to avoid calling Jaynes-the-person cheeky. I called his example cheeky, as well as his comparison of the frequentist CI to the Bayesian CI, kinda.)
When I read Jaynes’s fallacy claim, I didn’t interpret it as saying that treating coverage as necessary/sufficient was fallacious; I read it as arguing that the use of confidence intervals in general was fallacious. That was what made me roll my eyes. [Edit to clarify: that is, I was rolling my eyes at what I felt was a strawman, but a different one to the one you have in mind.] Having read his post-mortem fully and your reply, I think my initial, eye-roll-inducing interpretation was incorrect, though it was reasonable on first read-through given the context in which the “fallacy” statement appeared.
Fair point.
excellent paper, thanks for the link.
My intuition would be that the interval should be bounded above by 12 - epsilon, since it seems unlikely (probability zero?) that we got a component that failed at the theoretically fastest possible time.
You can treat the interval as open at 12.0 if you like; it makes no difference.
If by epsilon, you mean a specific number greater than 0, the only reason to shave off an interval of length epsilon from the high end of the confidence interval is if you can get the probability contained in that epsilon-length interval back from a smaller interval attached to the low end of the confidence interval. (I haven’t worked through the math, and the pdf link is giving me “404 not found”, but presumably this is not the case in this problem.)
The link’s a 404 because it includes a comma by accident—here’s one that works: http://bayes.wustl.edu/etj/articles/confidence.pdf.
Thanks, that makes sense, although it still butts up closely against my intuition.
Andrew Gelman wrote a parody of arguments against Bayesianism here. Note that he says that you don’t have to choose Bayesianism or frequentism; you can mix and match.
I’d be obliged if someone would explain this paragraph, from his response to his parody:
• “Why should I believe your subjective prior? If I really believed it, then I could just feed you some data and ask you for your subjective posterior. That would save me a lot of effort!”: I agree that this criticism reveals a serious incoherence with the subjective Bayesian framework, as well as with the classical utility theory of von Neumann and Morgenstern (1947), which simultaneously demands that an agent can rank all outcomes a priori and expects that he or she will make utility calculations to solve new problems. The resolution of this criticism is that Bayesian inference (and also utility theory) are ideals or aspirations as much as they are descriptions. If there is serious disagreement between your subjective beliefs and your calculated posterior, then this should send you back to re-evaluate your model.