This is going to sound silly, but...could someone explain frequentist statistics to me?
Here’s my current understanding of how it works:
We’ve got some hypothesis H, whose truth or falsity we’d like to determine. So we go out and gather some evidence E. But now, instead of trying to quantify our degree of belief in H (given E) as a conditional probability estimate using Bayes’ Theorem (which would require us to know P(H), P(E|H), and P(E|~H)), what we do is simply calculate P(E|~H) (techniques for doing this being of course the principal concern of statistics texts), and then place H into one of two bins depending on whether P(E|~H) is below some threshold number (“p-value”) that somebody decided was “low”: if P(E|~H) is below that number, we put H into the “accepted” bin (or, as they say, we reject the null hypothesis ~H); otherwise, we put H into the “not accepted” bin (that is, we fail to reject ~H).
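As a concrete illustration of the contrast described above, here is a minimal sketch for a toy coin-flip experiment. The specific numbers, the 0.05 threshold, and the uniform Beta(1,1) prior on the Bayesian side are assumptions made for this sketch, not anything from the post.

```python
from scipy.stats import binom, beta

n, k = 100, 62  # invented data: 100 flips, 62 heads

# Frequentist recipe described above: probability, under the null hypothesis
# "the coin is fair", of data at least as extreme as observed; reject the null
# if that p-value falls below a pre-chosen threshold.
p_value = 2 * min(binom.cdf(k, n, 0.5), binom.sf(k - 1, n, 0.5))
reject_null = p_value < 0.05

# Bayesian alternative: a posterior over the coin's bias, from which one can
# read off, e.g., the probability that the coin favours heads at all.
posterior = beta(1 + k, 1 + n - k)   # Beta(1,1) prior assumed for the sketch
p_favours_heads = posterior.sf(0.5)

print(p_value, reject_null, p_favours_heads)
```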
Now, if that is a fair summary, then this big controversy between frequentists and Bayesians must mean that there is a sizable collection of people who think that the above procedure is a better way of obtaining knowledge than performing Bayesian updates. But for the life of me, I can’t see how anyone could possibly think that. I mean, not only is the “p-value” threshold arbitrary, not only are we depriving ourselves of valuable information by “accepting” or “not accepting” a hypothesis rather than quantifying our certainty level, but...what about P(E|H)?? (Not to mention P(H).) To me, it seems blatantly obvious that an epistemology (and that’s what it is) like the above is a recipe for disaster—specifically in the form of accumulated errors over time.
I know that statisticians are intelligent people, so this has to be a strawman or something. Or at least, there must be some decent-sounding arguments that I haven’t heard—and surely there are some frequentist contrarians reading this who know what those arguments are. So, in the spirit of Alicorn’s “Deontology for Consequentialists” or ciphergoth’s survey of the anti-cryonics position, I’d like to suggest a “Frequentism for Bayesians” post—or perhaps just a “Frequentism for Dummies”, if that’s what I’m being here.
Non-Bayesianism for Bayesians (based on a poor understanding of Andrew Gelman and Cosma Shalizi)
Lakatos (and Kuhn) are philosophers of science who studied science as scientists actually do it, as opposed to how scientists (at the time) claimed scientists do it. This is in contrast to taking the “scientific method” that we learned in grade school literally. Theories are not rejected at the first evidence that they have failed, they are patched, and so on.
Gelman and Shalizi’s criticism of Bayesian rhetoric (as far as I can make out from their blog posts and the slides of Gelman’s talk) is (explicitly) similar—what Bayesians do is different than what Bayesians say Bayesians do.
In particular, humans (as opposed to ideal, which is to say nonexistent, Bayesians) do not SIMPLY update on the evidence. There are other important steps in the process, such as checking whether, given the new data, your original model still looks reasonable. (This is “posterior predictive model checking”). This step looks a lot like computing a p-value, though Gelman recommends a graphical presentation, rather than condensing to a single number.

In general, the notion of doing research on which priors are decent ones for scientific practice—strong enough to capture knowledge that we really do have, and weak enough to adapt to the evidence, given sufficient evidence—is a non-Bayesian notion; a perfect Bayesian only chooses their prior once, and never changes it. Note that historically, Jaynes worked on heuristics for how to choose a good prior, making him a non-Bayesian.
I saw an example that impressed me (and I can’t find the paper now to cite it!). Suppose you have an urn A, with many balls in it, labeled A, and one ball labeled Z. Also, an urn B, with many (but fewer) balls in it labeled B and one ball labeled C, et cetera, until you finally have an urn Z with the fewest balls in it, labeled Z. If we mix the urns and draw a ball from the mixture, which urn did it probably originally come from?
Suppose (because you’re a computationally-limited Bayesian) that you only include in your model the N highest-probability hypotheses. That is, you include A, B, and C in your model, but you neglect Z—that is, you put zero probability on it. (We can make Z’s pre-evidence probability arbitrarily small, to make this seem reasonable at the time.) When one, or even N, balls turn out to be labeled Z, the model (due to the initial zero probability on Z) continues insisting that the balls came from one of the initially-specified hypotheses.
Of course, you could (and should) do a posterior predictive check, computing the probability that your model assigns to the observed data, and revise your model if the probability says your model is wack. However, that step “looks frequentist”, and isn’t explicitly included in the rhetoric of “Bayesian Statistics = Science”. Bayesians update on the evidence, they don’t revise their models!
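A minimal sketch of both points above. The urn compositions, priors, and observed draws are invented for the illustration (the comment does not specify them), and the count of Z-labelled balls is just one possible test statistic for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = ["A", "B", "C", "Z"]

# Hypothetical P(label | urn) for the hypotheses kept in the model; urn Z has
# been truncated away entirely, i.e. given prior probability exactly zero.
like = {
    "urn A": np.array([0.91, 0.03, 0.03, 0.03]),
    "urn B": np.array([0.03, 0.91, 0.03, 0.03]),
    "urn C": np.array([0.03, 0.03, 0.91, 0.03]),
}
prior = np.array([0.5, 0.3, 0.2])

# Observe a run of balls that are mostly labelled "Z".
data = ["Z"] * 8 + ["A", "B"]

post = prior.copy()
for d in data:
    i = labels.index(d)
    post = post * np.array([like[u][i] for u in like])
    post = post / post.sum()
print(post)  # still spread over urns A-C only: the truncated model keeps insisting on them

# Posterior predictive check: how many "Z" labels would the fitted model expect
# in len(data) draws, compared with the 8 actually seen?
p_z = sum(post[j] * like[u][labels.index("Z")] for j, u in enumerate(like))
reps = rng.binomial(n=len(data), p=p_z, size=10_000)
print(np.mean(reps >= data.count("Z")))  # predictive p-value: essentially zero
```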
Anyway, don’t get caught up in factionalism and tribal us vs. them thinking!
I like your point but not your example.
That isn’t just a computational limitation. It’s an outright bug. Something that assigns 0 to Z is just not even an approximation of a Bayesian. A sane agent with limited resources may, for example, assign a probability to “A,B,C and ‘something else’”. If it explicitly assigned an (arbitrarily close to) 0 to Z then it just fails at life.
Hi. I found the paper containing the example in question—it’s “Bayesians sometimes cannot ignore even very implausible theories”. I don’t understand everything in the paper, but it seems like they’ve anticipated your objection and have another example which explicitly includes a “Something else” case.
Forgive my confusion; I’m a bad statistician of any sort. How do you include ‘something else’ in your model? Don’t you need to at least (for Monte Carlo techniques) be able to generate “forward” from parameters to simulated data?
Or do you include Gelman’s posterior predictive check in the model somehow, so that data that is sufficiently surprising causes a “misspecification alarm” to go off?
I’m not sure of the best way to handle simplifying a model without doing insane things. I do know that if what you are doing amounts to overtly “putting zero probability on it” then what you are doing is a terminal mistake that makes the process distinctly non-Bayesian. I get the impression that the mistakes that Bayesians are trying to correct with their after-the-fact testing of the model are different ones to this one. If common ‘Bayesian statisticians’ do in fact make mistakes that are of this order, then consider me mistaken, but also consider their claims to be ‘Bayesians’ to be, more or less, lies.
If you choose a single model to work with, you are effectively putting zero probability on all other models (that are not contained in your chosen model as sub-models). Gelman’s posterior predictive checks aren’t motivated by this consideration (one of his non-mainstream-for-a-Bayesian stances is that model probabilities aren’t useful). Nevertheless, the checks are directed at identifying ways in which the model fits the data poorly, with an eye to guiding further model elaboration, so they do address this issue in a sense.
Philosophically this is true, but practically speaking, it’s not. Setting certain posterior probabilities to zero can be a good approximation to a fully Bayesian analysis (e.g., this paper). In fact, if it’s appropriate to use a small number of sigfigs in your results, this approximation can yield the exact same results far faster. I don’t think it’s fair to call the labeling of such an analysis as Bayesian a lie.
I follow this reasoning and it applies in many cases. The reason I do not consider it applicable to the example given is due to the explicit mentioning of “We can make Z’s pre-evidence probability arbitrarily small, to make this seem reasonable at the time.” That changes the meaning of the example significantly in my understanding.
I claim that if Z is given enough consideration that ‘arbitrarily small’ is plugged in rather than mere exclusion from a model then it is just an error not an approximation. There are valid examples of bayes-in-practice that support the position John takes but I just don’t consider this example a fair representation. Partly because the mistake is a bad way to handle urns and partly because explicitly plugging in bad priors for Z should make you explicitly expect bad posteriors for Z. Exclusion from the model itself is a different problem.
Good answer. I neglected to read up-thread with enough thoroughness.
Good answer. I got a bit confused because Z has two meanings: “ball labelled Z was observed” (data), and “ball came from urn Z” (hypothesis). John’s model can assign zero probability to data that could possibly be observed, and that’s the big no-no.
In the example provided it would be by having the labels “A, B, C and Zooblefuzz” where Zooblefuzz is clearly defined as ‘any urn other than A, B or C’.
For context: Gelman is a Bayesian and Shalizi is an anti-Bayesian.
If Pr(ball labelled Z | urn) = 0 for all urns under consideration then Pr(ball labelled Z) = 0 too, so the model tries to evaluate 0 / 0 and crashes.
Tangent: I was a huge fan of Proofs and Refutations, which is about mathematics; is there a book of Lakatos’s on the philosophy of science you would recommend?
I liked Proofs and Refutations a lot too. However, I’m ashamed to admit I have no special knowledge of Lakatos. All I know about his philosophy of science stuff (which I believe is closely related) is from his Wikipedia page (and Feyerabend’s). Gelman’s slides made the analogy with Lakatos explicitly.
I’ve always thought it would be nice to have a “Frequentist-to-Bayesian” guide. Sort of a “Here’s some example problems, here’s how you might go about it doing frequentist methods, here’s how you might go about it using Bayesian techniques.” My introduction to statistics began with an AP course in high school (and I used this HyperStat source to help out), and of course they teach hypothesis testing and barely give a nod to Bayes’ Theorem.
What you’ve described is the “statistical hypothesis testing” technique, and yes, you’ve got it right. The only reason it functions at all is that by and large, people who use it aren’t stupid, and they know that they have to submit it to peer review to other people who aren’t stupid. Nevertheless, a lot of crap gets through, just because the approach is so wrong-headed. ETA: Oops! I left an important detail out of this response.
There are other techniques for frequentist statistics, e.g., unbiased estimators, minimum mean squared error estimators, method of moments, robust estimators, confidence intervals, confidence distributions, maximum likelihood, profile likelihood, empirical likelihood, empirical Bayes, estimating equations, PAC learning, etc., etc., ad nauseam.
The central difficulty of Bayesian statistics is the problem of choosing a prior: where did it come from, how is it justified? How can Bayesians ever make objective scientific statements, if all of their methods require an apparently arbitrary choice for a prior?
Frequentist statistics is the attempt to do probabilistic inference without using a prior. So, for example, the U-test Cyan linked to above makes a statement about whether two data sets could be drawn from the same distribution, without having to assume anything about what the distribution actually is.
That’s my understanding, anyway—I would also be happy to see a “Frequentism for Bayesians” post.
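For concreteness, a short sketch of the U-test mentioned above, using SciPy’s implementation; the two simulated samples are invented for the illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=30)   # sample from one distribution
y = rng.normal(loc=0.7, scale=1.0, size=30)   # sample from a shifted one

# The test uses only the ranks of the pooled data, so no prior and no
# distributional shape is assumed.
stat, p = mannwhitneyu(x, y, alternative="two-sided")
print(stat, p)
```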
Without acknowledging a prior.
Some frequentist techniques are strictly incoherent from a Bayesian point of view. In that case there is no prior.
I believe you and would like to know some examples for future reference.
The OP is one such—Bayesians aren’t permitted to ignore any part of the data except those which leave the likelihood unchanged. One classic example is that in some problems, a confidence interval procedure can return the whole real line. A mildly less pathological example also concerning a wacky confidence interval is here.
Yes; in Bayesian terms, many frequentist testing methods tend to implicitly assume a prior of 50% for the null hypothesis.
A prior gives you as much information as the mean of a distribution. So, can’t I by the same token accuse both frequentist and Bayesian statistics of attempting to do probabilistic inference without using a distribution?
I mean, the frequentist uses the U-test to ask whether 2 data sets could be drawn from the same distribution, without assuming what the mean of the distribution is. The Bayesian would use some other test, assuming a prior or perhaps a mean for the distribution, but not assuming a shape for the distribution. And some other, uninvented, and (by the standards of LW) superior statistical methodology would use another test, assuming a mean and a shape for the distribution.
No, not in general; it can give much more or much less. It depends entirely on how detailed you can make your prior. Expanding it out, e.g., as a series of central moments, can give you as detailed a shape as you want. It may reduce to knowing only the mean in certain very special inference problems. In other problems, you may know that the distribution is very definitely Cauchy (EDIT: which doesn’t even have a well-defined mean), but not know the parameters, and put some reasonable prior on them—flat for the center over some range, and approximately using a (1/x) improper prior for the width, perhaps cutting it off at physically relevant length scales.
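A rough sketch of the Cauchy case just described: a grid posterior that is flat in the location and proportional to 1/width over the grid. The simulated data, grid ranges, and truncation points are choices made for the sketch, not part of the comment.

```python
import numpy as np
from scipy.stats import cauchy

rng = np.random.default_rng(0)
data = 2.0 + 0.5 * rng.standard_cauchy(50)   # hypothetical data: centre 2.0, width 0.5

locs = np.linspace(-10.0, 10.0, 201)         # flat prior over this range
widths = np.linspace(0.01, 10.0, 200)        # (1/x) improper prior, truncated to the grid

log_prior = -np.log(widths)
log_like = cauchy.logpdf(
    data[:, None, None], locs[None, :, None], widths[None, None, :]
).sum(axis=0)                                # shape (len(locs), len(widths))

log_post = log_like + log_prior[None, :]
post = np.exp(log_post - log_post.max())
post /= post.sum()

i, j = np.unravel_index(post.argmax(), post.shape)
print(locs[i], widths[j])                    # posterior mode; should land near (2.0, 0.5)
```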
All that information can be encoded in the prior. The prior covers your probabilities over the space of your hypotheses, not a direct probabilistic encoding of what you think one sample will be.
No no no. That would be a hundred times saner than frequentism. What you actually do is take the real data e-12 and put it into a giant bin E that also contains e-1, e-3, and whatever else you can make up a plausible excuse to include or exclude, and then you calculate P(E|~H). This is one of the key points of flexibility that enables frequentists to get whatever answer they like, the other being the choice of control variables in multivariate analyses.
See e.g. this part of the article:

The authors used what’s called a Mann-Whitney U test, which, in simplified terms, aims to determine if two sets of data come from different distributions. The essential thing to know about this test is that it doesn’t depend on the actual data except insofar as those data determine the ranks of the data points when the two data sets are combined. That is, it throws away most of the data, in the sense that data sets that generate the same ranking are equivalent under the test.
This seems to use “frequentist” to mean “as statistics are actually practiced.” It is unreasonable to compare the implementation of A to the ideal form of B. In particular, the problem with the Mann-Whitney test seems to me to be that the authors looked up a recipe in a cookbook without understanding it, which they could have done just as easily in a Bayesian cookbook.
Can you elaborate on that?
Well, the blatant version would be to take 5 possible control variables and try all 32 possible omissions and inclusions to see if any of the combinations turns up “statistically significant”. This might look a little suspicious if you collected the data and then threw some of it away. If you were running regressions on an existing database with lots of potential control variables, why, they’ll just have to trust that you never secretly picked and chose.
Someone who did that might not be able to convince themselves they weren’t cheating… but someone who, somehow or other, got an idea of which variables would be most convenient to control for, might well find themselves influenced just a bit in that direction.
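A sketch of the fishing expedition being described, with data simulated so that there is no true effect at all; the variable names and sample sizes are invented, and statsmodels is used here only as a convenient OLS routine.

```python
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
treatment = rng.normal(size=n)
controls = rng.normal(size=(n, 5))       # 5 candidate control variables
y = rng.normal(size=n)                   # outcome unrelated to anything

p_values = {}
for k in range(6):                       # all 2^5 = 32 inclusion/omission patterns
    for subset in combinations(range(5), k):
        X = sm.add_constant(np.column_stack([treatment] + [controls[:, j] for j in subset]))
        res = sm.OLS(y, X).fit()
        p_values[subset] = res.pvalues[1]    # p-value on the treatment coefficient

# Reporting only the most favourable of the 32 specifications overstates the
# evidence; nothing in the final table reveals that the search happened.
print(min(p_values, key=p_values.get), min(p_values.values()))
```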
I don’t see how being a Bayesian gets you out of cherry-picking your causal structure from a large set. You still have to decide which variables are conditional on which other variables.
You put in all the variables, use a hierarchical structure for the prior, use a weakly informative hyperprior, and let the data sort itself out if it can. Key phrase: automatic relevance determination; David MacKay originated the term while doing Bayesian inference for neural nets.
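One concrete implementation in this spirit is scikit-learn’s ARDRegression, which puts a separate precision hyperprior on each coefficient. The comment above names the general idea (MacKay’s automatic relevance determination), not this particular library, so treat the sketch below, with its invented data, as an illustration under that assumption.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:2] = [3.0, -2.0]              # only 2 of the 10 variables matter
y = X @ true_coef + rng.normal(size=n)

ard = ARDRegression()                    # hierarchical prior: one precision per coefficient
ard.fit(X, y)
print(np.round(ard.coef_, 2))            # irrelevant coefficients are shrunk toward zero
```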
Is that a ‘were not’?
Is that not precisely the problem? Often, the H you are interested in is so vague (“there is some kind of effect in a certain direction”) that it is very difficult to estimate P(E|H), or even to define it.

OTOH, P(E|~H) is often very easy to compute from first principles, or to obtain through experiments (since conditions where “the effect” is not present are usually the most common).

Example: I have a coin. I want to know if it is “true” or “biased”. I flip it 100 times, and get 78 tails. Now how do I estimate the probability of obtaining this many tails, knowing that the coin is “biased”? How do I even express that analytically? By contrast, it is very easy to compute the probability of this sequence (or any other) with a “non-biased” coin.

So there you have it. The whole concept of “null hypotheses” is not a logical axiom, it simply derives from real-world observation: in the real world, for most of the H we are interested in, estimating P(E|~H) is easy, and estimating P(E|H) is either hard or impossible.
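For what it’s worth, the usual Bayesian answer to “how do I compute P(78 tails | biased)?” is to make “biased” concrete with a prior over the bias and integrate it out. The uniform Beta(1,1) prior in the sketch below is an assumption made for the example, not something the commenter specified.

```python
from scipy.stats import binom, betabinom

n, k = 100, 78
p_fair = binom.pmf(k, n, 0.5)            # P(78 tails | fair coin)
p_biased = betabinom.pmf(k, n, 1, 1)     # P(78 tails | bias ~ Beta(1,1)) = 1/(n+1)

print(p_fair, p_biased, p_biased / p_fair)   # the ratio is the Bayes factor for "biased"
```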
P(H) is silently set to .5. If you know P(E|~H), this makes P(E|H) unnecessary to compute the real quantity of interest, P(H|E) / P(~H|E). I think.
There needs to be a post specifically devoted to arguments of the form “It’s okay to do things wrong, because doing them right would be hard”. I’ve seen this so many times, in so many places, in so many subjects, that I have to conclude that people just don’t see what is wrong with it.
(No, I’m not talking about making simplifying assumptions or idealizations in models. More like presenting a collection of sometimes-useful ad-hoc tricks as a competing theory, which is then argued for as a theory against its competitors on the basis of its being “easier to apply”.)
Bayes’ Theorem says that P(H|E) = P(H)P(E|H)/P(E). That’s, like, the law. You don’t get to take P(E|H) out of the equation, or pretend it isn’t there, just because it’s difficult to estimate. As I’ve said elsewhere, if you have a belief, then you’ve done a Bayesian update—which means you have some assumption about each of those quantities appearing in the formula, whether you choose to confront these assumptions or not.
As a matter of fact, if you find P(E|H) overly difficult to estimate, that means your H isn’t paying its rent.
Not necessarily better. Just more convenient for the thumbs up/thumbs down way of looking at evidence that scientists tend to like.
It’s a convention. The point is to have a pre-agreed, low significance level so that testers can’t screw with the result of a test by arbitrarily jacking the significance level up (if they want to reject a hypothesis) or turning it down (if they don’t). The significance level has to be low to minimize the risk of a type I error.
The certainty level is effectively communicated via the significance level and p-value itself. (And the use of a reject vs. don’t reject dichotomy can be desirable if one wishes to decide between performing some action and not performing it based on some data.)
A frequentist can deal in likelihoods, for example by doing hypothesis tests of likelihood ratios. As for priors, a frequentist encapsulates them in parametric and sampling assumptions about the data. A Bayesian might give a low weight to a positive result from a parapsychology study because of their “low priors”, but a frequentist might complain about sampling procedures or cherrypicking being more likely than a true positive. As I see it, the two say essentially the same thing; the frequentist is just being more specific than the Bayesian.
No. P-values are not equivalent when they are calculated using different statistics, or even the same statistic but a different sample size. On the latter point see Royall, 1986.
I’d say the frequentist is using Bayesian reasoning informally; Jaynes discusses this exact problem from a Bayesian perspective at the beginning of Chapter 5 of his magnum opus.
Sorry. You are quite right, and I was sloppy. I had in mind the implicit idea that holding the choices of statistical test and data collection procedure constant, different p-values suggest how strongly one should reject the null hypothesis, and I should have made that explicit. It is absolutely true that if I just ask someone, “Test A gave me p = 0.008 and Test B gave me p = 0.4, which test’s null hypothesis is worse off?”, the correct answer is “how should I know?”
Yep. I think this is an example of the frequentist encapsulating what a Bayesian would call priors in their sampling assumptions.
I too would like to see a good explanation of frequentist techniques, especially one that also explains their relationships (if any) to Bayesian techniques.
Based on the tiny bit I know of both approaches, I think one appealing feature of frequentist techniques (which may or may not make up for their drawbacks) is that your initial assumptions are easier to dislodge the more wrong they are.
It seems to be the other way around with Bayesian techniques because of a stronger built-in assumption that your assumptions are justified. You can immunize yourself against any particular evidence by having a sufficiently wrong prior.
EDIT: Grammar
But you won’t be able to convince other Bayesians who don’t share that radically wrong prior. Similarly, there doesn’t seem to be something intrinsic to frequentism that keeps you from being persistently wrong. Rather, frequentists are kept in line because, as Cyan said, they have to persuade each other. Fortunately, for Bayesians and frequentists alike, a technique’s being persuasive to the community correlates with its being liable to produce less wrong answers.
The ability to get a bad result because of a sufficiently wrong prior is not a flaw in Bayesian statistics; it is a flaw in our ability to perform Bayesian statistics. Humans tend to be overconfident about probabilities with very low or very high values. As such, the proper way to formulate a prior is to imagine hypothetical results that will bring the probability into a manageable range, ask yourself what you would want your posterior to be in such cases, and build your prior from that. These hypothetical results must be constructed and analyzed before the actual result is obtained, to eliminate bias. As Tyrrell said, the ability of a wrong prior to result in a bad conclusion is a strength, because other Bayesians will be able to see where you went wrong by disputing the prior.
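A minimal sketch of the “imagine hypothetical results first” recipe just described: decide what posterior you would accept after a hypothetical result, then back the prior out through Bayes’ rule in odds form. The numbers are invented.

```python
def prior_from_target(target_posterior, likelihood_ratio):
    """Prior implied by wanting `target_posterior` after evidence with this likelihood ratio."""
    post_odds = target_posterior / (1 - target_posterior)
    prior_odds = post_odds / likelihood_ratio
    return prior_odds / (1 + prior_odds)

# "If a hypothetical study with a 20:1 likelihood ratio came out positive, I'd
# want to end up at 50% belief" -> the prior that commitment implies:
print(prior_from_target(0.5, 20.0))      # about 0.048
```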
Someone correct me if I’m wrong here, but I don’t think even having a strong prior P(H) against the evidence is much help, because that makes your likelihood ratio on the evidence P(E|H)/P(E|~H) that much stronger.
(This issue is one of my stumbling blocks in Bayescraft.)
The likelihood ratio P(E|H)/P(E|~H) is entirely independent of the prior P(H)
Or did I misunderstand what you said?
In theory, yes, but we’re talking about a purported “unswayable Bayesian”. If someone strongly believes leprechauns don’t exist (low P(H), where H is “leprechauns exist”), they should strongly expect not to see evidence of leprechauns (low P(E|~H), where E is direct evidence of leprechauns, like finding one in the forest), which suggests a high likelihood ratio P(E|H)/P(E|~H).
I remember Eliezer Yudkowsky referring to typical conversations that go like:
Non-rationalist: “I don’t think there will ever be an artificial general intelligence, because my religion says that can’t happen.”
EY: “So if I showed you one, that means you’d leave your religion?”
He did mention pulling that off once, but I don’t believe he said it was typical.
Thanks, that was what I had in mind.
I’m not entirely sure I understand your point. The example you’re citing is more the guy saying “I believe X, and X implies ~Y, therefore ~Y”, so Eliezer is saying “So Y implies ~X then?”
But the “X implies ~Y” belief can happen when one has low belief in X or high belief in X.
Or are you saying “the likelihoods assigned led to past interpretation of analogous (lack of) evidence, and that’s why the current prior is what it is”?
komponisto nailed the intuition I was going from: the likelihood ratio is independent of the prior, but an unswayable Bayesian fixes P(E), forcing extreme priors to have extreme likelihood ratios.
*blinks* I think I’m extra confused. The law of conservation of probability is basically just saying that the change in belief may be large or small, so evidence may be strong or weak in that sense. But that doesn’t leave the likelihoods up for grabs (well, okay, P(E|~H) could depend on how you distribute your belief over the space of hypotheses other than H, but… I’m not sure that was your point).
Okay, point conceded … that still doesn’t generate a result that matches the intuition I had. I need to spend more time on this to figure out what assumptions I’m relying on to claim that “extremely wrong beliefs force quick updates”.
Remember, though, that even fixing both P(E) and P(H), you can still make the ratio P(E|H)/P(E|~H) anything you want; the equation
a = bx + (1-b)(cx)
is guaranteed to have a solution for any a,b,c.
P(E) = P(E|H) P(H) + P(E|~H)P(~H)
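A quick check of the algebra above, reading a as P(E), b as P(H), x as P(E|H), and cx as P(E|~H); that mapping is an editorial reading of the comment, and the chosen values of P(E) and P(H) are arbitrary. Note that the solutions must also land in [0, 1] to be legitimate probabilities, which they do for these numbers.

```python
def likelihoods_for_ratio(p_e, p_h, r):
    """Solve P(E) = P(E|H)P(H) + P(E|~H)P(~H) for the likelihoods, given the ratio r = P(E|H)/P(E|~H)."""
    p_e_given_h = p_e / (p_h + (1 - p_h) / r)
    return p_e_given_h, p_e_given_h / r

for r in (0.1, 1.0, 10.0, 1000.0):
    p_eh, p_enh = likelihoods_for_ratio(p_e=0.01, p_h=0.01, r=r)
    print(r, p_eh, p_enh, p_eh * 0.01 + p_enh * 0.99)   # last column reproduces P(E) = 0.01
```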
The quantities P(H), P(E|H), and P(E|~H) are in general independent of each other, in the sense that you can move any one of them without changing the others—provided you adjust P(E) accordingly.
Thanks, that helps. See how I apply that point in my reply to Psy-Kosh here.
Well, P(E|H) is actually pretty easy to calculate under a frequentist framework. That’s the basis of power analysis, a topic covered in any good intro stat course. The real missing ingredient, as you point out, is P(H).
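A small sketch of that point: for a one-sample, one-sided z-test with known unit variance, P(E|H), with E meaning “the test rejects” and H meaning “the true effect is d”, is just the power of the test. The effect size, sample size, and alpha below are invented.

```python
from scipy.stats import norm

def power(d, n, alpha=0.05):
    z_crit = norm.ppf(1 - alpha)             # rejection threshold in z units
    return norm.sf(z_crit - d * n ** 0.5)    # P(test statistic clears the threshold | effect d)

print(power(d=0.3, n=50))                    # roughly 0.68
```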
I’m not fully fluent in Bayesian statistics, so while I’m on the topic I have a question: do Bayesian methods involve any decision making? In other words, once we’ve calculated P(H|E), do we just leave it at that? No criteria to decide on, just revising of probabilities?
This is my current understanding, but it just seems so contrary to everyday human reasoning. What we would really like to say at the end of the day (or, rather, research program) is something like “Aha! Given the accumulated evidence, we can now cease replication. Hypothesis X must be true.” Being humans, we want to make a decision. But decision making necessarily involves the ultimately arbitrary choice of where to set the criterion. Is this anti-Bayesian?
The formal decision-making machinery involves picking a loss function and minimizing posterior expected loss.
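A minimal sketch of that machinery: a posterior over two hypotheses, a loss table over two actions, and the action chosen by minimizing posterior expected loss. All numbers are invented.

```python
import numpy as np

posterior = np.array([0.8, 0.2])         # P(H | data), P(~H | data)
loss = np.array([
    [0.0, 10.0],                         # act as if H:  loss if H, loss if ~H
    [1.0,  0.0],                         # act as if ~H: loss if H, loss if ~H
])

expected_loss = loss @ posterior         # expected loss of each action under the posterior
best = int(np.argmin(expected_loss))
print(expected_loss, "choose action", best)   # the costly false alarm pushes toward action 1
```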
Okay, but is it a part of the typical Bayesian routine to wield formal decision theory, or do we just calculate P(H|E) and call it a day?
I don’t think formal decision theory is common in applied Bayesian stats in science; the only paper I can quickly recall that did a decision analysis is Andrew Gelman’s radon remediation study. Maybe econometrics is different, since it’s a lot easier to define losses in that context.
Seconded.