Hi Less Wrong. I found a link to this site a year or so ago and have been lurking off and on since. However, I’ve self-identified as a rationalist since around junior high school. My parents weren’t religious and I was good at math and science, so it was natural to me to look to science and logic to solve everything. Many years later I realize that this is harder than I hoped.
Anyway, I’ve read many of the sequences and posts, generally agreeing and finding many interesting thoughts. It’s fun reading about zombies and Newcomb’s problem and the like.
I guess this sounds heretical, but I don’t understand why Bayes theorem is placed on such a pedestal here. I understand Bayesian statistics, intuitively and also technically. Bayesian statistics is great for a lot of problems, but I don’t see it as always superior to thinking inspired by the traditional scientific method. More specifically, I would say that coming up with a prior distribution and updating can easily be harder than the problem at hand.
I assume the point is that there is more to what is considered Bayesian thinking than Bayes theorem and Bayesian statistics, and I’ve reread some of the articles with the idea of trying to pin that down, but I’ve found that difficult. The closest I’ve come is that examining what your priors are helps you to keep an open mind.
Bayes theorem is just one of many mathematical equations, like, for example, the Pythagorean theorem. There is inherently nothing magical about it.
It just happens to explain one problem with the current scientific publishing process: neglecting base rates. Which sometimes seems like this: “I designed an experiment that would prove a false hypothesis only with probability p = 0.05. My experiment has succeeded. Please publish my paper in your journal!”
(I guess I am exaggerating a bit here, but many people ‘doing science’ would not understand immediately what is wrong with this. And that would be those who even bother to calculate the p-value. Not everyone who is employed as a scientist is necessarily good at math. Many people get paid for doing bad science.)
This kind of thinking has the following problem: Even if you invent a hundred completely stupid hypotheses, if you design experiments that would prove a false hypothesis only with p = 0.05, that means about five of them would be proved by the experiment. If you show someone else all hundred experiments together, they may understand what is wrong. But you are more likely to send only the five successful ones to the journal, aren’t you? -- But how exactly is the journal supposed to react to this? Should they ask: “Did you do many other experiments, even ones completely irrelevant to this specific hypothesis? Because, you know, that somehow undermines the credibility of this one.”
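To make that arithmetic concrete, here is a minimal simulation sketch; the two-sample t-test, the group size of 50, and the count of 100 hypotheses are arbitrary choices for illustration, not anything implied by the comment above.

```python
# 100 hypotheses that are all false by construction, each tested at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
false_positives = 0
for _ in range(100):                 # 100 completely stupid hypotheses
    treated = rng.normal(0, 1, 50)   # "treatment" group, no real effect
    control = rng.normal(0, 1, 50)   # control group, same distribution
    _, p = stats.ttest_ind(treated, control)
    if p < alpha:
        false_positives += 1
print(false_positives)  # around 5: the "successful" experiments one could send to a journal
```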
The current scientific publishing process has a bias. Bayes theorem explains it. We care about science, and we care about science being done correctly.
It just happens to explain one problem with the current scientific publishing process: neglecting base rates. Which sometimes seems like this: “I designed an experiment that would prove a false hypothesis only with probability p = 0.05. My experiment has succeeded. Please publish my paper in your journal!”
That’s not neglecting base rates, that’s called selection bias combined with incentives to publish. Bayes theorem isn’t going to help you with this.

http://xkcd.com/882/
Uhm, it’s similar, but not the same.

If I understand it correctly, selection bias is when 20 researchers run an experiment with green jelly beans, 19 of them don’t find a significant correlation, 1 of them finds it… and only the 1 publishes, and the 19 don’t. The essence is that we had 19 pieces of evidence against the green jelly beans, only 1 piece of evidence for the green jelly beans, but we don’t see those 19 pieces, because they are not published. Selection = “there is X and Y, but we don’t see Y, because it was filtered out by the process that gives us information”.
But imagine that you are the first researcher ever who has researched the jelly beans. And you only did one experiment. And it happened to succeed. Where is the selection here? (Perhaps selection across Everett branches or Tegmark universes. But we can’t blame the scientific publishing process for not giving us information from the parallel universes, can we?)
In this case, base rate neglect means ignoring the fact that “if you take a random thing, the probability that this specific thing causes acne is very low”. Therefore, even if the experiment shows a connection with p = 0.05, it’s still more likely that the result just happened randomly.
The proper reasoning could be something like this (all numbers pulled out of a hat) -- we already have pretty strong evidence that acne is caused by food; let’s say there is a 50% probability for this. With enough specificity (giving each fruit a different category, etc.), there are maybe 2000 categories of food. It is possible that more than one of them causes acne, and our probability distribution for that is… something. Considering all this information, we estimate a prior probability of, let’s say, 0.0004 that a random food causes acne. -- Which means that if the correlation is significant at level p = 0.05, that per se means almost nothing. (Here one could use Bayes theorem to calculate that the p = 0.05 successful experiment shows the true cause of acne with a probability of circa 1%.) We would need to tighten the threshold to p = 0.0004 just to get a 50% chance of being right. How can we do that? We should use a much larger sample, or we should repeat the experiment many times, record all the successes and failures, and do a meta-analysis.
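A short sketch of that Bayes calculation; the prior of 0.0004 and the assumption that the experiment always flags a true cause (power = 1) are the hat-pulled numbers from above, not real estimates.

```python
def posterior(prior, alpha, power=1.0):
    """P(this food really causes acne | the experiment came out significant)."""
    return prior * power / (prior * power + (1 - prior) * alpha)

print(posterior(prior=0.0004, alpha=0.05))     # ~0.008, i.e. circa 1%
print(posterior(prior=0.0004, alpha=0.0004))   # ~0.5, the 50% chance mentioned above
```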
But imagine that you are the first researcher ever who has researched the jelly beans. And you only did one experiment. And it happened to succeed. Where is the selection here?
That’s a different case—you have no selection bias here, but your conclusions are still uncertain—if you pick p=0.05 as your threshold, you’re clearly accepting that there is a 5% chance of a Type I error: the green jelly beans did nothing, but the noise happened to be such that you interpreted it as conclusive evidence in favor of your hypothesis.
But that all is fine—the readers of scientific papers are expected to understand that results significant to p=0.05 will be wrong around 5% of the times, more or less (not exactly because the usual test measures P(D|H), the probability of the observed data given the (null) hypothesis while you really want P(H|D), the probability of the hypothesis given the data).
base rate neglect means ignoring the fact that “if you take a random thing, the probability that this specific thing causes acne is very low”
People rarely take entirely random things and test them for causal connection to acne. Notice how you had to do a great deal of handwaving in establishing your prior (aka the base rate).
As an exercise, try to be specific. For example, let’s say I want to check if the tincture made from the bark of a certain tree helps with acne. How would I go about calculating my base rate / prior? Can you walk me through an estimation which will end with a specific number?
the readers of scientific papers are expected to understand that results significant to p=0.05 will be wrong around 5% of the times, more or less
And this is the base rate neglect. It’s not “results significant to p=0.05 will be wrong about 5% of the time”. It’s “wrong results will be significant to p=0.05 about 5% of the time”. And most people will confuse these two things.
It’s like when people confuse “A ⇒ B” with “B ⇒ A”, only this time it is “A ⇒ B (p=0.05)” with “B ⇒ A (p=0.05)”. It is “if wrong, then in 5% significant”. It is not “if significant, then in 5% wrong”.
Notice how you had to do a great deal of handwaving in establishing your prior (aka the base rate).
Yes, you are right. Establishing the prior is pretty difficult, perhaps impossible. (But that does not make “A ⇒ B” equal to “B ⇒ A”.) Probably the reasonable thing to do would be simply to impose strict limits in areas where many results were proved wrong.
Probably the reasonable thing to do would be simply to impose strict limits in areas where many results were proved wrong.
Um, what “strict limits” are you talking about, what will they look like, and who will be doing the imposing?
To get back to my example, let’s say I’m running experiments to check if the tincture made from the bark of a certain tree helps with acne—what strict limits would you like?
p = 0.001, and if at the end of the year too many results fail to replicate, keep decreasing. (Let’s say that “fail to replicate” in this context means that the replication attempt cannot prove it even with p = 0.05 -- we don’t want to make replications too expensive, just a simple sanity check.)
let’s say I’m running experiments to check if the tincture made from the bark of a certain tree helps with acne—what strict limits would you like?
a long answer would involve a lot of handwaving again (it depends on why you believe the bark is helpful; in other words, what other evidence you already have)

a short answer: for example, p = 0.001
Well, and what’s magical about this particular number? Why not p=0.01? why not p=0.0001? Confidence thresholds are arbitrary, do you have a compelling argument why any particular one is better than the rest?
Besides, you’re forgetting the costs. Assume that the reported p-values are true (and not the result of selection bias, etc.). Take a hundred papers which claim results at p=0.05. At the asymptote about 95 of them will turn out to be correct and about 5 will turn out to be false. By your strict criteria you’re rejecting all of them—you’re rejecting 95 correct papers. There is a cost to that, is there not?
Lumifer, please update that at this moment you don’t grok the difference between “A ⇒ B (p=0.05)” and “B ⇒ A (p = 0.05)”, which is why you don’t understand what p-value really means, which is why you don’t understand the difference between selection bias and base rate neglect, which is probably why the emphasis on using Bayes theorem in scientific process does not make sense to you. You made a mistake, that happens to all of us. Just stop it already, please.
And don’t feel bad about it. Until recently I didn’t understand it either, and I had a gold medal from the international mathematical olympiad. Somehow it is not explained correctly at most schools, perhaps because the teachers don’t get it themselves, or maybe they just underestimate the difficulty of proper understanding and the high chance of getting it wrong. So please don’t contribute to the confusion.
Imagine that there are 1000 possible hypotheses, among which 999 are wrong, and 1 is correct. (That’s just a random example to illustrate the concept. The numbers in real life can be different.) You have an experiment that says “yes” to 5% of the wrong hypotheses (this is what p=0.05 means), and also to the correct hypothesis. So at the end, you have 50 wrong hypotheses and 1 correct hypothesis confirmed by the experiment. So in the journal, 98% of the published articles would be wrong, not 5%. It is “wrong ⇒ confirmed (p=0.05)”, not “confirmed ⇒ wrong (p=0.05)”.
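The same example as explicit arithmetic; the 999/1 split and the assumption that the one correct hypothesis also passes its experiment are the illustrative numbers from the comment, not measured rates.

```python
wrong, correct = 999, 1
alpha = 0.05                        # the experiment says "yes" to 5% of wrong hypotheses
false_confirmed = wrong * alpha     # ~50 wrong hypotheses get "confirmed"
true_confirmed = correct * 1.0      # assume the one correct hypothesis passes too
share_wrong = false_confirmed / (false_confirmed + true_confirmed)
print(round(share_wrong, 2))        # 0.98: about 98% of "confirmed" results are wrong
```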
LOL. Yeah, yeah, mea culpa, I had a brain fart and expressed myself very poorly.
I do understand what p-value really means. The issue was that I had in mind a specific scenario (where in effect you’re trying to see if the difference in means between two groups is significant) but neglected to mention it in the post :-)
Lumifer, please update that at this moment you don’t grok the difference between “A ⇒ B (p=0.05)” and “B ⇒ A (p = 0.05)”, which is why you don’t understand what p-value really means, which is why you don’t understand the difference between selection bias and base rate neglect, which is probably why the emphasis on using Bayes theorem in scientific process does not make sense to you. You made a mistake, that happens to all of us. Just stop it already, please.
I feel like this could use a bit longer explanation, especially since I think you’re not hearing Lumifer’s point, so let me give it a shot. (I’m not sure I see a meaningful difference between base rate neglect and selection bias in this circumstance.)
The word “grok” in Viliam_Bur’s comment is really important. This part of the grandparent is true:
Assume that the reported p-values are true (and not the result of selection bias, etc.). Take a hundred papers which claim results at p=0.05. At the asymptote about 95 of them will turn out to be correct and about 5 will turn out to be false.
But it’s like saying “well, assume the diagnosis is correct. Then the treatment will make the patient better with high probability.” While true, it’s totally out of touch with reality- we can’t assume the diagnosis is correct, and a huge part of being a doctor is responding correctly to that uncertainty.
Earlier, Lumifer said this, which is an almost correct explanation of using Bayes in this situation:
But that all is fine—the readers of scientific papers are expected to understand that results significant to p=0.05 will be wrong around 5% of the times, more or less (not exactly because the usual test measures P(D|H), the probability of the observed data given the (null) hypothesis while you really want P(H|D), the probability of the hypothesis given the data).
The part that makes it the “almost” is the “5% of the times, more or less.” This implies that it’s centered around 5%, with random chance determining what this instance is. But selection bias means it will almost certainly be more, and generally much more. In fields that study phenomena that don’t exist, 100% of the published papers will be false results that were significant by chance. In many real fields, rates of failure to replicate are around 30%. Describing 30% as “5%, more or less” seems odd, to say the least.
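One rough way to see why the false share tracks the base rate rather than the threshold; the statistical power of 0.8 is an assumed, typical-ish value, not taken from any particular field.

```python
def false_share_of_positives(base_rate, alpha=0.05, power=0.8):
    """Fraction of significant results that are false, given the base rate of true hypotheses."""
    false_pos = (1 - base_rate) * alpha   # false hypotheses that pass anyway
    true_pos = base_rate * power          # true hypotheses that pass
    return false_pos / (false_pos + true_pos)

for base_rate in (0.0, 0.1, 0.5):
    print(base_rate, round(false_share_of_positives(base_rate), 2))
# 0.0 -> 1.0  (a field studying phenomena that don't exist: 100% false)
# 0.1 -> 0.36 (in the ballpark of ~30% replication failure)
# 0.5 -> 0.06 (only here is "5%, more or less" roughly right)
```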
But the proposal to reduce the p value doesn’t solve the underlying problem (which was Lumifer’s response). If we set the p value threshold lower, at .01 or .001 or wherever, we reduce the risk of false positives at the cost of increasing the risk of false negatives. A study design which needs to detect an effect at the .001 level is much more expensive than a study design which needs to detect an effect at the .05 level, and so we will have many fewer studies attempted, and many, many fewer published studies.
Better to drop p entirely. Notice that stricter p thresholds go in the opposite direction as the publication of negative results, which is the real solution to the problem of selection bias. By calling for stricter p thresholds, you implicitly assume that p is a worthwhile metric, when what we really want is publication of negative results and more replications.
But it’s like saying “well, assume the diagnosis is correct. Then the treatment will make the patient better with high probability.” While true, it’s totally out of touch with reality
My grandparent post was stupid, but what I had in mind was basically a stage-2 (or −3) drug trial situation. You have declared (at least to the FDA) that you’re running a trial, so selection bias does not apply at this stage. You have two groups, one receives the experimental drug, one receives a placebo. Assume a double-blind randomized scenario and assume there is a measurable metric of improvement at the end of the trial.
After the trial you have two groups with two empirical distributions of the metric of choice. The question is how confident you are that these two distributions are different.
Better to drop p entirely.
Well, as usual it’s complicated. Yes, the p-test is suboptimal in most situations where it’s used in reality. However, it fulfils a need, and if you drop the test entirely you need a replacement, because the need won’t go away.
Assume that the reported p-values are true (and not the result of selection bias, etc.). Take a hundred papers which claim results at p=0.05. At the asymptote about 95 of them will turn out to be correct...
That’s not how p-values work. p=0.05 doesn’t mean that the hypothesis is 95% likely to be correct, even in principle; it means that there’s a 5% chance of seeing the same correlation if the null hypothesis is true. Pull a hundred independent data sets and we’d normally expect to find a p=0.05 correlation or better in at least five or so of them, no matter whether we’re testing, say, an association of cancer risk with smoking or with overuse of the word “muskellunge”.
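A quick synthetic check of that reading of p = 0.05: generate data where the null hypothesis is true by construction and count how often a “significant” correlation appears anyway. The sample size of 40 and the 1000 repetitions are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
hits = 0
trials = 1000
for _ in range(trials):              # 1000 independent data sets, no real association
    x = rng.normal(size=40)          # e.g. overuse of the word "muskellunge"
    y = rng.normal(size=40)          # cancer risk, unrelated by construction
    _, p = stats.pearsonr(x, y)
    if p < 0.05:
        hits += 1
print(hits / trials)  # ~0.05, no matter what the variables are called
```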
This distinction’s especially important to keep in mind in an environment where running replications is relatively low-status or where negative results tend to be quietly shelved—both of which, as it happens, hold true in large chunks of academia. But even if this weren’t the case, we’d normally expect replication rates to be less than one minus the claimed p-value, simply because there are many more promising ideas than true ones and some of those will turn up false positives.
Take a hundred papers which claim results at p=0.05. At the asymptote about 95 of them will turn out to be correct and about 5 will turn out to be false.
No, they won’t. You’re committing base rate neglect. It’s entirely possible for people to publish 2000 papers in a field where there’s no hope of finding a true result, and get 100 false results with p < 0.05.
I guess this sounds heretical, but I don’t understand why Bayes theorem is placed on such a pedestal here. I understand Bayesian statistics, intuitively and also technically. Bayesian statistics is great for a lot of problems, but I don’t see it as always superior to thinking inspired by the traditional scientific method.
I know a few answers to this question, and I’m sure there are others. (As an aside, these foundational questions are, in my opinion, really important to ask and answer.)
What separates scientific thought and mysticism is that scientists are okay with mystery. If you can stand to not know what something is, to be confused, then after careful observation and thought you might have a better idea of what it is and have a bit more clarity. Bayes is the quantitative heart of the qualitative approach of tracking many hypotheses and checking how concordant they are with reality, and thus should feature heavily in a modern epistemic approach. The more precisely and accurately you can deal with uncertainty, the better off you are in an uncertain world.
What separates Bayes and the “traditional scientific method” (using scare quotes to signify that I’m highlighting a negative impression of it) is that the TSM is a method for avoiding bad beliefs but Bayes is a method for finding the best available beliefs. In many uncertain situations, you can use Bayes but you can’t use the TSM (or it would be too costly to do so), but the TSM doesn’t give any predictions in those cases!
Use of Bayes focuses attention on base rates, alternate hypotheses, and likelihood ratios, which people often ignore (replacing the first with maxent, the second with yes/no thinking, and the third with likelihoods).
I honestly don’t think the quantitative aspect of priors and updating is that important, compared to the search for a ‘complete’ hypothesis set and the search for cheap experiments that have high likelihood ratios (little bets).
I think that the qualitative side of Bayes is super important but don’t think we’ve found a good way to communicate it yet. That’s an active area of research, though, and in particular I’d love to hear your thoughts on those four answers.
Unfortunately, the end of that sentence is still true:
but [I] don’t think we’ve found a good way to communicate it yet.
I think that What Bayesianism Taught Me is a good discussion on the subject, and my comment there explains some of the components I think are part of qualitative Bayes.
I think that a lot of qualitative Bayes is incorporating the insights of the Bayesian approach into your System 1 thinking (i.e. habits on the 5 second level).
Well, yes, but most of the things there are just useful ways to think about probabilities and uncertainty, proper habits, things to check, etc. Why Bayes? He’s not a saint whose name is needed to bless a collection of good statistical practices.
It’s more or less the same reason people call a variety of essentialist positions ‘platonism’ or ‘aristotelianism’. Those aren’t the only thinkers to have had views in this neighborhood, but they predated or helped inspire most of the others, and the concepts have become pretty firmly glued together. Similarly, the phrases ‘Bayes’ theorem’ and ‘Bayesian interpretation of probability’ (whence, jointly, the idea of Bayesian inference) have firmly cemented the name Bayes to the idea of quantifying psychological uncertainty and correctly updating on the evidence. The Bayesian interpretation is what links these theorems to actual practice.
Bayes himself may not have been a ‘Bayesian’ in the modern sense, just as Plato wasn’t a ‘platonist’ as most people use the term today. But the names have stuck, and ‘Laplacian’ or ‘Ramseyan’ wouldn’t have quite the same ring.
If I were to pretend that I’m a mainstream frequentist and consider “quantifying psychological uncertainty” to be subjective mumbo-jumbo with no place anywhere near real science :-D I would NOT have serious disagreements with e.g. Vaniver’s list. Sure, I would quibble about accents, importances, and priorities, but there’s nothing there that would be unacceptable from the mainstream point of view.
My biggest concern with the label ‘Bayesianism’ isn’t that it’s named after the Reverend, nor that it’s too mainstream. It’s that it’s really ambiguous.
For example, when Yvain speaks of philosophical Bayesianism, he means something extremely modest—the idea that we can successfully model the world without certainty. This view he contrasts, not with frequentism, but with Aristotelianism (‘we need certainty to successfully model the world, but luckily we have certainty’) and Anton-Wilsonism (‘we need certainty to successfully model the world, but we lack certainty’). Frequentism isn’t this view’s foil, and this philosophical Bayesianism doesn’t have any respectable rivals, though it certainly sees plenty of assaults from confused philosophers, anthropologists, and poets.
If frequentism and Bayesianism are just two ways of defining a word, then there’s no substantive disagreement between them. Likewise, if they’re just two different ways of doing statistics, then it’s not clear that any philosophical disagreement is at work; I might not do Bayesian statistics because I lack skill with R, or because I’ve never heard about it, or because it’s not the norm in my department.
There’s a substantive disagreement if Bayesianism means ‘it would be useful to use more Bayesian statistics in science’, and if frequentism means ‘no it wouldn’t!‘. But this methodological Bayesianism is distinct from Yvain’s philosophical Bayesianism, and both of those are distinct from what we might call ‘Bayesian rationalism’, the suite of mantras, heuristics, and exercises rationalists use to improve their probabilistic reasoning. (Or the community that deems such practices useful.) Viewing the latter as an ideology or philosophy is probably a bad idea, since the question of which of these tricks are useful should be relatively easy to answer empirically.
Err, actually, yes it is. The frequentist interpretation of probability makes the claim that probability theory can only be used in situations involving large numbers of repeatable trials, or selection from a large population. William Feller:
There is no place in our system for speculations concerning the probability that the sun will rise tomorrow. Before speaking of it we should have to agree on an (idealized) model which would presumably run along the lines “out of infinitely many worlds one is selected at random...” Little imagination is required to construct such a model, but it appears both uninteresting and meaningless.
Or to quote from the essay that coined the term frequentist:
The essential distinction between the frequentists and the [Bayesians] is, I think, that the former, in an effort to avoid anything savouring of matters of opinion, seek to define probability in terms of the objective properties of a population, real or hypothetical, whereas the latter do not.
Frequentism is only relevant to epistemological debates in a negative sense: unlike Aristotelianism and Anton-Wilsonism, which both present their own theories of epistemology, frequentism’s relevance is almost only in claiming that Bayesianism is wrong. (Frequentism separately presents much more complicated and less obviously wrong claims within statistics and probability; these are not relevant, given that frequentism’s sole relevance to epistemology is its claim that no theory of statistics and probability could be a suitable basis for an epistemology, since there are many events they simply don’t apply to.)
(I agree that it would be useful to separate out the three versions of Bayesianism, whose claims, while related, do not need to all be true or false at the same time. However, all three are substantively opposed to one or both of the views labelled frequentist.)
Err, actually, yes it is. The frequentist interpretation of probability makes the claim that probability theory can only be used in situations involving large numbers of repeatable trials, or selection from a large population.
Depends which frequentist you ask. From Aris Spanos’s “A frequentist interpretation of probability for model-based inductive inference”:

It is argued that the proposed frequentist interpretation, not only achieves this objective, but contrary to the conventional wisdom, the charges of ‘circularity’, its inability to assign probabilities to ‘single events’, and its reliance on ‘random samples’ are shown to be unfounded.

and

The error statistical perspective identifies the probability of an event A—viewed in the context of a statistical model M_θ(x), x ∈ R^n_X—with the limit of its relative frequency of occurrence by invoking the SLLN. This frequentist interpretation is defended against the charges of [i] ‘circularity’ and [ii] inability to assign ‘single event’ probabilities, by showing that in model-based induction the defining characteristic of the long-run metaphor is neither its temporal nor its physical dimension, but its repeatability (in principle) which renders it operational in practice.
For those who can’t access that through the paywall (I can), his presentation slides for it are here. I would hate to have been in the audience for the presentation, but the upside of that is that they pretty much make sense on their own, being just a compressed version of the paper.
I am not enough of a statistician to make any quick assessment of these, but they look like useful reading for anyone thinking about the foundations of uncertain inference.
The frequentist interpretation of probability makes the claim that probability theory can only be used in situations involving large numbers of repeatable trials
I don’t understand what this “probability theory can only be used...” claim means. Are they saying that if you try to use probability theory to model anything else, your pencil will catch fire? Are they saying that if you model beliefs probabilistically, Math breaks? I need this claim to be unpacked. What do frequentists think is true about non-linguistic reality, that Bayesians deny?
I don’t understand what this “probability theory can only be used...” claim means. Are they saying that if you try to use probability theory to model anything else, your pencil will catch fire? Are they saying that if you model beliefs probabilistically, Math breaks?
I think they would be most likely to describe it as a category error. If you try to use probability theory outside the constraints within which they consider it applicable, they’d attest that you’d produce no meaningful knowledge and accomplish nothing but confusing yourself.
Can you walk me through where this error arises? Suppose I have a function whose arguments are the elements of a set S, whose values are real numbers between 0 and 1, and whose values sum to 1. Is the idea that if I treat anything in the physical world other than objects’ or events’ memberships in physical sequences of events or heaps of objects as modeling such a set, the conclusions I draw will be useless noise? Or is there something about the word ‘probability’ that makes special errors occur independently of the formal features of sample spaces?
Do you have any links to this argument? I’m having a hard time seeing why any mainstream scientist who thinks beliefs exist at all would think they’re ineffable....
The frequentist interpretation of probability makes the claim that probability theory can only be used in situations involving large numbers of repeatable trials, or selection from a large population.
Yes, but frequentists have zero problems with hypothetical trials or populations.
Do note that for most well-specified statistical problems the Bayesians and the frequentists will come to the same conclusions. Differently expressed, likely, but not contradicting each other.
For example, when Yvain speaks of philosophical Bayesianism, he means something extremely modest...
Yes, it is my understanding that epistemologists usually call the set of ideas Yvain is referring to “probabilism” and indeed, it is far more vague and modest than what they call Bayesianism (which is more vague and modest still than the subjectively-objective Bayesianism that is affirmed often around these parts).
If frequentism and Bayesianism are just two ways of defining a word, then there’s no substantive disagreement between them. Likewise, if they’re just two different ways of doing statistics, then it’s not clear that any philosophical disagreement is at work; I might not do Bayesian statistics because I lack skill with R, or because I’ve never heard about it, or because it’s not the norm in my department.
BTW, I think this is precisely what Carnap was on about with his distinction between probability-1 and probability-2, neither of which did he think we should adopt to the exclusion of the other.
I would NOT have serious disagreements with e.g. Vaniver’s list.
I think they would have significant practical disagreement with #3, given the widespread use of NHST, but clever frequentists are as quick as anyone else to point out that NHST doesn’t actually do what its users want it to do.
Sure, I would quibble about accents, importances, and priorities, but there’s nothing there that would be unacceptable from the mainstream point of view.
Hence the importance of the qualifier ‘qualitative’; it seems to me that accents, importances, and priorities are worth discussing, especially if you’re interested in changing System 1 thinking instead of System 2 thinking. The mainstream frequentist thinks that base rate neglect is a mistake, but the Bayesian both thinks that base rate neglect is a mistake and has organized his language to make that mistake obvious when it occurs. If you take revealed preferences seriously, it looks like the frequentist says base rate neglect is a mistake but the Bayesian lives that base rate neglect is a mistake.
Now, why Bayes specifically? I would be happy to point to Laplace instead of Bayes, personally, since Laplace seems to have been way smarter and a superior rationalist. But the trouble with naming methods of “thinking correctly” is that everyone wants to name their method “thinking correctly,” and so you rapidly trip over each other. “Rationalism,” for example, refers to a particular philosophical position which is very different from the modal position here at LW. Bayes is useful as a marker, but it is not necessary to come to those insights by way of Bayes.
(I will also note that not disagreeing with something and discovering something are very different thresholds. If someone has a perspective which allows them to generate novel, correct insights, that perspective is much more powerful than one which merely serves to verify that insights are correct.)
Yeah, I said if I were to pretend to be a frequentist—but that didn’t involve suddenly becoming dumb :-)
it seems to me that accents, importances, and priorities are worth discussing
I agree, but at this point context starts to matter a great deal. Are we talking about decision-making in regular life? Like, deciding which major to pick, who to date, what job offer to take? Or are we talking about some explicitly statistical environment where you try to build models, fit them, evaluate them, do out-of-sample forecasting, all that kind of things?
I think I would argue that recognizing biases (Tversky/Kahneman style) and trying to correct for them—avoiding them altogether seems too high a threshold—is different from what people call Bayesian approaches. The Bayesian way of updating on the evidence is part of “thinking correctly”, but there is much, much more than just that.
I think I would argue that recognizing biases (Tversky/Kahneman style) and trying to correct for them—avoiding them altogether seems too high a threshold—is different from what people call Bayesian approaches.
At least one (and I think several) of biases identified by Tversky and Kahneman is “people do X, a Bayesian would do Y, thus people are wrong,” so I think you’re overstating the difference. (I don’t know enough historical details to be sure, but I suspect Tversky and Kahneman might be an example of the Bayesian approach allowing someone to discover novel, correct insights.)
The Bayesian way of updating on the evidence is part of “thinking correctly”, but there is much, much more than just that.
I agree, but it feels like we’re disagreeing. It seems to me that a major Less Wrong project is “thinking correctly,” and a major part of that project is “decision-making under uncertainty,” and a major part of uncertainty is dealing with probabilities, and the Bayesian way of dealing with probabilities seems to be the best, especially if you want to use those probabilities for decision-making.
So it sounds to me like you’re saying “we don’t just need stats textbooks, we need Less Wrong.” I agree; that’s why I’m here as well as reading stats textbooks. But it also sounds to me like you’re saying “why are you naming this Less Wrong stuff after a stats textbook?” The easy answer is that it’s a historical accident, and it’s too late to change it now. Another answer I like better is that much of the Less Wrong stuff comes from thinking about and taking seriously the stuff from the stats textbook, and so it makes sense to keep the name, even if we’re moving to realms where the connection to stats isn’t obvious.
Hm… Let me try to unpack my thinking, in particular my terminology which might not match exactly the usual LW conventions. I think of:
Bayes theorem as a simple, conventional, and entirely uncontroversial statistical result. If you ask a dyed-in-the-wool rabid frequentist whether Bayes theorem is true, he’ll say “Yes, of course”.
Bayesian statistics as an approach to statistics with three main features. First is the philosophical interpretation of (some) probability as subjective belief. Second is the focus on conditional probabilities. Third is the strong preference for full (posterior) distributions as answers instead of point estimates.
Cognitive biases (aka the Kahneman/Tversky stuff) as certain distortions in the way our wetware processes information about reality, as well as certain peculiarities in human decision-making. Yes, a lot of it is concerned with dealing with uncertainty. Yes, there is some synergy with Bayesian statistics. No, I don’t think this synergy is the defining factor here.
I understand that historically in the LW community Bayesian statistics and cognitive biases were intertwined. But apart from historical reasons, it seems to me these are two different things and the degree of their, um, interpenetration is much overstated on LW.
it sounds to me like you’re saying “we don’t just need stats textbooks, we need Less Wrong.”
Well, we need it for which purpose? For real-life decision making? -- sure, but then no one is claiming that stats textbooks are sufficient for that.
much of the Less Wrong stuff comes from thinking about and taking seriously the stuff from the stats textbook
Some, not much. I can argue that much of LW stuff comes from thinking logically and following chains of reasoning to their conclusion—or actually just comes from thinking at all instead of reacting instinctively / on the basis of a gut feeling or whatever.
I agree that thinking in probabilities is a very big step and it *is* tied to Bayesian statistics. But still it’s just one step.
I can argue that much of LW stuff comes from thinking logically … I agree that thinking in probabilities is a very big step
When contrasting LW stuff and mainstream rationality, I think the reliance on thinking in probabilities is a big part of the difference. (“Thinking logically,” for the mainstream, seems to be mostly about logic of certainty.) When labeling, it makes sense to emphasize contrasting features. I don’t think that’s the only large difference, but I see an argument (which I don’t fully endorse) that it’s the root difference.
(For example, consider evolutionary psychology, a moderately large part of LW. This seems like a field of science particularly prone to uncertainty, where “but you can’t prove X!” would often be a conversation-stopper. For the Bayesian, though, it makes sense to update in the direction of evo psych, even though it can’t be proven, which is then beneficial to the extent that evo psych is useful.)
When contrasting LW stuff and mainstream rationality, I think the reliance on thinking in probabilities is a big part of the difference. (“Thinking logically,” for the mainstream, seems to be mostly about logic of certainty.)
Yes, I think you’re right.
For the Bayesian, though, it makes sense to update in the direction of evo psych, even though it can’t be proven
Um, I’m not so sure about that. The main accusation against evolutionary psychology is that it’s nothing but a bunch of just-so stories, aka unfalsifiable post-hoc narratives. And a Bayesian update should be on the basis of evidence, not on the basis of an unverifiable explanation.
The main accusation against evolutionary psychology is that it’s nothing but a bunch of just-so stories, aka unfalsifiable post-hoc narratives.
It seems to me that if you think in terms of likelihoods, you look at a story and say “but the converse of this story has high enough likelihood that we can’t rule it out!” whereas if you think in terms of likelihood ratios, you say “it seems that this story is weakly more plausible than its converse.”
I’m thinking primarily of comments like this. I think it is a reasonable conclusion that anger seems to be a basic universal emotion because ancestors who had the ‘right’ level of anger reproduced more than those who didn’t. Boris just notes that it could be the case that anger is a byproduct of something else, but doesn’t note anything about the likelihood of anger being universal in a world where it is helpful (very high) and the likelihood of anger being universal in a world where it is neutral or unhelpful (very low). We can’t rule out anger being spurious, but asking to rule that out is mistaken, I think, because the likelihood ratio is so significant. It doesn’t make sense to bet against anger being reproductively useful in the ancestral environment (but I think it makes sense to assign a probability to that bet, even if it’s not obvious how one would resolve it).
It seems to me that if you think in terms of likelihoods, you look at a story and say “but the converse of this story has high enough likelihood that we can’t rule it out!” whereas if you think in terms of likelihood ratios, you say “it seems that this story is weakly more plausible than its converse.”
I have several problems with this line of reasoning. First, I am unsure what it means for a story to be true. It’s a story—it arranges a set of facts in a pattern pleasing to the human brain. Not contradicting any known facts is a very low threshold (see Russell’s teapot); to call something “true” I’ll need more than that, and if a story makes no testable predictions I am not sure on what basis I should evaluate its truth, or what that would even mean.
Second, it seems to me that in such situations the likelihoods and so, necessarily, their ratios are very very fuzzy. My meta uncertainty—uncertainty about probabilities—is quite high. I might say “story A is weakly more plausible than story B” but my confidence in my judgment about plausibility is very low. This judgment might not be worth anything.
Third, likelihood ratios are good when you know you have a complete set of potential explanations. And you generally don’t. For open-ended problems the explanation “something else” frequently looks like the more plausible one, but again, the meta uncertainty is very high—not only do you not know how uncertain you are, you don’t even know what you are uncertain about! Nassim Taleb’s black swans are precisely the beasties that appear out of “something else” to bite you in the ass.
First, I am unsure what it means for a story to be true.
Ah, by that I generally mean something like “the causal network N with a particular factorization F is the underlying causal representation of reality,” and so a particular experiment measures data and then we calculate “the aforementioned causal network would generate this data with probability P” for various hypothesized causal networks.
For situations where you can control at least one of the nodes, it’s easy to see how you can generate data useful for this. For situations where you only have observational data (like the history of human evolution, mostly), then it’s trickier to determine which causal network(s) is(are) best, but often still possible to learn quite a bit more about the underlying structure than is obvious at first glance.
So suppose we have lots of historical lives which are compressed down to two nodes, A which measures “anger” (which is integer-valued and non-negative, say) and C which measures “children” (which is also integer valued and non-negative). The story “anger is spurious” is the network where A and C don’t have a link between them, and the story “anger is reproductively useful” is the network where A->C and there is some nonzero value a^* of A which maximizes the expected value of C. If we see a relationship between A and C in the data, it’s possible that the relationship was generated by the “anger is spurious” network which said those variables were independent, but we can calculate the likelihoods and determine that it’s very very low, especially as we accumulate more and more data.
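Here is a toy version of that two-network comparison. The distributions and parameter values are made up, and it uses plug-in maximum-likelihood estimates rather than a full Bayesian treatment; it is only meant to show how likelihoods separate the two stories once there is enough data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500
A = rng.integers(0, 5, size=n)       # "anger" levels 0..4
rate = 3.0 - 0.8 * np.abs(A - 2)     # the "right" anger level (A = 2) maximizes expected children
C = rng.poisson(rate)                # data actually generated by an A -> C network

# Story 1: "anger is spurious" -- C is independent of A, one Poisson rate for everyone.
loglik_spurious = stats.poisson.logpmf(C, C.mean()).sum()

# Story 2: "anger is reproductively useful" -- A -> C, a separate rate per anger level.
loglik_causal = sum(
    stats.poisson.logpmf(C[A == a], C[A == a].mean()).sum() for a in range(5)
)

print(loglik_causal - loglik_spurious)   # clearly positive log-likelihood ratio favoring A -> C
```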
Third, likelihood ratios are good when you know you have a complete set of potential explanations. And you generally don’t.
Sure. But even if you’re only aware of two hypotheses, it’s still useful to use the LR to determine which to prefer; the supremacy of a third hidden hypothesis can’t swap the ordering of the two known hypotheses!
Nassim Taleb’s black swans are precisely the beasties that appear out of “something else” to bite you in the ass.
Yes, reversal effects are always possible, but I think that putting too much weight on this argument leads to Anton-Wilsonism (certainty is necessary but impossible). I think we do often have a good idea of what our meta uncertainty looks like in a lot of cases, and that’s generally enough to get the job done.
I have only glanced at Pearl’s work, not read it carefully, so my understanding of causal networks is very limited. But I don’t understand on the basis of which data will you construct the causal network for anger and children (and it’s actually more complicated because there are important society-level effects). In what will you “see a relationship between A and C”? On the basis of what will you be calculating the likelihoods?
In what will you “see a relationship between A and C”? On the basis of what will you be calculating the likelihoods?
Ideally, you would have some record. I’m not an expert in evo psych, so I can’t confidently say what sort of evidence they actually rely on. I was hoping more to express how I would interpret a story as a formal hypothesis.
I get the impression that a major technique in evolutionary psychology is making use of the selection effect due to natural selection: if you think that A is heritable, and that different values of A have different levels of reproductive usefulness, then in steady state the distribution of A in the population gives you information about the historic relationship between A and reproductive usefulness, without even measuring relationship between A and C in this generation. So you can ask the question “what’s the chance of seeing the cluster of human anger that we have if there’s not a relationship between A and reproduction?” and get answers that are useful enough to focus most of your attention on the “anger is reproductively useful” hypothesis.
I guess the distinction in my mind is that in a Bayesian approach one enumerates the various hypotheses ahead of time. This is in contrast to coming up with a single hypothesis and then adding in more refined versions based on results. There are trade-offs between the two. Once you get going with a Bayesian approach you are much better protected against bias; however, if you are missing some hypothesis from your prior, you won’t find it.
Here are some specific responses to the 4 answers:
If you have a problem for which it is easy to enumerate the hypotheses, and have statistical data, then Bayes is great. If in addition you have a good prior probability distribution then you have the additional advantage that it is much easier to avoid bias. However if you find you are having to add in new hypotheses as you investigate then I would say you are using a hybrid method.
Even without Bayes one is supposed to specifically look for alternate hypothesis and search for the best answer. On the Less Wrong welcome page the link next to the Bayesian link is a reference to the 2 4 6 experiment. I’d say this is an example of a problem poorly suited to Bayesian reasoning. It’s not a statistical problem, and it’s really hard to enumerate the prior for all rules for a list of 3 numbers ordered by simplicity. There’s clearly a problem with confirmation bias, but I would say the thing to do is to step back and do some careful experimentation along traditional lines. Maybe Bayesian reasoning is helpful because it would encourage you to do that?
I would agree that a rationalist needs to be exposed to these concepts.
I wonder about this statement the most. It’s hard to judge qualitative statements about probabilities. For example, I can say that I had a low prior belief in cryonics, and since reading articles here I have updated and now have a higher probability. I know I had some biases against the idea. However, I still don’t agree and it’s difficult to tell how much progress I’ve made in understanding the arguments.
That paper did help crystallize some of my thoughts. At this point I’m more interested in wondering if I should be modifying how I think, as opposed to how to implement AI.
You are not alone in thinking the use of Bayes is overblown. It can’t be wrong, of course, but it can be impractical to use, and in many real life situations we might not have specific enough knowledge to be able to use it. In fact, that’s probably one of the biggest criticisms of Less Wrong.
Hi Less Wrong. I found a link to this site a year or so ago and have been lurking off and on since. However, I’ve self identified as a rationalist since around junior high school. My parents weren’t religious and I was good at math and science, so it was natural to me to look to science and logic to solve everything. Many years later I realize that this is harder than I hoped.
Anyway, I’ve read many of the sequences and posts, generally agreeing and finding many interesting thoughts. It’s fun reading about zombies and Newcomb’s problem and the like.
I guess this sounds heretical, but I don’t understand why Bayes theorem is placed on such a pedestal here. I understand Bayesian statistics, intuitively and also technically. Bayesian statistics is great for a lot of problems, but I don’t see it as always superior to thinking inspired by the traditional scientific method. More specifically, I would say that coming up with a prior distribution and updating can easily be harder than the problem at hand.
I assume the point is that there is more to what is considered Bayesian thinking than Bayes theorem and Bayesian statistics, and I’ve reread some of the articles with the idea of trying to pin that down, but I’ve found that difficult. The closest I’ve come is that examining what your priors are helps you to keep an open mind.
Bayesian theorem is just one of many mathematical equations, like for example Pythagorean theorem. There is inherently nothing magical about it.
It just happens to explain one problem with the current scientific publishing process: neglecting base rates. Which sometimes seems like this: “I designed an experiment that would prove a false hypothesis only with probability p = 0.05. My experiment has succeeded. Please publish my paper in your journal!”
(I guess I am exaggerating a bit here, but many people ‘doing science’ would not understand immediately what is wrong with this. And that would be those who even bother to calculate the p-value. Not everyone who is employed as a scientist is necessarily good at math. Many people get paid for doing bad science.)
This kind of thinking has the following problem: Even if you invent hundred completely stupid hypotheses; if you design experiments that would prove a false hypothesis only with p = 0.05, that means five of them would be proved by the experiment. If you show someone else all hundred experiments together, they may understand what is wrong. But you are more likely to send only the successful five ones to the journal, aren’t you? -- But how exactly is the journal supposed to react to this? Should they ask: “Did you do many other experiments, even ones completely irrelevant to this specific hypothesis? Because, you know, that somehow undermines the credibility of this one.”
The current scientific publishing process has a bias. Bayesian theorem explains it. We care about science, and we care about science being done correctly.
That’s not neglecting base rates, that’s called selection bias combined with incentives to publish. Bayes theorem isn’t going to help you with this.
http://xkcd.com/882/
Uhm, it’s similar, but not the same.
If I understand it correctly, selection bias is when 20 researchers make an experiment with green jelly beans, 19 of them don’t find significant correlation, 1 of them finds it… and only the 1 publishes, and the 19 don’t. The essence is that we had 19 pieces of evidence against the green jelly beans, only 1 piece of evidence for the green jelly beans, but we don’t see those 19 pieces, because they are not published. Selection = “there is X and Y, but we don’t see Y, because it was filtered out by the process that gives us information”.
But imagine that you are the first researcher ever who has researched the jelly beans. And you only did one experiment. And it happened to succeed. Where is the selection here? (Perhaps selection across Everett branches or Tegmark universes. But we can’t blame the scientific publishing process for not giving us information from the parallel universes, can we?)
In this case, base rate neglect means ignoring the fact that “if you take a random thing, the probability that this specific thing causes acne is very low”. Therefore, even if the experiment shows a connection with p = 0.05, it’s still more likely that the result just happened randomly.
The proper reasoning could be something like this (all number pulled out of the hat) -- we already have pretty strong evidence that acne is caused by food; let’s say there is a 50% probability for this. With enough specificity (giving each fruit a different category, etc.), there are maybe 2000 categories of food. It is possible that more then one of them cause acne, and our probability distribution for that is… something. Considering all this information, we estimate a prior probability let’s say 0.0004 that a random food causes acne. -- Which means that if the correlation is significant on level p = 0.05, that per se means almost nothing. (Here one could use the Bayes theorem to calculate that the p = 0.05 successful experiment shows the true cause of acne with probablity cca 1%.) We need to increase it to p = 0.0004 just to get a 50% chance of being right. How can we do that? We should use a much larger sample, or we should repeat the experiment many times, record all the successed and failures, and do a meta-analysis.
That’s a different case—you have no selection bias here, but your conclusions are still uncertain—if you pick p=0.05 as your threshold, you’re clearly accepting that there is a 5% chance of a Type I error: the green jelly beans did nothing, but the noise happened to be such that you interpreted it as conclusive evidence in favor of your hypothesis.
But that all is fine—the readers of scientific papers are expected to understand that results significant to p=0.05 will be wrong around 5% of the times, more or less (not exactly because the usual test measures P(D|H), the probability of the observed data given the (null) hypothesis while you really want P(H|D), the probability of the hypothesis given the data).
People rarely take entirely random things and test them for causal connection to acne. Notice how you had to do a great deal of handwaving in establishing your prior (aka the base rate).
As an exercise, try to be specific. For example, let’s say I want to check if the tincture made from the bark of a certain tree helps with acne. How would I go about calculating my base rate / prior? Can you walk me through an estimation which will end with a specific number?
And this is the base rate neglect. It’s not “results significant to p=0.05 will be wrong about 5% of time”. It’s “wrong results will be significant to p=0.05 about 5% of time”. And most people will confuse these two things.
It’s like when people confuse “A ⇒ B” with “B ⇒ A”, only this time it is “A ⇒ B (p=0.05)” with “B ⇒ A (p=0.05)”. It is “if wrong, then in 5% significant”. It is not “if significant, then in 5% wrong”.
Yes, you are right. Establishing the prior is pretty difficult, perhaps impossible. (But that does not make “A ⇒ B” equal to “B ⇒ A”.) Probably the reasonable thing to do would be simply to impose strict limits in areas where many results were proved wrong.
Um, what “strict limits” are you talking about, what will they look like, and who will be doing the imposing?
To get back to my example, let’s say I’m running experiments to check if the tincture made from the bark of a certain tree helps with acne—what strict limits would you like?
p = 0.001, and if at the end of the year too many researches fail to replicate, keep decreasing. (let’s say that “fail to replicate” in this context means that the replication attempt cannot prove it even with p = 0.05 -- we don’t want to make replications too expensive, just a simple sanity check)
a long answer would involve a lot of handwaving again (it depends on why do you believe the bark is helpful; in other words, what other evidence do you already have)
a short answer: for example, p = 0.001
Well, and what’s magical about this particular number? Why not p=0.01? why not p=0.0001? Confidence thresholds are arbitrary, do you have a compelling argument why any particular one is better than the rest?
Besides, you’re forgetting the costs. Assume that the reported p-values are true (and not the result of selection bias, etc.). Take a hundred papers which claim results at p=0.05. At the asymptote about 95 of them will turn out to be correct and about 5 will turn out to be false. By your strict criteria you’re rejecting all of them—you’re rejecting 95 correct papers. There is a cost to that, is there not?
Lumifer, please update that at this moment you don’t grok the difference between “A ⇒ B (p=0.05)” and “B ⇒ A (p = 0.05)”, which is why you don’t understand what p-value really means, which is why you don’t understand the difference between selection bias and base rate neglect, which is probably why the emphasis on using Bayes theorem in scientific process does not make sense to you. You made a mistake, that happens to all of us. Just stop it already, please.
And don’t feel bad about it. Until recently I didn’t understand it too, and I had a gold medal from international mathematical olympiad. Somehow it is not explained correctly at most schools, perhaps because the teachers don’t get it themselves, or maybe they just underestimate the difficulty of proper understanding and the high chance of getting it wrong. So please don’t contibute to the confusion.
Imagine that there are 1000 possible hypotheses, among which 999 are wrong, and 1 is correct. (That’s just a random example to illustrate the concept. The numbers in real life can be different.) You have an experiment that says “yes” to 5% of the wrong hypotheses (this is what p=0.05 means), and also to the correct hypothesis. So at the end, you have 50 wrong hypotheses and 1 correct hypothesis confirmed by the experiment. So in the journal, 98% of the published articles would be wrong, not 5%. It is “wrong ⇒ confirmed (p=0.05)”, not “confirmed ⇒ wrong (p=0.05)”.
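To make that arithmetic concrete, here is a minimal sketch in Python (the 1000 hypotheses, 1 true hypothesis, p=0.05 test, and perfect power are just the illustrative assumptions above, not real-world numbers):

```python
# Illustrative numbers from the example above: 1000 candidate hypotheses,
# 1 true, a test with a 5% false-positive rate and (optimistically) perfect power.
n_hypotheses = 1000
n_true = 1
alpha = 0.05   # P(test says "yes" | hypothesis is wrong)
power = 1.0    # P(test says "yes" | hypothesis is correct), assumed perfect here

false_positives = (n_hypotheses - n_true) * alpha   # ~50 wrong hypotheses confirmed
true_positives = n_true * power                     # 1 correct hypothesis confirmed

share_wrong = false_positives / (false_positives + true_positives)
print(f"Fraction of confirmed results that are wrong: {share_wrong:.0%}")  # ~98%
```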
LOL. Yeah, yeah, mea culpa, I had a brain fart and expressed myself very poorly.
I do understand what p-value really means. The issue was that I had in mind a specific scenario (where in effect you’re trying to see if the difference in means between two groups is significant) but neglected to mention it in the post :-)
I feel like this could use a bit longer explanation, especially since I think you’re not hearing Lumifer’s point, so let me give it a shot. (I’m not sure I see a meaningful difference between base rate neglect and selection bias in this circumstance.)
The word “grok” in Viliam_Bur’s comment is really important. This part of the grandparent is true:
But it’s like saying “well, assume the diagnosis is correct. Then the treatment will make the patient better with high probability.” While true, it’s totally out of touch with reality- we can’t assume the diagnosis is correct, and a huge part of being a doctor is responding correctly to that uncertainty.
Earlier, Lumifer said this, which is an almost correct explanation of using Bayes in this situation:
The part that makes it the “almost” is the “5% of the times, more or less.” This implies that it’s centered around 5%, with random chance determining what this instance is. But selection bias means it will almost certainly be more, and generally much more. In fields that study phenomena that don’t exist, 100% of the papers published will be of false results that were significant by chance. In many real fields, rates of failure to replicate are around 30%. Describing 30% as “5%, more or less” seems odd, to say the least.
But the proposal to reduce the p value doesn’t solve the underlying problem (which was Lumifer’s response). If we set the p value threshold lower, at .01 or .001 or wherever, we reduce the risk of false positives at the cost of increasing the risk of false negatives. A study design which needs to detect an effect at the .001 level is much more expensive than a study design which needs to detect an effect at the .05 level, and so we will have many fewer studies attempted, and many, many fewer published studies.
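To put a rough number on that cost, here is a sketch using the standard normal approximation for a two-sided two-sample comparison; the 0.3-standard-deviation effect size and the 80% power target are arbitrary assumptions chosen purely for illustration:

```python
from scipy.stats import norm

def n_per_group(alpha, power=0.8, effect_size=0.3):
    """Approximate sample size per arm for a two-sided two-sample z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: ~{n_per_group(alpha):.0f} subjects per group")
# Roughly 175 -> 260 -> 380 subjects per group: the required sample size
# about doubles going from a 0.05 threshold to a 0.001 threshold.
```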
Better to drop p entirely. Notice that stricter p thresholds go in the opposite direction from the publication of negative results, which is the real solution to the problem of selection bias. By calling for stricter p thresholds, you implicitly assume that p is a worthwhile metric, when what we really want is publication of negative results and more replications.
My grandparent post was stupid, but what I had in mind was basically a phase-2 (or phase-3) drug trial situation. You have declared (at least to the FDA) that you’re running a trial, so selection bias does not apply at this stage. You have two groups, one receives the experimental drug, one receives a placebo. Assume a double-blind randomized scenario and assume there is a measurable metric of improvement at the end of the trial.
After the trial you have two groups with two empirical distributions of the metric of choice. The question is how confident you are that these two distributions are different.
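Purely as an illustration of that setup (the 100 patients per arm, the normally distributed metric, and the 0.4-standard-deviation benefit are all made up), one conventional way to quantify that confidence is a two-sample t-test:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical trial: 100 patients per arm, drug shifts the metric by 0.4 SD.
placebo = rng.normal(loc=0.0, scale=1.0, size=100)
drug = rng.normal(loc=0.4, scale=1.0, size=100)

t_stat, p_value = ttest_ind(drug, placebo)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value says the observed difference would be surprising if the two
# distributions were identical -- it is P(data | no difference), not
# P(no difference | data), which is the distinction at issue in this thread.
```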
Well, as usual it’s complicated. Yes, the p-test is suboptimal in most situations where it’s used in reality. However, it fulfils a need, and if you drop the test entirely you need a replacement, because the need won’t go away.
That’s not how p-values work. p=0.05 doesn’t mean that the hypothesis is 95% likely to be correct, even in principle; it means that there’s a 5% chance of seeing a correlation at least that strong if the null hypothesis is true. Pull a hundred independent data sets and we’d normally expect to find a p=0.05 correlation or better in around five of them, no matter whether we’re testing, say, an association of cancer risk with smoking or with overuse of the word “muskellunge”.
This distinction’s especially important to keep in mind in an environment where running replications is relatively low-status or where negative results tend to be quietly shelved—both of which, as it happens, hold true in large chunks of academia. But even if this weren’t the case, we’d normally expect replication rates to be less than one minus the claimed p-value, simply because there are many more promising ideas than true ones and some of those will turn up false positives.
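A quick simulation sketch of the first point above, with a hundred made-up data sets in which nothing is correlated with anything:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
false_positives = 0
for _ in range(100):
    x = rng.normal(size=50)   # e.g. overuse of the word "muskellunge"
    y = rng.normal(size=50)   # cancer risk; independent of x by construction
    _, p = pearsonr(x, y)
    false_positives += (p < 0.05)

print(false_positives, "of 100 null data sets reached p < 0.05")
# Typically around five, even though every one of these hypotheses is false.
```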
No, they won’t. You’re committing base rate neglect. It’s entirely possible for people to publish 2000 papers in a field where there’s no hope of finding a true result, and get 100 false results with p < 0.05.
I know a few answers to this question, and I’m sure there are others. (As an aside, these foundational questions are, in my opinion, really important to ask and answer.)
What separates scientific thought and mysticism is that scientists are okay with mystery. If you can stand to not know what something is, to be confused, then after careful observation and thought you might have a better idea of what it is and have a bit more clarity. Bayes is the quantitative heart of the qualitative approach of tracking many hypotheses and checking how concordant they are with reality, and thus should feature heavily in a modern epistemic approach. The more precisely and accurately you can deal with uncertainty, the better off you are in an uncertain world.
What separates Bayes and the “traditional scientific method” (using scare quotes to signify that I’m highlighting a negative impression of it) is that the TSM is a method for avoiding bad beliefs, but Bayes is a method for finding the best available beliefs. In many uncertain situations you can use Bayes but you can’t use the TSM (or it would be too costly to do so), and the TSM gives no predictions at all in those cases!
Use of Bayes focuses attention on base rates, alternate hypotheses, and likelihood ratios, which people often ignore (replacing the first with maxent, the second with yes/no thinking, and the third with bare likelihoods).
I honestly don’t think the quantitative aspect of priors and updating is that important, compared to the search for a ‘complete’ hypothesis set and the search for cheap experiments that have high likelihood ratios (little bets).
I think that the qualitative side of Bayes is super important but don’t think we’ve found a good way to communicate it yet. That’s an active area of research, though, and in particular I’d love to hear your thoughts on those four answers.
What is the qualitative side of Bayes?
Unfortunately, the end of that sentence is still true:
I think that What Bayesianism Taught Me is a good discussion on the subject, and my comment there explains some of the components I think are part of qualitative Bayes.
I think that a lot of qualitative Bayes is incorporating the insights of the Bayesian approach into your System 1 thinking (i.e. habits on the 5 second level).
Well, yes, but most of the things there are just useful ways to think about probabilities and uncertainty, proper habits, things to check, etc. Why Bayes? He’s not a saint whose name is needed to bless a collection of good statistical practices.
It’s more or less the same reason people call a variety of essentialist positions ‘platonism’ or ‘aristotelianism’. Those aren’t the only thinkers to have had views in this neighborhood, but they predated or helped inspire most of the others, and the concepts have become pretty firmly glued together. Similarly, the phrases ‘Bayes’ theorem’ and ‘Bayesian interpretation of probability’ (whence, jointly, the idea of Bayesian inference) have firmly cemented the name Bayes to the idea of quantifying psychological uncertainty and correctly updating on the evidence. The Bayesian interpretation is what links these theorems to actual practice.
Bayes himself may not have been a ‘Bayesian’ in the modern sense, just as Plato wasn’t a ‘platonist’ as most people use the term today. But the names have stuck, and ‘Laplacian’ or ‘Ramseyan’ wouldn’t have quite the same ring.
I like Laplacian as a name better, but it’s already a thing.
If I were to pretend that I’m a mainstream frequentist and consider “quantifying psychological uncertainty” to be subjective mumbo-jumbo with no place anywhere near real science :-D I would NOT have serious disagreements with e.g. Vaniver’s list. Sure, I would quibble about accents, importances, and priorities, but there’s nothing there that would be unacceptable from the mainstream point of view.
My biggest concern with the label ‘Bayesianism’ isn’t that it’s named after the Reverend, nor that it’s too mainstream. It’s that it’s really ambiguous.
For example, when Yvain speaks of philosophical Bayesianism, he means something extremely modest—the idea that we can successfully model the world without certainty. This view he contrasts, not with frequentism, but with Aristotelianism (‘we need certainty to successfully model the world, but luckily we have certainty’) and Anton-Wilsonism (‘we need certainty to successfully model the world, but we lack certainty’). Frequentism isn’t this view’s foil, and this philosophical Bayesianism doesn’t have any respectable rivals, though it certainly sees plenty of assaults from confused philosophers, anthropologists, and poets.
If frequentism and Bayesianism are just two ways of defining a word, then there’s no substantive disagreement between them. Likewise, if they’re just two different ways of doing statistics, then it’s not clear that any philosophical disagreement is at work; I might not do Bayesian statistics because I lack skill with R, or because I’ve never heard about it, or because it’s not the norm in my department.
There’s a substantive disagreement if Bayesianism means ‘it would be useful to use more Bayesian statistics in science’, and if frequentism means ‘no it wouldn’t!‘. But this methodological Bayesianism is distinct from Yvain’s philosophical Bayesianism, and both of those are distinct from what we might call ‘Bayesian rationalism’, the suite of mantras, heuristics, and exercises rationalists use to improve their probabilistic reasoning. (Or the community that deems such practices useful.) Viewing the latter as an ideology or philosophy is probably a bad idea, since the question of which of these tricks are useful should be relatively easy to answer empirically.
Err, actually, yes it is. The frequentist interpretation of probability makes the claim that probability theory can only be used in situations involving large numbers of repeatable trials, or selection from a large population. William Feller:
Or to quote from the essay that coined the term “frequentist”:
Frequentism is only relevant to epistemological debates in a negative sense: unlike Aristotelianism and Anton-Wilsonism, which both present their own theories of epistemology, frequentism’s relevance consists almost entirely in claiming that Bayesianism is wrong. (Frequentism separately presents much more complicated and less obviously wrong claims within statistics and probability; these are not relevant here, given that frequentism’s sole relevance to epistemology is its claim that no theory of statistics and probability could be a suitable basis for an epistemology, since there are many events such theories simply don’t apply to.)
(I agree that it would be useful to separate out the three versions of Bayesianism, whose claims, while related, do not need to all be true or false at the same time. However, all three are substantively opposed to one or both of the views labelled frequentist.)
Depends which frequentist you ask. From Aris Spanos’s “A frequentist interpretation of probability for model-based inductive inference”:
and
For those who can’t access that through the paywall (I can), his presentation slides for it are here. I would hate to have been in the audience for the presentation, but the upside of that is that they pretty much make sense on their own, being just a compressed version of the paper.
While looking for those, I also found “Frequentists in Exile”, which is Deborah Mayo’s frequentist statistics blog.
I am not enough of a statistician to make any quick assessment of these, but they look like useful reading for anyone thinking about the foundations of uncertain inference.
I don’t understand what this “probability theory can only be used...” claim means. Are they saying that if you try to use probability theory to model anything else, your pencil will catch fire? Are they saying that if you model beliefs probabilistically, Math breaks? I need this claim to be unpacked. What do frequentists think is true about non-linguistic reality, that Bayesians deny?
I think they would be most likely to describe it as a category error. If you try to use probability theory outside the constraints within which they consider it applicable, they’d attest that you’d produce no meaningful knowledge and accomplish nothing but confusing yourself.
Can you walk me through where this error arises? Suppose I have a function whose arguments are the elements of a set S, whose values are real numbers between 0 and 1, and whose values sum to 1. Is the idea that if I treat anything in the physical world other than objects’ or events’ memberships in physical sequences of events or heaps of objects as modeling such a set, the conclusions I draw will be useless noise? Or is there something about the word ‘probability’ that makes special errors occur independently of the formal features of sample spaces?
As best I can parse the question, I think the former option better describes the position.
IIRC a common claim was that modeling beliefs at all is “subjective” and therefore unscientific.
Do you have any links to this argument? I’m having a hard time seeing why any mainstream scientist who thinks beliefs exist at all would think they’re ineffable....
Hmm, I thought I had read it in Jaynes’ PT:TLoS, but I can’t find it now. So take the above with a grain of salt, I guess.
Yes, but frequentists have zero problems with hypothetical trials or populations.
Do note that for most well-specified statistical problems the Bayesians and the frequentists will come to the same conclusions. Differently expressed, likely, but not contradicting each other.
Yes, it is my understanding that epistemologists usually call the set of ideas Yvain is referring to “probabilism”, and indeed it is far more vague and modest than what they call Bayesianism (which is in turn more vague and modest than the subjectively-objective Bayesianism that is often affirmed around these parts).
BTW, I think this is precisely what Carnap was on about with his distinction between probability-1 and probability-2, neither of which did he think we should adopt to the exclusion of the other.
I think they would have significant practical disagreement with #3, given the widespread use of NHST, but clever frequentists are as quick as anyone else to point out that NHST doesn’t actually do what its users want it to do.
Hence the importance of the qualifier ‘qualitative’; it seems to me that accents, importances, and priorities are worth discussing, especially if you’re interested in changing System 1 thinking instead of System 2 thinking. The mainstream frequentist thinks that base rate neglect is a mistake, but the Bayesian both thinks that base rate neglect is a mistake and has organized his language to make that mistake obvious when it occurs. If you take revealed preferences seriously, it looks like the frequentist says base rate neglect is a mistake but the Bayesian lives that base rate neglect is a mistake.
Now, why Bayes specifically? I would be happy to point to Laplace instead of Bayes, personally, since Laplace seems to have been way smarter and a superior rationalist. But the trouble with naming methods of “thinking correctly” is that everyone wants to name their method “thinking correctly,” and so you rapidly trip over each other. “Rationalism,” for example, refers to a particular philosophical position which is very different from the modal position here at LW. Bayes is useful as a marker, but it is not necessary to come to those insights by way of Bayes.
(I will also note that not disagreeing with something and discovering something are very different thresholds. If someone has a perspective which allows them to generate novel, correct insights, that perspective is much more powerful than one which merely serves to verify that insights are correct.)
Yeah, I said if I were to pretend to be a frequentist—but that didn’t involve suddenly becoming dumb :-)
I agree, but at this point context starts to matter a great deal. Are we talking about decision-making in regular life? Like, deciding which major to pick, who to date, what job offer to take? Or are we talking about some explicitly statistical environment where you try to build models, fit them, evaluate them, do out-of-sample forecasting, all that kind of things?
I think I would argue that recognizing biases (Tversky/Kahneman style) and trying to correct for them—avoiding them altogether seems too high a threshold—is different from what people call Bayesian approaches. The Bayesian way of updating on the evidence is part of “thinking correctly”, but there is much, much more than just that.
At least one (and I think several) of biases identified by Tversky and Kahneman is “people do X, a Bayesian would do Y, thus people are wrong,” so I think you’re overstating the difference. (I don’t know enough historical details to be sure, but I suspect Tversky and Kahneman might be an example of the Bayesian approach allowing someone to discover novel, correct insights.)
I agree, but it feels like we’re disagreeing. It seems to me that a major Less Wrong project is “thinking correctly,” and a major part of that project is “decision-making under uncertainty,” and a major part of uncertainty is dealing with probabilities, and the Bayesian way of dealing with probabilities seems to be the best, especially if you want to use those probabilities for decision-making.
So it sounds to me like you’re saying “we don’t just need stats textbooks, we need Less Wrong.” I agree; that’s why I’m here as well as reading stats textbooks. But it also sounds to me like you’re saying “why are you naming this Less Wrong stuff after a stats textbook?” The easy answer is that it’s a historical accident, and it’s too late to change it now. Another answer I like better is that much of the Less Wrong stuff comes from thinking about and taking seriously the stuff from the stats textbook, and so it makes sense to keep the name, even if we’re moving to realms where the connection to stats isn’t obvious.
Hm… Let me try to unpack my thinking, in particular my terminology which might not match exactly the usual LW conventions. I think of:
Bayes theorem as a simple, conventional, and entirely uncontroversial statistical result. If you ask a dyed-in-the-wool rabid frequentist whether Bayes theorem is true he’ll say “Yes, of course”.
Bayesian statistics as an approach to statistics with three main features. First is the philosophical interpretation of (some) probability as subjective belief. Second is the focus on conditional probabilities. Third is the strong preference for full (posterior) distributions as answers instead of point estimates (see the sketch just after this list).
Cognitive biases (aka the Kahneman/Tversky stuff) as certain distortions in the way our wetware processes information about reality, as well as certain peculiarities in human decision-making. Yes, a lot of it is concerned with dealing with uncertainty. Yes, there is some synergy with Bayesian statistics. No, I don’t think this synergy is the defining factor here.
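A minimal sketch of that third feature, assuming a coin-flip-style problem with a uniform Beta(1,1) prior and made-up data (7 successes in 20 trials):

```python
from scipy.stats import beta

successes, trials = 7, 20          # hypothetical data
posterior = beta(1 + successes, 1 + trials - successes)  # Beta(1,1) prior updated

# A point estimate would just be 7/20 = 0.35.
# The Bayesian answer is a whole distribution over the unknown rate:
print("posterior mean:", round(posterior.mean(), 3))
print("95% credible interval:", posterior.interval(0.95))
```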
I understand that historically in the LW community Bayesian statistics and cognitive biases were intertwined. But apart from historical reasons, it seems to me these are two different things and the degree of their, um, interpenetration is much overstated on LW.
Well, “need” for which purpose? For real-life decision making? -- sure, but then no one is claiming that stats textbooks are sufficient for that.
Some, not much. I can argue that much of LW stuff comes from thinking logically and following chains of reasoning to their conclusion—or actually just comes from thinking at all instead of reacting instinctively / on the basis of a gut feeling or whatever.
I agree that thinking in probabilities is a very big step and it *is* tied to Bayesian statistics. But still it’s just one step.
I agree with your terminology.
When contrasting LW stuff and mainstream rationality, I think the reliance on thinking in probabilities is a big part of the difference. (“Thinking logically,” for the mainstream, seems to be mostly about logic of certainty.) When labeling, it makes sense to emphasize contrasting features. I don’t think that’s the only large difference, but I see an argument (which I don’t fully endorse) that it’s the root difference.
(For example, consider evolutionary psychology, a moderately large part of LW. This seems like a field of science particularly prone to uncertainty, where “but you can’t prove X!” would often be a conversation-stopper. For the Bayesian, though, it makes sense to update in the direction of evo psych, even though it can’t be proven, which is then beneficial to the extent that evo psych is useful.)
Yes, I think you’re right.
Um, I’m not so sure about that. The main accusation against evolutionary psychology is that it’s nothing but a bunch of just-so stories, aka unfalsifiable post-hoc narratives. And a Bayesian update should be on the basis of evidence, not on the basis of an unverifiable explanation.
It seems to me that if you think in terms of likelihoods, you look at a story and say “but the converse of this story has high enough likelihood that we can’t rule it out!” whereas if you think in terms of likelihood ratios, you say “it seems that this story is weakly more plausible than its converse.”
I’m thinking primarily of comments like this. I think it is a reasonable conclusion that anger seems to be a basic universal emotion because ancestors who had the ‘right’ level of anger reproduced more than those who didn’t. Boris just notes that it could be the case that anger is a byproduct of something else, but doesn’t note anything about the likelihood of anger being universal in a world where it is helpful (very high) and the likelihood of anger being universal in a world where it is neutral or unhelpful (very low). We can’t rule out anger being spurious, but asking to rule that out is mistaken, I think, because the likelihood ratio is so significant. It doesn’t make sense to bet against anger being reproductively useful in the ancestral environment (but I think it makes sense to assign a probability to that bet, even if it’s not obvious how one would resolve it).
I have several problems with this line of reasoning. First, I am unsure what it means for a story to be true. It’s a story—it arranges a set of facts in a pattern pleasing to the human brain. Not contradicting any known facts is a very low threshold (see Russell’s teapot); to call something “true” I’ll need more than that, and if a story makes no testable predictions I am not sure on what basis I should evaluate its truth, or what that would even mean.
Second, it seems to me that in such situations the likelihoods and so, necessarily, their ratios are very very fuzzy. My meta uncertainty—uncertainty about probabilities—is quite high. I might say “story A is weakly more plausible than story B” but my confidence in my judgment about plausibility is very low. This judgment might not be worth anything.
Third, likelihood ratios are good when you know you have a complete set of potential explanations. And you generally don’t. For open-ended problems the explanation “something else” frequently looks like the more plausible one, but again, the meta uncertainty is very high—not only do you not know how uncertain you are, you don’t even know what you are uncertain about! Nassim Taleb’s black swans are precisely the beasties that appear out of “something else” to bite you in the ass.
Ah, by that I generally mean something like “the causal network N with a particular factorization F is the underlying causal representation of reality,” and so a particular experiment measures data and then we calculate “the aforementioned causal network would generate this data with probability P” for various hypothesized causal networks.
For situations where you can control at least one of the nodes, it’s easy to see how you can generate data useful for this. For situations where you only have observational data (like the history of human evolution, mostly), then it’s trickier to determine which causal network(s) is(are) best, but often still possible to learn quite a bit more about the underlying structure than is obvious at first glance.
So suppose we have lots of historical lives which are compressed down to two nodes, A which measures “anger” (integer-valued and non-negative, say) and C which measures “children” (also integer-valued and non-negative). The story “anger is spurious” is the network where A and C don’t have a link between them, and the story “anger is reproductively useful” is the network where A->C and there is some nonzero value a^* of A which maximizes the expected value of C. If we see a relationship between A and C in the data, it’s possible that the relationship was generated by the “anger is spurious” network which said those variables were independent, but we can calculate the likelihoods and determine that that probability is very, very low, especially as we accumulate more and more data.
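A toy sketch of that comparison, with entirely made-up data and two deliberately crude models (a single Poisson rate for everyone versus a rate that depends on anger level); this is a stand-in for the likelihood calculation, not Pearl’s actual machinery:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)

# Made-up "historical lives": anger level A in 0..10, children C drawn with a
# Poisson rate that really does peak at a moderate anger level (A = 5).
A = rng.integers(0, 11, size=1000)
true_rate = np.clip(3.0 - 0.1 * (A - 5) ** 2, 0.1, None)
C = rng.poisson(true_rate)

# "Anger is spurious": one Poisson rate for everyone, no A -> C link.
loglik_spurious = poisson.logpmf(C, C.mean()).sum()

# "Anger is reproductively useful": a separate rate for each anger level
# (a crude stand-in for the A -> C network).
rates_by_anger = np.array([C[A == a].mean() for a in range(11)])
loglik_linked = poisson.logpmf(C, rates_by_anger[A]).sum()

print("log-likelihood difference (linked - spurious):", loglik_linked - loglik_spurious)
# Strongly positive with this much data: the observed A/C relationship would be
# very surprising under the independence network (ignoring, for simplicity, the
# penalty you'd normally want for the linked model's extra parameters).
```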
Sure. But even if you’re only aware of two hypotheses, it’s still useful to use the LR to determine which to prefer; the supremacy of a third hidden hypothesis can’t swap the ordering of the two known hypotheses!
Yes, reversal effects are always possible, but I think that putting too much weight on this argument leads to Anton-Wilsonism (certainty is necessary but impossible). I think we do often have a good idea of what our meta uncertainty looks like in a lot of cases, and that’s generally enough to get the job done.
I have only glanced at Pearl’s work, not read it carefully, so my understanding of causal networks is very limited. But I don’t understand on the basis of which data you will construct the causal network for anger and children (and it’s actually more complicated because there are important society-level effects). In what will you “see a relationship between A and C”? On the basis of what will you be calculating the likelihoods?
Ideally, you would have some record. I’m not an expert in evo psych, so I can’t confidently say what sort of evidence they actually rely on. I was hoping more to express how I would interpret a story as a formal hypothesis.
I get the impression that a major technique in evolutionary psychology is making use of the selection effect due to natural selection: if you think that A is heritable, and that different values of A have different levels of reproductive usefulness, then in steady state the distribution of A in the population gives you information about the historic relationship between A and reproductive usefulness, without even measuring relationship between A and C in this generation. So you can ask the question “what’s the chance of seeing the cluster of human anger that we have if there’s not a relationship between A and reproduction?” and get answers that are useful enough to focus most of your attention on the “anger is reproductively useful” hypothesis.
I guess the distinction in my mind is that in a Bayesian approach one enumerates the various hypotheses ahead of time. This is in contrast to coming up with a single hypothesis and then adding in more refined versions based on results. There are trade-offs between the two. Once you get going with a Bayesian approach you are much better protected against bias; however, if you are missing some hypothesis from your prior you won’t find it.
Here are some specific responses to the 4 answers:
If you have a problem for which it is easy to enumerate the hypotheses, and have statistical data, then Bayes is great. If in addition you have a good prior probability distribution then you have the additional advantage that it is much easier to avoid bias. However if you find you are having to add in new hypotheses as you investigate then I would say you are using a hybrid method.
Even without Bayes one is supposed to specifically look for alternate hypotheses and search for the best answer.
On the Less Wrong welcome page the link next to the Bayesian link is a reference to the 2 4 6 experiment. I’d say this is an example of a problem poorly suited to Bayesian reasoning. It’s not a statistical problem, and it’s really hard to enumerate the prior for all rules for a list of 3 numbers ordered by simplicity. There’s clearly a problem with confirmation bias, but I would say the thing to do is to step back and do some careful experimentation along traditional lines. Maybe Bayesian reasoning is helpful because it would encourage you to do that?
I would agree that a rationalist needs to be exposed to these concepts.
I wonder about this statement the most. It’s hard to judge qualitative statements about probabilities. For example, I can say that I had a low prior belief in cryonics, and since reading articles here I have updated and now have a higher probability. I know I had some biases against the idea. However, I still don’t agree and it’s difficult to tell how much progress I’ve made in understanding the arguments.
Regarding Bayes, you might like my essay on the topic, especially if you have statistical training.
That paper did help crystallize some of my thoughts. At this point I’m more interested in wondering if I should be modifying how I think, as opposed to how to implement AI.
You are not alone in thinking the use of Bayes is overblown. It can’t be wrong, of course, but it can be impractical to use, and in many real-life situations we might not have specific enough knowledge to be able to use it. In fact, that’s probably one of the biggest criticisms of Less Wrong.