How to Fix Science
Like The Cognitive Science of Rationality, this is a post for beginners. Send the link to your friends!
Science is broken. We know why, and we know how to fix it. What we lack is the will to change things.
In 2005, several analyses suggested that most published results in medicine are false. A 2008 review showed that perhaps 80% of academic journal articles mistake “statistical significance” for “significance” in the colloquial meaning of the word, an elementary error every introductory statistics textbook warns against. This year, a detailed investigation showed that half of published neuroscience papers contain one particular simple statistical mistake.
Also this year, a respected senior psychologist published a study in a leading journal claiming to show evidence of precognition. The editors explained that the paper was accepted because it was written clearly and followed the usual standards for experimental design and statistical methods.
Science writer Jonah Lehrer asks: “Is there something wrong with the scientific method?”
Yes, there is.
This shouldn’t be a surprise. What we currently call “science” isn’t the best method for uncovering nature’s secrets; it’s just the first set of methods we’ve collected that isn’t totally useless, the way personal anecdote and appeal to authority generally are.
As time passes we learn new things about how to do science better. The Ancient Greeks practiced some science, but few scientists tested hypotheses against mathematical models before Ibn al-Haytham’s 11th-century Book of Optics (which also contained hints of Occam’s razor and positivism). Around the same time, Al-Biruni emphasized the importance of repeated trials for reducing the effect of accidents and errors. Galileo brought mathematics to greater prominence in scientific method, Bacon described eliminative induction, Newton demonstrated the power of consilience (unification), Peirce clarified the roles of deduction, induction, and abduction, and Popper emphasized the importance of falsification. We’ve also discovered the usefulness of peer review, control groups, blind and double-blind studies, plus a variety of statistical methods, and added these to “the” scientific method.
In many ways, the best science done today is better than ever — but it still has problems, and most science is done poorly. The good news is that we know what these problems are and we know multiple ways to fix them. What we lack is the will to change things.
This post won’t list all the problems with science, nor will it list all the promising solutions for any of these problems. (Here’s one I left out.) Below, I only describe a few of the basics.
Problem 1: Publication bias
When the study claiming to show evidence of precognition was published, psychologist Richard Wiseman set up a registry for advance announcement of new attempts to replicate the study.
Carl Shulman explains:
A replication registry guards against publication bias, and at least 5 attempts were registered. As far as I can tell, all of the subsequent replications have, unsurprisingly, failed to replicate Bem’s results. However, JPSP and the other high-end psychology journals refused to publish the results, citing standing policies of not publishing straight replications.
From the journals’ point of view, this (common) policy makes sense: bold new claims will tend to be cited more and raise journal prestige (which depends on citations per article), even though this means most of the ‘discoveries’ they publish will be false despite their low p-values (high statistical significance). However, this means that overall the journals are giving career incentives for scientists to massage and mine their data for bogus results, but not to challenge bogus results presented by others.
This is an example of publication bias:
Publication bias is the term for what occurs whenever the research that appears in the published literature is systematically unrepresentative of the population of completed studies. Simply put, when the research that is readily available differs in its results from the results of all the research that has been done in an area, readers and reviewers of that research are in danger of drawing the wrong conclusion about what that body of research shows. In some cases this can have dramatic consequences, as when an ineffective or dangerous treatment is falsely viewed as safe and effective. [Rothstein et al. 2005]
Sometimes, publication bias can be more deliberate. The anti-inflammatory drug Rofecoxib (Vioxx) is a famous case. The drug was prescribed to 80 million people, but it was later revealed that its maker, Merck, had withheld evidence of the drug’s risks. Merck was forced to recall the drug, but it had already resulted in an estimated 88,000–144,000 cases of serious heart disease.
Example partial solution
One way to combat publication bias is for journals to only accept experiments that were registered in a public database before they began. This allows scientists to see which experiments were conducted but never reported (perhaps due to negative results). Several prominent medical journals (e.g. The Lancet and JAMA) now operate this way, but this protocol is not as widespread as it could be.
Problem 2: Experimenter bias
Scientists are humans. Humans are affected by cognitive heuristics and biases (or, really, humans just are cognitive heuristics and biases), and they respond to incentives that may not align with an optimal pursuit of truth. Thus, we should expect experimenter bias in the practice of science.
There are many stages in research during which experimenter bias can occur:
in reading-up on the field,
in specifying and selecting the study sample,
in [performing the experiment],
in measuring exposures and outcomes,
in analyzing the data,
in interpreting the analysis, and
in publishing the results. [Sackett 1979]
Common biases have been covered elsewhere on Less Wrong, so I’ll let those articles explain how biases work.
Example partial solution
There is some evidence that the skills of rationality (e.g. cognitive override) are teachable. Training scientists to notice and mitigate the biases that arise in their own thinking may help them reduce the magnitude and frequency of the errors that can derail truth-seeking at each stage of the scientific process.
Problem 3: Bad statistics
I remember when my statistics professor first taught me the reasoning behind “null hypothesis significance testing” (NHST), the standard technique for evaluating experimental results. NHST uses “p-values,” which are statements about the probability of getting some data (e.g. one’s experimental results) given the hypothesis being tested. I asked my professor, “But don’t we want to know the probability of the hypothesis we’re testing given the data, not the other way around?” The reply was something about how this was the best we could do. (But that’s false, as we’ll see in a moment.)
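To see why the direction of the conditional matters, here is a minimal simulation sketch in Python. The numbers are assumptions chosen only for illustration (a world where 10% of tested hypotheses are true, a modest real effect, 30 subjects per group); the point is that a result can clear p < 0.05 while the hypothesis behind it still has roughly a coin-flip chance of being false.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_studies = 5000      # hypothetical studies (assumption)
prior_true = 0.10     # share of tested hypotheses that are actually true (assumption)
n_per_group = 30      # subjects per group (assumption)
effect_size = 0.5     # true effect, in standard deviations, when the hypothesis is true

is_true = rng.random(n_studies) < prior_true
p_values = np.empty(n_studies)
for i in range(n_studies):
    treatment = rng.normal(effect_size if is_true[i] else 0.0, 1.0, n_per_group)
    control = rng.normal(0.0, 1.0, n_per_group)
    p_values[i] = stats.ttest_ind(treatment, control).pvalue

significant = p_values < 0.05
print("Share of 'significant' findings whose hypothesis is false:",
      round(np.mean(~is_true[significant]), 2))
```

The p-value answers “how surprising would this data be if the null were true?”, which is simply not the same question as “how likely is the hypothesis, given this data?”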
Another problem is that NHST computes the probability of getting data as unusual as the data one collected by considering what might be expected if that particular experiment was repeated many, many times. But how do we know anything about these imaginary repetitions? If I want to know something about a particular earthquake, am I supposed to imagine a few dozen repetitions of that earthquake? What does that even mean?
I tried to answer these questions on my own, but all my textbooks assumed the soundness of the mistaken NHST framework for scientific practice. It’s too bad I didn’t have a class with biostatistician Steven Goodman, who says:
The p-value is almost nothing sensible you can think of. I tell students to give up trying.
The sad part is that the logical errors of NHST are old news, and have been known ever since Ronald Fisher began advocating NHST in the 1920s. By 1960, Fisher had out-advocated his critics, and philosopher William Rozeboom remarked:
Despite the awesome pre-eminence [NHST] has attained… it is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research.
There are many more problems with NHST and with “frequentist” statistics in general, but the central one is this: NHST does not follow from the axioms (foundational logical rules) of probability theory. It is a grab-bag of techniques that, depending on how those techniques are applied, can lead to different results when analyzing the same data — something that should horrify every mathematician.
The inferential method that solves the problems with frequentism — and, more importantly, follows deductively from the axioms of probability theory — is Bayesian inference.
So why aren’t all scientists using Bayesian inference instead of frequentist inference? Partly, we can blame the vigor of NHST’s early advocates. But we can also attribute NHST’s success to the simple fact that Bayesian calculations can be more difficult than frequentist calculations. Luckily, new software tools like WinBUGS let computers do most of the heavy lifting required for Bayesian inference.
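For simple models the heavy lifting is not even needed; a conjugate update takes a few lines. Here is a minimal sketch, with made-up hit counts and a uniform prior (both assumptions, not taken from any study mentioned here), for a guess-the-card style task:

```python
from scipy import stats

# Beta-Binomial conjugate update: no MCMC required for this simple model.
prior_a, prior_b = 1, 1         # Beta(1, 1) = uniform prior over the hit rate
hits, trials = 53, 100          # hypothetical data

posterior = stats.beta(prior_a + hits, prior_b + (trials - hits))

# The quantity NHST never gives you: the probability of the hypothesis
# "the hit rate is above chance," given the data.
print("P(hit rate > 0.5 | data) =", round(1 - posterior.cdf(0.5), 3))
print("95% credible interval:", posterior.ppf([0.025, 0.975]).round(3))
```

Tools like WinBUGS exist for the many models where no closed-form update like this is available.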
There’s also the problem of sheer momentum. Once a practice is enshrined, it’s hard to dislodge it, even for good reasons. I took three statistics courses in university and none of my textbooks mentioned Bayesian inference. I didn’t learn about it until I dropped out of university and studied science and probability theory on my own.
Remember the study about precognition? Not surprisingly, it was done using NHST. A later Bayesian analysis of the data disconfirmed the original startling conclusion.
Example partial solution
This one is obvious: teach students probability theory instead of NHST. Retrain current scientists in Bayesian methods. Make Bayesian software tools easier to use and more widespread.
Conclusion
If I’m right that there is unambiguous low-hanging fruit for improving scientific practice, this suggests that particular departments, universities, or private research institutions can (probabilistically) out-perform their rivals (in terms of actual discoveries, not just publications) given similar resources.
I’ll conclude with one specific hypothesis. If I’m right, then a research group should be able to hire researchers trained in Bayesian reasoning and in catching publication bias and experimenter bias, and have them extract from the existing literature valuable medical truths that the mainstream medical community doesn’t yet know about. This prediction, in fact, is about to be tested.
I only had time to double-check one of the scary links at the top, and I wasn’t too impressed with what I found:
But the careful review you link to claims that studies funded by the industry report 85% positive results, compared to 72% positive by independent organizations and 50% positive by government—which is not what I think of when I hear four times! They also give a lot of reasons to think the difference may be benign: industry tends to do different kinds of studies than independent orgs. The industry studies are mainly Phase III/IV—a part of the approval process where drugs that have already been shown to work in smaller studies are tested on a larger population; the nonprofit and government studies are more often Phase I/II—the first check to see whether a promising new chemical works at all. It makes sense that studies on a drug which has already been found to probably work are more positive than the first studies on a totally new chemical. And the degree to which pharma studies are more likely to be late-phase is greater than the degree to which pharma companies are more likely to show positive results, and the article doesn’t give stats comparing like to like! The same review finds with p < .001 that pharma studies are bigger, which again would make them more likely to find a result where one exists.
The only mention of the “4x more likely” number is buried in the Discussion section and cites a completely different study, Lexchin et al.
Lexchin reports an odds ratio of 4, which I think is what your first study meant when they say “industry studies are four times more likely to be positive”. Odds ratios have always been one of my least favorite statistical concepts, and I always feel like I’m misunderstanding them somehow, but I don’t think “odds ratio of 4” and “four times more likely” are connotatively similar (someone smarter, please back me up on this?!). For example, the largest study in Lexchin’s meta-analysis, Yaphe et al, finds that 87% of industry studies are positive versus 65% of independent studies, for an odds ratio of 3.45x. But when I hear something like “X is four times more likely than Y”, I think of Y being 20% likely and X being 80% likely; not 65% vs. 87%.
This means Lexchin’s results are very, very similar to those of the original study you cite, which provides some confirmation that those are probably the true numbers. Lexchin also provides another hypothesis for what’s going on. He says that “the research methods of trials sponsored by drug companies is at least as good as that of non-industry funded research and in many cases better”, but that along with publication bias, industry fudges the results by comparing their drug to another drug and then administering the comparison drug incorrectly. For example, if your company makes Drug X, you sponsor a study to prove that it’s better than Drug Y, but give patients Drug Y at a dose that’s too low to do any good (or so high that it produces side effects). Then you conduct that study absolutely perfectly and get the correct result that your drug is better than another drug at the wrong dosage. This doesn’t seem like the sort of thing Bayesian statistics could fix; in fact, it sounds like study interpretation would require domain-specific medical knowledge — someone who could say “Wait a second, that’s not how we usually give penicillin!” I don’t know whether this means industry studies that compare their drug against a placebo are more trustworthy.
So, summary. Industry studies seem to hover around 85% positive, non-industry studies around 65%. Part of this is probably because industry studies are more likely to be on drugs that there’s already some evidence that they work, and not due to scientific misconduct at all. More of it is due to publication bias and to getting the right answer to a wrong question like “Does this work better than another drug when the other is given improperly?”.
Phrases like “Industry studies are four times more likely to show positive results” are connotatively inaccurate and don’t support any of these proposals at all, except maybe the one to reduce publication bias.
This reinforces my prejudice that a lot of the literature on how misleading the literature is, is itself among the best examples of how misleading the literature is.
Yes, “four times as likely” is not the same as an odds ratio of four. And the problem here is the same as the problem in army1987’s LL link that odds ratios get mangled in transmission.
But I like odds ratios. In the limit of small probability, odds ratios are the same as “times as likely.” But there’s nothing 4x as likely as 50%. Does that mean that 50% is very similar to all larger probabilities? Odds ratios are unchanged (or inverted) by taking complements: 4% to 1% is an odds ratio of about 4; 99% to 96% is also 4 (actually 4.1 in both cases). Complementation is exactly what’s going on here. The drug companies get 1.2x-1.3x more positive results than the independent studies. That doesn’t sound so big, but everyone is likely to get positive results. If we speak in terms of negative results, the independent studies are 2-3x as likely to get negative results as the drug companies. Now it sounds like a big effect.
Odds ratios give a canonical distance between probabilities that doesn’t let people cherry-pick between 34% more positives and 3x more negatives. They give us a way to compare any two probabilities that is the obvious one for very small probabilities and is related to the obvious one for very large probabilities. The cost of interpolating between the ends is that they are confusing in the middle. In particular, this “3x more negatives” turns into an odds ratio of 4.
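A minimal sketch of the invariance described above, using the same numbers:

```python
def odds(p):
    return p / (1 - p)

def odds_ratio(p1, p2):
    return odds(p1) / odds(p2)

# Taking complements leaves the odds ratio unchanged (or merely inverts it):
print(odds_ratio(0.04, 0.01))   # 4% vs. 1%   -> ~4.1
print(odds_ratio(0.99, 0.96))   # 99% vs. 96% -> ~4.1
print(odds_ratio(0.96, 0.99))   # the same comparison, inverted -> ~0.24
```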
Sometimes 50% really is similar to all larger probabilities. Sometimes you have a specific view on things and should use that, rather than the off-the-shelf odds ratio. But that doesn’t seem to be true here.
Thank you for this. I’ve always been frustrated with odds ratios, but somehow it never occurred to me that they have the beautiful and useful property you describe.
I don’t know as much about odds ratios as I would like to, but you’ve convinced me that they’re something I should learn thoroughly, ASAP. Does anybody have a link to a good explanation of them?
http://lesswrong.com/lw/8lr/logodds_or_logits/ would be helpful for you, I think, since an explanation/introduction was the stated goal.
Sorry, I don’t have any sources. If you want suggestions from other people, you should try the open thread.
Some related words that may be helpful in searching for material are logit and logistic (regression).
Thanks for this. I’ve removed the offending sentence.
Language Log: Thou shalt not report odds ratios
Or if you want to appropriate a different popular phrase, “Never tell me the odds ratio!”
At the least, it allows one to argue that the claim “scientific papers are generally reliable” is self-undermining. The prior probability is also high, given the revolving door of “study of the week” science reporting we all are regularly exposed to.
A lot of the literature on cognitive biases is itself among the best examples of how biased people are (though unfortunately not usually in ways that would prove their point, with the obvious exception of confirmation bias).
Seems like both teaching about biases and learning about biases are dangerous.
I object, for reasons wonderfully stated by gwern here
That was actually just a slightly-edited-for-Hacker-News excerpt from my standing mini-essay explaining why we can’t trust science too much; the whole thing currently lives at http://www.gwern.net/DNB%20FAQ#fn51
That link points to your Dual N-Back piece. I think you meant https://www.gwern.net/Replication#nhst-and-systematic-biases
I am skeptical of the teaching solution section under 2), relative to institutional shifts (favoring confirmatory vs exploratory studies, etc). Section 3 could also bear mention of some of the many ways of abusing Bayesian statistical analyses (e.g. reporting results based on gerrymandered priors, selecting which likelihood ratio to highlight in the abstract and get media attention for, etc). Cosma Shalizi would have a lot to say about it.
I do like the spirit of the post, but it comes across a bit boosterish.
On this note, I predict that if Bayesian statistical analyses ever displaced NHST as the mainstream standard, they would be misused about as much as NHST.
Currently there’s a selection bias: NHST is much more widely taught than Bayesian analyses, so NHST users are much more likely to be lowest-common-denominator crank-turners who don’t really understand statistics generally. By contrast, if you’ve managed to find out how to do Bayesian inference, you’re probably better at statistics than the average researcher and therefore less likely to screw up whatever analysis you choose to do. If every researcher were taught Bayesian inference this would no longer be true.
Still, I think Bayesian methods are superior enough that the net benefit of that would be positive. (Also, proper Bayesian training would also cover how to construct ignorance priors, and I suspect nefariously chosen priors would be easier to spot than nefarious frequentist mistreatment of data.)
Bayesian methods are better in a number of ways, but ignorant people using a better tool won’t necessarily get better results. I don’t think the net effect of a mass switch to Bayesian methods would be negative, but I do think it’d be very small unless it involved raising the general statistical competence of scientists.
Even when Bayesian methods get so commonplace that they could be used just by pushing a button in SPSS, researchers will still have many tricks at their disposal to skew their conclusions. Not bothering to publish contrary data, only publishing subgroup analyses that show a desired result, ruling out inconvenient data points as “outliers”, wilful misinterpretation of past work, failing to correct for doing multiple statistical tests (and this can be an issue with Bayesian t-tests, like those in the Wagenmakers et al. reanalysis lukeprog linked above), and so on.
As a biologist, I can say that most statistical errors are just that: errors. They are not tricks. If researchers understand the statistics that they are using, a lot of these problems will go away.
A person has to learn a hell of a lot before they can do molecular biology research, and statistics happens to be fairly low on the priority list for most molecular biologists. In many situations we are able to get around the statistical complexities by generating data with very little noise.
Hanlon’s Razor FTW.
ISTM a large benefit of commonplace Bayes would be that competent statisticians could do actually meaningful meta-analyses...? Which would counteract widespread statistical ineptitude to a significant extent...?
I’m not sure it’d make much difference. From reading & skimming meta-analyses myself I’ve inferred that the main speedbumps with doing them are problems with raw data themselves or a lack of access to raw data. Whether the data were originally summarized using NHST/frequentist methods or Bayesian methods makes a lot less difference.
Edit to add: when I say “problems with raw data themselves” I don’t necessarily mean erroneous data; a problem can be as mundane as the sample/dataset not meeting the meta-analyst’s requirements (e.g. if the sample were unrepresentative, or the dataset didn’t contain a set of additional moderator variables).
I think that teaching Bayesian methods would itself raise the general statistical competence of scientists as a side effect, among other things because the meaning of p-values is seriously counter-intuitive (so more scientists would actually grok Bayesian statistics in such a world than actually grok frequentist statistics right now).
You could well be right. I’m pessimistic about this because I remember seeing lots of people at school & university recoiling from any statistical topic more advanced than calculating means and drawing histograms. If they were being taught about conjugate priors & hyperparameters I’d expect them to react as unenthusiastically as if they were being taught about confidence levels and maximum likelihood. But I don’t have any rock solid evidence for that hunch.
Please don’t insert gratuitous politics into LessWrong posts.
I removed the global warming phrase.
Thanks!
What David_G said. Global warming is a scientific issue. Maybe “what we lack is the will to change things” is the right analysis of the policy problems, but among climate change experts there’s a whole lot more consensus about global warming than there is among AI researchers about the Singularity. “You can’t say controversial things about global warming, but can say even more controversial things about AI” is a rule that makes about as much sense as “teach the controversy” about evolution.
...and what to do about it is a political issue.
It’s also a political issue, to a much greater extent than the possibility and nature of a technological singularity.
Evolution is also a political issue. Shall we now refrain from talking about evolution, or mentioning what widespread refusal to accept evolution, up to the point of there being a strong movement to undermine the teaching of evolution in US schools, says about human rationality?
I get that it can be especially hard to think rationally about politics. And I agree with what Eliezer has written about government policy being complex and almost always involving some trade-offs, so that we should be careful about thinking there’s an obvious “rationalist view” on policy questions.
However, a ban on discussing issues that happen to be politicized is idiotic, because it puts us at the mercy of contingent facts about what forms of irrationality happen to be prevalent in political discussion at this time. Evolution is a prime example of this. Also, if the singularity became a political issue, would we ban discussion of that from LessWrong?
We should not insert political issues which are not relevant to the topic, because the more political issues one brings into the discussion, the less rational it becomes. It would be safest to discuss all issues separately, but sometimes it is not possible, e.g. when the topic being discussed relies heavily on evolution.
One part of trying to be rational is to accept that people are not rational, and act accordingly. For every political topic there is a number of people whose minds will turn off if they read something they disagree with. It does not mean we should be quiet on the topic, but we should not insert it where it is not relevant.
Explaining why X is true, in a separate article, is the correct approach. Saying or suggesting something like “by the way, people who don’t think X is true are wrong” in an unrelated topic is the wrong approach. Why is it so? In the first example you expect your proof of X to be discussed in the comments, because it is the issue. In the second example, discussions about X in comments are off-topic. Asserting X in a place where discussion of X is unwelcome is a kind of Dark Arts; we should avoid it even if we think X is true.
The topic of evolution, unlike the topic of climate change, is entangled with human psychology, AI, and many other important topics; not discussing it would be highly costly. Moreover, if anyone on LessWrong disagrees with evolution, it’s probably along Newsomian eccentric lines, not along tribal political lines. Also, lukeprog’s comments on the subject made implicit claims about the policy implications of the science, not just about the science itself, which in turn is less clear-cut than the scientific case against a hypothesis requiring a supernatural agent, though for God’s sake please nobody start arguing about exactly how clear-cut.
As a matter of basic netiquette, please use words like “mistaken” or “harmful” instead of “idiotic” to describe views you disagree with.
This post is mostly directed at newbies, who aren’t supposed to be trained in trying to keep their brains from shutting down whenever the “politics” pattern-matcher goes off.
In other words, it could cause some readers to stop reading before they get to the gist of the post. Even at Hacker News, I sometimes see “I stopped reading at this point” posts.
Also, I see zero benefit from mentioning global warming specifically in this post. Even a slight drawback outweighs zero benefit.
Oh dear… I admit I hadn’t thought of the folks who will literally stop reading when they hit a political opinion they don’t like. Yeah, I’ve encountered them. Though I think they have bigger problems than not knowing how to fix science, and don’t think mentioning AGW did zero for this post.
(I don’t necessarily disagree with your points, I was simply making a relevant factual claim; yet you seem to have unhesitatingly interpreted my factual claim as automatically implying all sorts of things about what policies I would or would not endorse. Hm...)
I didn’t interpret it as anything about what gov. policies you’d endorse. I did infer you agreed with Steven’s comment. But anyway, my first comment may not have been clear enough, and I think the second comment should be a useful explication of the first one.
(Actually, I meant to type “Maybe… isn’t the right analysis...” or “Maybe… is the wrong analysis...” That was intended as acknowledgement of the reasons to be cautious about talking policy. But I botched that part. Oops.)
By “policies” I meant “norms of discourse on Less Wrong”. I don’t have any strong opinions about them; I don’t unhesitatingly agree with Steven’s opinion. Anyway I’m glad this thread didn’t end up in needless animosity; I’m worried that discussing discussing global warming, or more generally discussing what should be discussed, might be more heated than discussing global warming itself.
Yeah. I thought of making another thread for this issue.
As for the difference with the singularity, views on that are not divided much along tribal political lines (ETA: as you acknowledge), and LessWrong seems much better placed to have positive influence there because the topic has received much less attention, because of LessWrong’s strong connection (sociological if nothing else) with the Singularity Institute, and because it’s a lot more likely to amount to an existential risk in the eyes of most people here of any political persuasion, though again let’s not discuss whether they’re right.
The point of politics is the mind-killer is that one shouldn’t use politically-charged examples when they’re not on-topic. This is exactly that case. The article is not about global warming, so it should not make mention of global warming, because that topic makes some people go insane.
This does not mean that there cannot be a post about global warming (to the extent that it’s on-topic for the site).
Also, “will” may be the wrong concept.
How about “not enough people with the power to change things see sufficient reasons to do so”?
Basic science isn’t political here. Things like “Humans cause global warming; There is no God; Humans evolved from apes” are politicized some places but here they are just premises. There is no need to drag in political baggage by making “This Is Politics!” declarations in cases like this.
Do you see how the claim that “humans cause global warming” differs from the claim quoted in the grandparent comment?
It takes the form of a premise in a sentence that goes on to use it as an illustration of another known problem that isn’t realistically going to be solved any time soon. I don’t think I accept your implication here and in my judgement you introduced politics, not the opening post and so it is your comment that I would wish to see discouraged.
That’s hardly gratuitous. Don’t fall prey to the “‘politics is the mind killer’ is the mind killer” effect. Not all mentions of Hitler are Godwinations.
I would have been less bothered if the comparison gave insight into the structure of the problem beyond just the claim that solutions exist and people don’t have the will to implement them, and/or if it had been presented as opinion rather than fact.
Why? By any reasonable definition, it is fact. We shouldn’t step away from essentially proven facts just because some people make political controversies out of them. In fact these are the examples we should be bringing up more, if we want to be rational when it’s harder, not just rational when it’s easy.
and we can read LW articles while standing on our heads to make it even harder!
In general, it does not seem like a good idea to make your ideas artificially hard to understand.
What exactly does “being rational” mean in this context? Rationality is a way to come to the right conclusions from the available data. If you show the data and how you reached the conclusion, you have shown rationality (assuming there is no lower-level problem, for example that you have previously filtered the data). If you only show the conclusion—well, even if it happens to be the right conclusion, you didn’t demonstrate that you have achieved it rationally.
The mere fact that someone states some conclusion is not proof of rationality. It may be a wrong conclusion, but even if it is the right conclusion, it is very weak evidence of the author’s rationality, because they might as well just be professing their group’s beliefs. And people being what they usually are, when someone states a conclusion without showing how they reached it, I would put a high prior probability on them professing group beliefs.
There is no utility in “trying harder” per se; only the results matter. If we want to increase the general sanity waterline, we should do the things that increase the chance of success, not the harder things. What exactly are we trying to do? If we are trying to signal to people with the same opinions, we could write on the LW homepage in big letters: “global warming is true, parallel universes exist, science is broken, and if you don’t believe this, you are not rational enough” — but what exactly would that achieve? I don’t think it would attract people who want to study rationality.
Choose your battles wisely. Talk about global warming when global warming is the topic of discussion. Same with parallel universes, etc. Imagine going to a global warming conference and talking about parallel universes — does this fit under the label “being rational when it’s harder”?
My phrasing “when it’s harder, not just rational when it’s easy” was poor. Let me make my points another way.
First of all, do you believe that “But as with the problem of global warming and its known solutions, what we lack is the will to change things” is incorrect? Because I’ve seen very few people objecting, just people arguing that “other people” may find the phrasing disturbing, or objectionable, or whatever. If you object, say so. The audience is the less wrong crowd; if they reject the rest of the post over that one sentence, then what exactly are they doing at this website?
Parallel universes require a long meta explanation before people can even grasp your point, and, more damningly, they are rejected by experts in the field. Yes, the experts are most likely wrong, but it takes a lot of effort to see that. If someone says “I don’t believe in Everett branches and parallel universes”, I don’t conclude they are being irrational, just that they haven’t been exposed to all the arguments in excruciating detail, or are following a — generally correct — “defer to the scientific consensus” heuristic.
But if someone rejects the global warming consensus, then they are being irrational, and this should be proclaimed, again and again. No self-censorship because some people find it “controversial”.
I am not very good at estimating probabilities, but I would guess: 99% that global warming is happening; 95% that the human contribution is very significant; 95% that in a rational world we could reduce the human contribution, though not necessarily to zero.
Climate change also requires some investigation. As an example, I have never studied anything remotely similar to climatology, and I have no idea who the experts in the field are. (I could do this, but I have limited time and different priorities.) People are giving me all kinds of data, many of them falsified, and I don’t have the background knowledge to tell the difference. So basically, in my situation, all I have is hearsay, and it’s just my decision whom to trust. (Unless I want to ignore my other priorities and invest a lot of time in this topic, which has no practical relevance to my everyday life.)
Despite all this, over the years I have done some intuitive version of probabilistic reasoning; I have unconsciously noticed that some things correlate with other things (for example: people who are wrong when discussing one topic have a somewhat higher probability of being wrong when discussing another topic, some styles of discussion are somewhat more probably used by people who are wrong, etc.), so gradually my model of the world started strongly suggesting that “there is global warming” is a true statement. Yet it is all very indirect reasoning on my part — so I can understand how a person just as ignorant about this topic as me could, with some probability, come to a different conclusion.
No one is perfectly rational, right? People make all kinds of transgressions against rationality, and “rejecting the global warming consensus” seems to me like a minor one, compared with the alternatives. Such a person could still be in the top 1 percentile of rationality, mostly because humans generally are not very rational.
Anyway, the choice (at least as I see it) is not between “speak about global warming” or “not speak about global warming”, but between “speak about global warming in a separate article, with arguments and references” and “drop the mention in unrelated places, as applause lights”. Some people consider this approach bad even when it is about theism, which in my opinion is a hundred times larger transgression against rationality.
Writing about global warming is a good thing to do, and it belongs on LW, and avoiding it would be bad. It just should be done in a way that emphasises that we speak about rational conclusions, and not only promote our group-think. Because it is a topic where most people promote some group-think, when this topic is introduced there is a high prior probability that it was introduced for bad reasons.
Thanks for your detailed response!
I feel the opposite—global warming denial is much worse than (mild) theism. I explain more in: http://lesswrong.com/r/discussion/lw/aw6/global_warming_is_a_better_test_of_irrationality/
And yet it leads you to a 99% probability assignment. :-/
Because it is a lot of indirect reasoning. Literally, decades of occasional information. Even weak patterns can become visible after enough exposure. I have learned even before finding LW that underconfidence is also a sin.
As an analogy: if you throw a coin 10 times, and one side comes up 6 times and other side 4 times, it does not mean much. But if you throw the same coin 1000 times, and one side comes up 600 times and other side 400 times, the coin is almost surely not fair. After many observations you see something that was not visible after few observations.
And just like I cannot throw the same coin 10 more times to convince you that it is not fair (you would have to either see all 1000 experiments, or strongly trust my rationality), there is nothing I could write in this comment to justify my probability assignment. I can only point to the indirect evidence: one relatively stronger data point would be the relative consensus of LW contributors.
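A minimal sketch of that coin intuition in Bayesian terms; the uniform prior over the unknown bias is my own assumption, chosen only to make the arithmetic easy:

```python
from scipy import stats

# Bayes factor for "the coin has some unknown bias" (uniform prior on the bias)
# versus "the coin is exactly fair". Under a uniform prior, every possible
# head-count is equally likely a priori, so P(data | biased) = 1 / (n + 1).
for n, heads in [(10, 6), (1000, 600)]:
    p_fair = stats.binom.pmf(heads, n, 0.5)
    p_biased = 1.0 / (n + 1)
    print(f"{heads}/{n} heads: Bayes factor (biased vs. fair) = {p_biased / p_fair:.2g}")
```

Six heads in ten flips slightly favours the fair coin; six hundred in a thousand favours bias by a factor in the tens of millions — the same ratio, but vastly more evidence.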
Sure, lots of pieces of weak evidence can add up to strong evidence… provided they’re practically independent of each other. And since this issue gets entangled with Green vs Blue politics, the correlation between the various pieces of weak evidence might not be that small. (If the coin was always flipped by the same person, who was always allowed to see which side faced up before flipping it, they could well have used a method of flipping which systematically favoured a certain side — E.T. Jaynes’s book describes some such methods.)
Or your honesty.
That is, if you say to me “I flipped this coin 1000 times and recorded the results in this Excel spreadsheet, which shows 600 heads and 400 tails,” all I have to believe is that you really did flip the coin 1000 times and record the results. That assumes you’re honest, but sets a pretty low lower bound for your rationality.
But two Bayesian inferences from the same data can also give different results. How could this be a non-issue for Bayesian inference while being indicative of a central problem for NHST? (If the answer is that Bayesian inference is rigorously deduced from probability theory’s axioms but NHST is not, then the fact that NHST can give different results for the same data is not a true objection, and you might want to rephrase.)
By a coincidence of dubious humor, I recently read a paper on exactly this topic, how NHST is completely misunderstood and employed wrongly and what can be improved! I was only reading it for a funny & insightful quote, but Jacob Cohen (as in, ‘Cohen’s d’) in pg 5-6 of “The Earth Is Round (p < 0.05)” tells us that we shouldn’t seek to replace NHST with a “magic alternative” because “it doesn’t exist”. What we should do is focus on understanding the data with graphics and datamining techniques; report confidence limits on effect sizes, which gives us various things I haven’t looked up; and finally, place way more emphasis on replication than we currently do.
An admirable program; we don’t have to shift all the way to Bayesian reasoning to improve matters. Incidentally, what Bayesian inferences are you talking about? I thought the usual proposals/methods involved principally reporting log odds, to avoid exactly the issue of people having varying priors and updating on trials to get varying posteriors.
This only works in extremely simple cases.
Could you give an example of an experiment that would be too complex for log odds to be useful?
Any example where there are more than two potential hypotheses.
Note that, for example, “this coin is unbiased”, “this coin is biased toward heads with p=.61”, and “this coin is biased toward heads with p=.62” count as three different hypotheses for this purpose.
This is fair as a criticism of log-odds, but in the example you give, one could avoid the issue of people having varying priors by just reporting the value of the likelihood function. However, this likelihood function reporting idea fails to be a practical summary in the context of massive models with lots of nuisance parameters.
I didn’t have any specific examples in mind. But more generally, posteriors are a function of both priors and likelihoods. So even if one avoids using priors entirely by reporting only likelihoods (or some function of the likelihoods, like the log of the likelihood ratio), the resulting implied inferences can change if one’s likelihoods change, which can happen by calculating likelihoods with a different model.
If the OP is read to hold constant everything not mentioned as a difference, that includes the prior beliefs of the person doing the analysis, as against the hypothetical analysis that wasn’t performed by that person.
Does “two Bayesian inferences” imply it is two different people making those inferences, with two people not possibly having identical prior beliefs? Could a person performing axiom-obeying Bayesian inference reach different conclusions than that same person hypothetically would have had they performed a different axiom-obeying Bayesian inference?
I think my reply to gwern’s comment (sibling of yours) all but answers your two questions already. But to be explicit:
Not necessarily, no. It could be two people who have identical prior beliefs but just construct likelihoods differently. It could be the same person calculating two inferences that rely on the same prior but use different likelihoods.
I think so. If I do a Bayesian analysis with some prior and likelihood-generating model, I might get one posterior distribution. But as far as I know there’s nothing in Cox’s theorem or the axioms of probability theory or anything like those that says I had to use that particular prior and that particular likelihood-generating model. I could just as easily have used a different prior and/or a different likelihood model, and gotten a totally different posterior that’s nonetheless legitimate.
The way I interpret hypotheticals in which one person is said to be able to do something other than what they will do, such as “depending on how those techniques are applied,” all of the person’s priors are to be held constant in the hypothetical. This is the most charitable interpretation of the OP because the claim is that, under Bayesian reasoning, results do not depend on how the same data is applied. This seems obviously wrong if the OP is interpreted as discussing results reached after decision processes with identical data but differing priors, so it’s more interesting to talk about agents with other things differing, such as perhaps likelihood-generating models, than it is to talk about agents with different priors.
Can you give an example?
But even if we assume the OP means that data and priors are held constant but not likelihoods, it still seems to me obviously wrong. Moreover, likelihoods are just as fundamental to an application of Bayes’s theorem as priors, so I’m not sure why I would have/ought to have read the OP as implicitly assuming priors were held constant but not likelihoods (or likelihood-generating models).
I didn’t have one, but here’s a quick & dirty ESP example I just made up. Suppose that out of the blue, I get a gut feeling that my friend Joe is about to phone me, and a few minutes later Joe does. After we finish talking and I hang up, I realize I can use what just happened as evidence to update my prior probability for my having ESP. I write down:
my evidence: “I correctly predicted Joe would call” (call this E for short)
the hypothesis H0 — that I don’t have ESP — and its prior probability, 95%
the opposing hypothesis H1 — that I have ESP — and its prior probability, 5%
Now let’s think about two hypothetical mes.
The first me guesses at some likelihoods, deciding that P(E | H0) and P(E | H1) were both 10%. Turning the crank, it gets a posterior for H1, P(H1 | E), that’s proportional to P(H1) P(E | H1) = 5% × 10% = 0.5%, and a posterior for H0, P(H0 | E), that’s proportional to P(H0) P(E | H0) = 95% × 10% = 9.5%. Of course its posteriors have to add to 100%, not 10%, so it multiplies both by 10 to normalize them. Unsurprisingly, as the likelihoods were equal, its posteriors come out at 95% for H0 and 5% for H1; the priors are unchanged.
When the second me is about to guess at some likelihoods, its brain is suddenly zapped by a stray gamma ray. The second me therefore decides that P(E | H0) was 2% but that P(E | H1) was 50%. Applying Bayes’s theorem in precisely the same way as the first me, it gets a P(H1 | E) proportional to 5% × 50% = 2.5%, and a P(H0 | E) proportional to 95% × 2% = 1.9%. Normalizing (but this time multiplying by 100/(2.5+1.9)) gives posteriors of P(H0 | E) = 43.2% and P(H1 | E) = 56.8%.
So the first me still strongly doubts it has ESP after updating on the evidence, but the second me ends up believing ESP the more likely hypothesis. Yet both used the same method of inference, the same piece of evidence and the same priors!
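For anyone who wants to check the arithmetic, here is a minimal sketch reproducing both hypothetical updates:

```python
def posteriors(priors, likelihoods):
    """Multiply each prior by its likelihood, then normalize (Bayes' theorem)."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    return [round(u / total, 3) for u in unnormalized]

priors = [0.95, 0.05]                    # P(H0) = no ESP, P(H1) = ESP

print(posteriors(priors, [0.10, 0.10]))  # first me  -> [0.95, 0.05]
print(posteriors(priors, [0.02, 0.50]))  # second me -> [0.432, 0.568]
```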
It’s ridiculous to call non-scientific methods “useless”. Our civilization is based on such non-scientific methods. Observation, anecdotal evidence, trial and error, markets etc. are all deeply unscientific and extremely useful ways of gaining useful knowledge. Next to these, Science is really a fairly minor pursuit.
I’d say that existing folk practices and institutions (what I think you mean by “our civilization”) are based on the non-survival of rival practices and institutions. Our civilization has the institutions it has, for the same reason that we have two eyes and not three — not because two eyes are better than three, but because any three-eyed rivals to prototypical two-eyed ancestors happened not to survive.
Folk practices have typically been selected at the speed of generations, with cultures surviving or dying out — the latter sometimes due to war or disease; but sometimes just as the youth choose to convert to a more successful culture. Science aims at improving knowledge at a faster rate than folk practice selection.
One senses that the author (the one in the student role) neither has understood the relative-frequency theory of probability nor has performed any empirical research using statistics—lending the essay the tone of an arrogant neophyte. The same perhaps for the professor. (Which institution is on report here?) Frequentists reject the very concept of “the probability of the theory given the data.” They take probabilities to be objective, so they think it a category error to remark about the probability of a theory: the theory is either true or false, and probability has nothing to do with it.
You can reject relative-frequentism (I do), but you can’t successfully understand it in Bayesian terms. As a first approximation, it may be better understood in falsificationist terms. (Falsificationism keeps getting trotted out by Bayesians, but that construct has no place in a Bayesian account. These confusions are embarrassingly amateurish.) The Fisher paradigm is that you want to show that a variable made a real difference — that what you discovered wasn’t due to chance. However, there’s always the possibility that chance intervened, so the experimenter settles for a low probability that chance alone was responsible for the result. If the probability (the p value) is low enough, you treat it as sufficiently unlikely not to be worth worrying about, and you can reject the hypothesis that the variable made no difference.
If, like me, you think it makes sense to speak of subjective probabilities (whether exclusively or along with objective probabilities), you will usually find an estimate of the probability of the hypothesis given the data, as generated by Bayesian analysis, more useful. That doesn’t mean it’s easy or even possible to do a Bayesian analysis that will be acceptable to other scientists. To get subjective probabilities out, you must put subjective probabilities in. Often the worry is said to be the infamous problem of estimating priors, but in practice the likelihood ratios are more troublesome.
Let’s say I’m doing a study of the effect of arrogance on a neophyte’s confidence that he knows how to fix science. I develop and norm a test of Arrogance/Narcissism and also an inventory of how strongly held a subject’s views are in the philosophy of science and the theory of evidence. I divide the subjects into two groups according to whether they fall above or below the A/N median. I then use Fisherian methods to determine whether there’s an above-chance level of unwarranted smugness among the high A/N group. Easy enough, but limited. It doesn’t tell me what I most want to know: how much credence I should put in the results. I’ve shown there’s evidence for an effect, but there’s always evidence for some effect: the null hypothesis, strictly speaking, is always false. No two entities outside of fundamental physics are exactly the same.
Bayesian analysis promises more, but whereas other scientists will respect my crude frequentist analysis as such — although many will denigrate its real significance — many will reject my Bayesian analysis out of hand due to what must go into it. Let’s consider just one of the factors that must enter the Bayesian analysis. I must estimate the probability that the ‘high-Arrogance’ subjects will score higher on Smugness if my theory is wrong, that is, if arrogance really has no effect on Smugness. Certainly my Arrogance/Narcissism test doesn’t measure the intended construct without impurities. I must estimate the probability that all the impurities combined, or any of them, confound the results. Maybe high-Arrogance scorers are dumber in addition to being more arrogant, and that is what’s responsible for some of the correlation. Somehow, I must come up with a responsible way to estimate the probability of getting my results if Arrogance had nothing to do with Smugness. Perhaps I can make an informed approximation, but it will be unlikely to dovetail with the estimates of other scientists. Soon we’ll be arguing about my assumptions — and what we’ll be doing will be more like philosophy than empirical science.
The lead essay provides a biased picture of the advantages of Bayesian methods by completely ignoring its problems. A poor diet for budding rationalists.
Then they should also reject the very concept of “the probability of the data given the theory”, since that quantity has “the probability of the theory” explicitly in the denominator.
You are reading “the probability of the data D given the theory T” to mean p(D | T), which in turn is short for a ratio p(D & T)/p(T) of probabilities with respect to some universal prior p. But, for the frequentist, there is no universal prior p being invoked.
Rather, each theory comes with its own probability distribution p_T over data, and “the probability of the data D given the theory T” just means p_T(D). The different distributions provided by different theories don’t have any relationship with one another. In particular, the different distributions are not the result of conditioning on a common prior. They are incommensurable, so to speak.
The different theories are just more or less correct. There is a “true” probability of the data, which describes the objective propensity of reality to yield those data. The different distributions from the different theories are comparable only in the sense that they each get that true distribution more or less right.
Not LessWronger Bayesians, in my experience.
What about:
It would be more accurate to say that LW-style Bayesians consider falsificationism to be subsumed under Bayesianism as a sort of limiting case. Falsificationism as originally stated (ie, confirmations are irrelevant; only falsifications advance knowledge) is an exaggerated version of a mathematically valid claim. From An Intuitive Explanation of Bayes’ Theorem:
This seems the key step for incorporating falsification as a limiting case; I contest it. The rules of Bayesian rationality preclude assigning an a priori probability of 1 to a synthetic proposition: nothing empirical is so certain that refuting evidence is impossible. (Is that assertion self-undermining? I hope that worry can be bracketed.) As long as you avoid assigning probabilities of 1 or 0 to priors, you will never get an outcome at those extremes.
But since P(X|A) is always “intermediate,” observing X will never strictly falsify A — which is a good thing, because the falsification prong of Popperianism has proven at least as scientifically problematic as the nonverification prong.
I don’t think falsification can be squared with Bayes, even as a limiting case. In Bayesian theory, verification and falsification are symmetric (as the slider metaphor really indicates). In principle, you can’t strictly falsify a theory empirically any more (or less) than you can verify one. Verification, as the quoted essay confirms, is blocked by the > 0 probability mandatorily assigned to unpredicted outcomes; falsification is blocked by the < 1 probability mandatorily assigned to the expected results. It is no less irrational to be certain that X holds given A than to be certain that X fails given not-A. You are no more justified in assuming absolutely that your abstractions don’t leak than in assuming you can range over all explanations.
This throws the baby out with the bathwater; we can falsify and verify to degrees. Refusing the terms verify and falsify because we are not able to assign infinite credence seems like a mistake.
I agree; that’s why “strictly.” But you seem to miss the point, which is that falsification and verification are perfectly symmetric: whether you call the glass half empty or half full on either side of the equation wasn’t my concern.
Two basic criticisms apply to Popperian falsificationism: 1) it ignores verification (although the “verisimilitude” doctrine tries to overcome this limitation); and 2) it does assign infinite credence to falsification.
No. 2 doesn’t comport with the principles of Bayesian inference, but seems part of LW Bayesianism (your term):
This allowance of a unitary probability assignment to evidence conditional on a theory is a distortion of Bayesian inference. The distortion introduces an artificial asymmetry into the Bayesian handling of verification versus falsification. It is irrational to pretend—even conditionally—to absolute certainty about an empirical prediction.
We all agree on this point. Yudkowsky isn’t supposing that anything empirical has probability 1.
In the line you quote, Yudkowsky is saying that even if theory A predicts data X with probability 1 (setting aside the question of whether this is even possible), confirming that X is true still wouldn’t push our confidence in the truth of A past a certain threshold, which might be far short of 1. (In particular, merely confirming a prediction X of A can never push the posterior probability of A above p(A|X), which might still be too small because too many alternative theories also predict X). A falsification, on the other hand, can drive the probability of a theory very low, provided that the theory makes some prediction with high confidence (which needn’t be equal to 1) that has a low prior probability.
That is the sense in which it is true that falsifications tend to be more decisive than confirmations. So, a certain limited and “caveated”, but also more precise and quantifiable, version of Popper’s falsificationism is correct.
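A toy calculation (my own numbers, chosen only to illustrate the asymmetry described above): theory A predicts outcome X with probability 0.99, but two of its rivals also predict X, so observing X lifts A only modestly, while observing not-X all but eliminates it.

    # Purely illustrative numbers: one theory A plus three rivals.
    priors = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
    p_x_given = {"A": 0.99, "B": 0.90, "C": 0.80, "D": 0.05}  # p(X | theory)

    def posterior_of_A(observed_x):
        likes = {t: (p if observed_x else 1 - p) for t, p in p_x_given.items()}
        z = sum(priors[t] * likes[t] for t in priors)
        return priors["A"] * likes["A"] / z

    print(posterior_of_A(True))    # ~0.36: confirmation helps A only a little,
                                   # because B and C also predicted X
    print(posterior_of_A(False))   # ~0.008: failing a confident prediction is
                                   # far more decisive, though never exactly 0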
Yes, no observation will drive the probability of a theory down to precisely 0. The probability can only be driven very low. That is why I called falsificationism “an exaggerated version of a mathematically valid claim”.
As you say, getting to probability 0 is as impossible as getting to probability 1. But getting close to probability 0 is easier than getting equally close to probability 1.
This asymmetry is possible because different kinds of propositions are more or less amenable to being assigned extremely high or low probability. It is relatively easier to show that some data has extremely high or low probability (whether conditional on some theory or a priori) than it is to show that some theory has extremely high conditional probability.
Fix a theory A. It is very hard to think up an experiment with a possible outcome X such that p(A | X) is nearly 1. To do this, you would need to show that no other possible theory, even among the many theories you haven’t thought of, could have a significant amount of probability, conditional on observing X.
It is relatively easy to think up an experiment with a possible outcome X, which your theory A predicts with very high probability, but which has very low prior probability. To accomplish this, you only need to exhibit some other a priori plausible outcomes different from X.
In the second case, you need to show that the probability of some data is extremely high a posteriori and extremely low a priori. In the first case, you need to show that the a posteriori probability of a theory is extremely high.
In the second case, you only need to construct enough alternative outcomes to certify your claim. In the first case, you need to prove a universal statement about all possible theories.
One root of the asymmetry is this: As hard as it might be to establish extreme probabilities for data, at least the data usually come from a reasonably well-understood parameter space (the real numbers, say). But the space of all possible theories is not well understood, at least not in any computationally tractable way.
All these arguments are at best suggestive. Our abductive capacities are such as to suggest that proving a universal statement about all possible theories isn’t necessarily hard. Your arguments, I think, flow from and then confirm a nominalistic bias: accept concrete data; beware of general theories.
There are universal statements known with greater certainty than any particular data, e.g., life evolved from inanimate matter and mind always supervenes on physics.
I agree that
1) some universal statements about all theories are very probable, and that
2) some of our theories are more probable than any particular data.
I’m not seeing why either of these facts is in tension with my previous comment. Would you elaborate?
The claims I made are true of certain priors. I’m not trying to argue you into using such a prior. Right now I only want to make the points that (1) a Bayesian can coherently use a prior satisfying the properties I described, and that (2) falsificationism is true, in a weakened but precise sense, under such a prior.
I hope they’re not using that landing page for anything important. It’s not clear what product (if any) they’re selling, there’s no call to action, and in general it looks to me like it’s doing a terrible job of overcoming inferential distances. I’d say you did a far better job of selling them than they did. Someone needs to read half a dozen blog posts about how customers only think of themselves, etc.
Great post by the way, Luke.
The website is currently down and parked by GoDaddy. Archive.org has several snapshots, but they are all 404s since 2012.
Their website used to have more content on it, I don’t know why they changed it.
What should I read to get a good defense of Bayesianism—that isn’t just pointing out difficulties with frequentism, NHST, or whatever? I understand the math, but am skeptical that it can be universally applied, due to problems with coming up with the relevant priors and likelihoods.
It’s like the problem with simple deduction in philosophy. Yes, if your premises are right, valid deductions will lead you to true conclusions, but the problem is knowing whether the premises used by the old metaphysicians (or modern ones, for that matter) are true. Bayesianism fails to solve this problem for many cases (though I’m not denying that you do sometimes know the relevant probabilities).
I do definitely plan on getting my hands on a copy of Richard Carrier’s new book when it comes out, so if that’s currently the best defense of Bayesianism out there, I’ll just wait another two months.
Probability theory can be derived as the extension of classical logic to the case where propositions are assigned plausibilities rather than truth values, so it’s not merely like the GIGO problem with simple deduction—it’s the direct inheritance of that problem.
You’re right. I’ll make sure to say “is the same problem” in the future.
A philosophical treatise of universal induction.
This doesn’t seem particularly actionable for testing scientific hypotheses in general (which is the general problem with proposing Bayes as a way to fix science).
You may want to check out John Earman’s Bayes or Bust?.
I suspect that using only valid deductions, while manipulating terms that already have real meanings attached to them, probably poses at least as great a problem as avoiding untrue premises.
I remember during a logic class I took, the teacher made an error of deduction, and I called her out on it. She insisted that it was correct, and every other student in the class agreed. I tried to explain the mistake to her after class, and wasn’t able to get her to see the error until I drew a diagram to explain it.
It was only an introductory level class, but I don’t get the impression that most practicing philosophers are at a higher standard.
You seem to be conflating Bayesian inference with Bayes Theorem. Bayesian inference is a method, not a proposition, so cannot be the conclusion of a deductive argument. Perhaps the conclusion you have in mind is something like “We should use Bayesian inference for...” or “Bayesian inference is the best method for...”. But such propositions cannot follow from mathematical axioms alone.
Moreover, the fact that Bayes Theorem follows from certain axioms of probability doesn’t automatically show that it’s true. Axiomatic systems have no relevance to the real world unless we have established (whether explicitly or implicitly) some mapping of the language of that system onto the real world. Unless we’ve done that, the word “probability” as used in Bayes Theorem is just a symbol without relevance to the world, and to say that Bayes Theorem is “true” is merely to say that it is a valid statement in the language of that axiomatic system.
In practice, we are liable to take the word “probability” (as used in the mathematical axioms of probability) as having the same meaning as “probability” (as we previously used that word). That meaning has some relevance to the real world. But if we do that, we cannot simply take the axioms (and consequently Bayes Theorem) as automatically true. We must consider whether they are true given our meaning of the word “probability”. But “probability” is a notoriously tricky word, with multiple “interpretations” (i.e. meanings). We may have good reason to think that the axioms of probability (and hence Bayes Theorem) are true for one meaning of “probability” (e.g. frequentist). But it doesn’t automatically follow that they are also true for other meanings of “probability” (e.g. Bayesian).
I’m not denying that Bayesian inference is a valuable method, or that it has some sort of justification. But justifying it is not nearly so straightforward as your comment suggests, Luke.
It stands on the foundations of probability theory, and while foundational stuff like Cox’s theorem takes some slogging through, once that’s in place, it is quite straightforward to justify Bayesian inference.
It’s actually somewhat tricky to establish that the rules of probability apply to the Frequentist meaning of probability. You have to mess around with long run frequencies and infinite limits. Even once that’s done, it’s hard to make the case that the Frequentist meaning has anything to do with the real world—there is no such thing as an infinitely repeatable experiment.
In contrast, a few simple desiderata for “logical reasoning under uncertainty” establish probability theory as the only consistent way to do so that satisfies those criteria. Sure, other criteria may suggest some other way of doing so, but no one has put forward any such reasonable way.
Could Dempster-Shafer theory count? I haven’t seen anyone do a Cox-style derivation of it, but I would guess there’s something analogous in Shafer’s original book.
I would be quite interested in seeing such. Unfortunately I don’t have any time to look for such in the foreseeable future.
P.S. Bayes Theorem is derived from a basic statement about conditional probability, such as the following:
P(S/T) = P(S&T)/P(T)
According to the SEP (http://plato.stanford.edu/entries/epistemology-bayesian/) this is usually taken as a “definition”, not an axiom, and Bayesians usually give conditional probability some real-world significance by adding a Principle of Conditionalization. In that case it’s the Principle of Conditionalization that requires justification in order to establish that Bayes Theorem is true in the sense that Bayesians require.
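For completeness, the standard one-line derivation from that identity (my addition, written in LaTeX with the bar notation for conditioning rather than the slash used above):

    P(S \mid T) = \frac{P(S \wedge T)}{P(T)}, \qquad
    P(T \mid S) = \frac{P(S \wedge T)}{P(S)}
    \quad\Longrightarrow\quad
    P(S \mid T) = \frac{P(T \mid S)\,P(S)}{P(T)}.

The Principle of Conditionalization is the further, non-mathematical claim that, upon learning T, one should adopt P(S|T) as one’s new credence in S; that is the step the diachronic Dutch book arguments mentioned in the replies are meant to justify.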
Just to follow up on the previous replies to this line of thought, see Wikipedia’s article on Cox’s theorem and especially reference 6 of that article.
On the Principle of Conditionalization, it might be argued that Cox’s theorem assumes it as a premise; the easiest way to derive it from more basic considerations is through a diachronic Dutch book argument.
disclaimer: I’m not very knowledgeable in this subject to say the least.
This seems relevant: Share likelihood ratios, not posterior beliefs
It would seem useful for them to publish p(data|hypothesis) because then I can use my priors for p(hypothesis) and p(data) to calculate p(hypothesis|data).
Otherwise, depending on what information they updated on to get their priors I might end up updating on something twice.
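A minimal sketch of that workflow (names and numbers are mine, purely for illustration): the paper reports only how probable the data are under each hypothesis, and each reader combines that with their own prior odds.

    # Hypothetical reported values: p(data | hypothesis) and p(data | null).
    p_data_given_h1 = 0.20
    p_data_given_h0 = 0.02
    likelihood_ratio = p_data_given_h1 / p_data_given_h0   # 10:1 in favour of H1

    my_prior_odds = 0.05 / 0.95              # my own scepticism, not the paper's
    posterior_odds = my_prior_odds * likelihood_ratio
    posterior_prob = posterior_odds / (1 + posterior_odds)
    print(f"p(hypothesis | data) = {posterior_prob:.2f}")   # ~0.34

Because only the likelihood ratio comes from the paper, there is no risk of re-using the authors’ prior information a second time.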
Cigarette smoking: an underused tool in high-performance endurance training
musical contrast and chronological rejuvenation
Effects of remote, retroactive intercessory prayer on outcomes in patients with bloodstream infection: randomised controlled trial
The music link doesn’t work.
I will tentatively suggest that the reported difference is more about people hearing music which was popular when they were younger than about the details of the music.
Prior methods weren’t completely useless. Humans went from hunter-gatherers to civilization without the scientific method or a general notion of science. It is probably more fair to say that science was just much better than all previous methods.
Wait, that confused me. I thought the p-value was the chance of the data given the null hypothesis.
Since NHST is “null hypothesis significance testing”, the hypothesis being tested is the null hypothesis!
In the vernacular, when “testing a hypothesis” we refer to the hypothesis of interest as the one being tested, i.e. the alternative to the null—not the null itself. (For instance, we say things like “test the effect of gender”, not the more cumbersome “test the null hypothesis of the absence of an effect of gender”.)
In any case it wouldn’t hurt the OP, and could only make it clearer, to reword it to remove the ambiguity.
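To make the terminology concrete, here is a small simulation of my own (the data are made up): the p-value is the probability, computed assuming the null hypothesis is true, of a test statistic at least as extreme as the one actually observed.

    # Toy example: p-value by simulation under the null hypothesis.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    sample = rng.normal(loc=0.3, scale=1.0, size=30)   # observed data (hypothetical)
    observed_t = sample.mean() / (sample.std(ddof=1) / np.sqrt(len(sample)))

    # Null hypothesis: the population mean is 0. Simulate under that null and
    # count how often the statistic is at least as extreme as the observed one.
    null_ts = []
    for _ in range(20000):
        x = rng.normal(loc=0.0, scale=1.0, size=30)
        null_ts.append(x.mean() / (x.std(ddof=1) / np.sqrt(len(x))))
    p_value = np.mean(np.abs(null_ts) >= abs(observed_t))

    print(p_value)                                  # close to the analytic answer:
    print(stats.ttest_1samp(sample, 0.0).pvalue)    # scipy's one-sample t-test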
I really like the discussions of the problems, but I would have loved to see more discussions of the solutions. How do we know, more specifically, that they will solve things? What are the obstacles to putting them into effect—why, more specifically, do people just not want to do it? I assume it’s something a bit more complex than a bunch of people going around saying “Yeah, I know science is flawed, but I don’t really feel like fixing it.” (Or maybe it isn’t?)
I know this is stating the obvious, but the next stage after this is for people to regard “science” as what’s in the database rather than what’s in the journals. Otherwise there’s still publication bias (unless people like writing up boring results and journals like publishing them)
Well, the database wouldn’t contain any results. What it does though is reduce the importance of published claims that have a large number of non-published (probably failed) attempts at showing the same effect.
Ideally you want the literature review section of a paper to include a mention of all these related but unpublished experiments, not just other published results.
Boredom is far from the only bad reason that some journals refuse some submissions. Every person in the chain of publication, and that of peer review, must be assumed at least biased and potentially dishonest. Therefore “science” can never be defined by just one database or journal, or even a fixed set of either. Excluded people must always be free to start their own, and their results judged on the processes that produced them. Otherwise whoever is doing the excluding is not to be trusted as an editor.
I hasten to add that this kind of bias exists among all sides and parties.
This, I think, is just one symptom of a more general problem with scientists: they don’t emphasize rigorous logic as much as they should. Science, after all, is not only about (a) observation but about (b) making logical inferences from observation. Scientists need to take (b) far more seriously (not that all don’t, but many do not). You’ve heard the old saying “Scientists make poor philosophers.” It’s true (or at least, true more often than it should be). That has to change. Scientists ought to be amongst the best philosophers in the world, precisely because they ought to be masters of logic.
The problem is that philosophers also make poor philosophers.
Less snarkily, “logical inference” is overrated. It does wonders in mathematics, but rarely does scientific data logically require a particular conclusion.
Well, of course one cannot logically and absolutely deduce much from raw data. But with some logically valid inferential tools in our hands (Occam’s razor, Bayes’ Theorem, Induction) we can probabilistically derive conclusions.
In what sense is Occam’s razor “logically valid”?
Well, it is not self-contradictory, for one thing. For another thing, every time a new postulate or assumption is added to a theory we are necessarily lowering the prior probability because that postulate/assumption always has some chance of being wrong.
Just to clarify something: I would expect most readers here would interpret “logically valid” to mean something very specific—essentially something is logically valid if it can’t possibly be wrong, under any interpretation of the words (except for words regarded as logical connectives). Self-consistency is a much weaker condition than validity.
Also, Occam’s razor is about more than just conjunction. Conjunction says that “XY” has a higher probability than “XYZ”; Occam’s razor says that (in the absence of other evidence), “XY” has a higher probability than “ABCDEFG”.
Hi Giles,
I think Occam’s razor is logically valid in the sense that, although it doesn’t always provide the correct answer, it is certain that it will probably provide the correct answer. Also, I’m not sure if I understand your point about conjunction. I’ve always understood “do not multiply entities beyond necessity” to mean that, all else held equal, you ought to make the fewest number of conjectures/assumptions/hypotheses possible.
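To make the conjunction point concrete (standard probability, my own phrasing): adding a further assumption Z to a hypothesis can never raise its prior probability, since

    P(X \wedge Y \wedge Z) = P(X \wedge Y)\,P(Z \mid X \wedge Y) \le P(X \wedge Y),

with equality only when Z is certain given the rest. Giles’s point stands, though: Occam’s razor as usually invoked also compares non-nested hypotheses (XY versus ABCDEFG), and that comparison needs something beyond this inequality.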
The problem is that the connotations of philosophy (in my mind at least) are more like how-many-angels mindwanking than like On the electrodynamics of moving bodies. (This is likely the effect of studying pre-20th-century philosophers for five years in high school.)
21st century philosophers aren’t much different.
Saying that people should be better is not helpful. Like all people, scientists have limited time and need to choose how to allocate their efforts. Sometimes more observations can solve a problem, and sometimes more careful thinking is necessary. The appropriate allocation depends on the situation and the talents of the researcher in question.
That being said, there may be a dysfunctional bias in how funding is allocated—creating an “all or none” environment where the best strategy for maintaining a basic research program (paying for one’s own salary plus a couple of students) is to be the type of researcher who gets multi-million dollar grants and uses that money to generate gargantuan new datasets, which can then provide the foundation for a sensational publication that everyone notices.
It is important here to distinguish two roles of statistics in science: exploration and confirmation. It seems likely that Bayesian methods are more powerful (and less prone to misuse) than non-Bayesian methods in the exploratory paradigm.
However, for the more important issue of confirmation, the primary role of statistical theory is to: 1) provide a set of quantitative guidelines for scientists to design effective (confirmatory) experiments and avoid being misled by the results of poorly designed experiments or experiments with inadequate sample sizes, and 2) produce results which can be readily interpreted by their scientific peers. And here, the NHST canon for regression and comparison of means fulfills both purposes more effectively than the Bayesian equivalents, primarily due to the technical difficulty of Bayesian procedures for even the simplest problems, such as a normal distribution with unknown mean and variance. While a suitably well-designed Bayesian statistics package is one possible remedy, it would still seem preferable in such cases that scientists learn the usual maximum likelihood estimators, so that they can at least know the formulas for the statistics they are computing. And, as satt and gwern have argued in these comments, it is doubtful that a shift to Bayesianism would prevent scientists from making mistakes like that of the psi study: the use of a Bayesian t-test will not always be sufficient to save the day. Conversely, it is also doubtful that the widespread, correct use of Bayesian methods would make a huge difference in most day-to-day science. Well-designed experiments will produce both convincing p-values and convincing likelihood ratios; when NHST is applicable, a Bayesian approach would at best allow for perhaps a constant-factor reduction in the necessary sample size.
Even statistically competent individuals have good reason to continue using NHST, or non-Bayesian techniques in general. Just as physicists have not stopped using the Newtonian “approximation” in light of the discovery of relativity, it remains perfectly reasonable to use convenient non-Bayesian techniques when they are “good enough for the job.” An especially important case is non/semi-parametric inference: that is, inference with only very weak assumptions on the nature of certain relevant probability distributions. Practical ways of doing Bayesian nonparametric inference still remain to be developed, and while Bayesian nonparametrics is currently an active topic of statistical research, it seems foolish to hope that implementations of Bayesian non/semi-parametric inference can ever be as computationally scalable as their non-Bayesian counterparts.
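As a rough illustration of that comparison (my own toy example, not the commenter’s: the data are simulated, and the BIC-based Bayes factor is only a crude approximation), a classical t-test and a simple Bayesian-flavoured summary typically point the same way on a well-designed experiment:

    # Toy comparison: Welch's t-test p-value next to a rough BIC-approximated
    # Bayes factor. Group sizes and effect size are made up.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    control = rng.normal(loc=0.0, scale=1.0, size=50)
    treated = rng.normal(loc=0.5, scale=1.0, size=50)

    # The usual maximum likelihood estimators for a normal with unknown
    # mean and variance (note the 1/n variance, not 1/(n-1)).
    def normal_mle_loglik(x):
        mu_hat, var_hat = x.mean(), x.var()
        return stats.norm.logpdf(x, mu_hat, np.sqrt(var_hat)).sum()

    # NHST: Welch's two-sample t-test.
    t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)

    # Crude Bayes factor via the BIC approximation:
    # H0 = one common Gaussian (2 params); H1 = one Gaussian per group (4 params).
    pooled = np.concatenate([control, treated])
    bic_h0 = -2 * normal_mle_loglik(pooled) + 2 * np.log(len(pooled))
    bic_h1 = -2 * (normal_mle_loglik(control) + normal_mle_loglik(treated)) \
             + 4 * np.log(len(pooled))
    bf_10 = np.exp((bic_h0 - bic_h1) / 2)

    print(f"p-value = {p_value:.4f}, approximate BF10 = {bf_10:.1f}")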
Thanks for putting this together. There are many interesting links in there.
I am hopeful that Bayesian methods can help to solve some of our problems, and there is constant development of these techniques in biology.
Scientists should pay more attention to their statistical tests, and I often find myself arguing with others when I don’t like their tests. The most important thing that people need to remember is what “NHST” actually does—it rejects the null hypothesis. Once they think about what the null hypothesis is, and realize that they have done nothing more than reject it, they will make a lot of progress.
Not only can’t I get my head around explanations of Bayes formalisms, I have no idea how to apply them to my science. And that’s as a LessWronger. WinBUGS looks 1000x more complicated than those ‘intuitive’ explanations of Bayes like ‘update your beliefs’ and ‘your priors should affect your expectations, then be updated’.
Nitpick: actually last year (March 2011, per http://www.ncbi.nlm.nih.gov/pubmed/21280961 ).
I imagine you intended to link to consilience the concept, not the book. Then again you may just be trying to be subtle.
Wait, is it? You say,
That makes science as we know it the best method indeed, simply by definition. You list some problems with the way science is done, and outline some possible solutions, and that’s fine. But your opening sentence doesn’t say, “science isn’t as efficient as it could be”, it says “science is broken”. That’s a pretty audacious claim that the rest of your post does not support—especially since, in a deliciously ironic twist, you link to scientific articles as evidence that “science is broken”.
Um?
If I collect a set of totally useless methods, followed by method A which is not totally useless, followed by method B which is also not totally useless, method A is the first method I’ve collected that wasn’t totally useless, but it isn’t definitionally the best method.
Perhaps I misinterpreted lukeprog’s words. It sounded to me like he was saying, “all these other methods were useless, and science is the only method we know of that isn’t useless”. This makes science the best thing we’ve got, by definition, though it doesn’t preclude the existence of better methods that are yet unknown to us.
Ah, I see.
Sure, if by “first” Luke meant /only/, then yes, Luke’s statement (properly interpreted) means science as we know it is the best method currently available, and essentially contradicts the rest of his post. Agreed.
Can you clarify your reasons for thinking he meant /only/ rather than /first/?
I interpreted “first” to mean “first in human history”. Since science is the method we currently use to understand the world, I assumed that no other methods were as good—otherwise, we’d be using those. Human history is ongoing, though, so we could find better methods in the future.
Luke goes on to discuss different ways of fixing science, which led me to believe that he doesn’t know of any other methods that are superior to science, either. If he did, presumably he’d be advocating that we drop science altogether, and replace it with these other methods.
It’s by no means clear that we’re generally using the best methods ever discovered. In fact, it seems unlikely on the face of it. It’s not especially uncommon for an earlier method to become popular enough that later superior methods fail to displace it in the popular mind.
That aside, at this point it kind of sounds like the main thing we’re disagreeing about is whether the two different things under discussion in Luke’s post are both properly labelled “science” or not. In which case I’m happy to concede the point and adopt your labels. In which case everyone involved agrees that we should use science and the interesting discussion is about what techniques science ought to comprise.
Can you list some examples?
Good point. I guess it all depends on whether the changes Luke proposes should count as reforming science, or as replacing it with an entirely new methodology. I don’t think that his changes go far enough to constitute a total replacement; after all, he even titled the article “How to Fix Science”, not “Science is Dead, Let’s Replace It”.
In addition, I think that the changes he proposes to fix publication bias and NHST are incremental rather than entirely orthogonal to the way science is done now (I admit that I’m not sure what to make of the “experimenter bias” section). But it sounds like you might disagree...
Just to pick one we’re discussing elsethread, billions of people around the world continue to embrace traditional religious rituals as a mechanism for improving their personal lot in life and interpreting events around them, despite the human race’s discovery of superior methods for achieving those goals.
Yes, agreed, that’s basically what our disagreement boils down to. As I said before, if we can agree about what changes ought to be made, I simply don’t care whether we call the result “reforming science” or “replacing science with something better.”
So I’m happy to concede the point and adopt your labels: we’re talking about incrementally reforming science.
Fair enough, I think I misinterpreted what you meant by “popular mind”. That said, though, all the people whose primary job description is to understand the natural world are currently using science as their method of choice; at least, all the successful ones are.
In that case, we have no fundamental disagreements; incremental improvement is always a good thing (as long as it’s actually an improvement, of course). The one thing I would disagree with Luke (and, presumably, yourself) about is the extent to which “science is broken”. I think that science works reasonably well in its current form—though there’s room for improvement—whereas Luke seems to believe that science has hit a dead end. On the other hand, the changes he proposes are fairly minor, so perhaps I am simply misinterpreting his tone.
Fair enough. I agree that if by “popular mind” we mean successful professional understanders of the natural world, then my assertion that science has not displaced religion in the popular mind as a preferred mechanism for understanding the world is at the very least non-obvious, and likely false. That seems an unjustified reading of that phrase to me, but that doesn’t matter much.
I can’t decide if we disagree on whether “science works reasonably well in its current form”, as I don’t really know what that phrase means. Even less can I decide whether you and Luke disagree on that.
I wanted to express something like this: let’s imagine that we (we humans, that is) implement none of Luke’s reforms, nor any other reforms at all. Scientific journals continue to exist in their current form; peer review keeps working the same way as it does now, everyone uses p-values, etc. Under this scenario, what percentage of currently unsolved scientific problems (across all disciplines) can we expect to become solved within the next 100 years or so?
If science is truly broken, then the answer would be, “close to zero”. If science works very well, we can expect an answer something like “twice the percentage that was solved during the last 100 years”, or possibly “all of them (though many new problems will be discovered)”. If science works ok, but is not as efficient as it could be, we could expect an answer somewhere between these two extremes. I personally believe that this latter scenario is closest to the truth.
Well, I certainly agree that in the absence of adopting any reforms to how science is done, we should expect some percentage between “close to zero” and “all” currently unsolved problems to be solved in the next century.
If I adopt your stricter measure of between epsilon and 2N, where N is the percentage of problems solved in the last century, I still agree that we should expect some such threshold to be met or exceeded in the next century.
If that implies that science is not broken, then I agree that science is not broken.
My claim is that, in the absence of adopting any reforms to how science is done, I would still expect this percentage to be much closer to 2N than to zero. I interpret the statement “science is broken” as saying, “the percentage will be epsilon”, and thus I do not believe that science is broken.
Of course, if you interpret “science is broken” to mean “science isn’t moving quite as fast as it could be”, then I’d probably agree with you.
I would agree with that claim as well.
With respect to how I interpret that phrase, the honest answer is that as with most such terminological disputes, I mostly don’t think it matters.
Put a different way… if I can choose between two systems for arriving at useful beliefs about the world, S1 and S2, and S1 is measurably more efficient at converting resources into useful beliefs, then all else being equal, I should adopt S1. Whether the labels “science” and/or “broken” properly apply to S1 and/or S2 doesn’t change that, nor AFAICT does it change anything else I care about.
The OP laid out some differences between two systems, one of which is science as done today, and suggested that the other system was measurably more efficient at converting resources into useful beliefs.
Back at the start of this exchange, I thought you were taking issue with that suggestion. As near as I can figure out at this point, I was simply incorrect; your concerns lie entirely with whether the other system should be labelled “science” and whether the first system should be labelled “broken”. I honestly don’t care… I think it’s important to have consistent definitions for these terms if we’re going to use them at all, but now that you’ve provided clear definitions I’m happy to use yours. It follows that both systems are science and neither is broken.
I would also say that, while the other system is indeed “more efficient at converting resources into useful beliefs”, it’s not so very different from the original system, both in terms of structure and in terms of performance. Thus, unlike (I think) Luke, I see no particular burning need to drop everything we’re doing and begin the conversion process.
Not necessarily, since humans aren’t perfectly rational.
I don’t think it’s ironic. Assuming that the science cited is representative of science in general: if science isn’t broken, then science saying that science is broken means that science is broken (because it’s in a logically impossible epistemic state). If science is broken, then science is broken. So in either case, science saying that science is broken means that science is broken. Of course the cited science isn’t necessarily representative of science in general, but that takes the sting out of the irony.