Too good to be true
A friend recently posted a link on his Facebook page to an informational graphic about the alleged link between the MMR vaccine and autism. It said, if I recall correctly, that out of 60 studies on the matter, not one had indicated a link.
Presumably, with 95% confidence.
This bothered me. What are the odds, supposing there is no link between X and Y, of conducting 60 studies of the matter, and of all 60 concluding, with 95% confidence, that there is no link between X and Y?
Answer: .95 ^ 60 = .046. (Use the first term of the binomial distribution.)
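As a quick check of that figure, here is a minimal Python sketch. It assumes, as the calculation above does, that the 60 studies are independent and that each has exactly a 5% false-positive rate; the function name is made up for illustration.

```python
from math import comb

def prob_k_false_positives(n_studies: int, k: int, alpha: float = 0.05) -> float:
    """Binomial probability that exactly k of n independent studies of a
    true null cross the significance threshold by chance (assumed rate: alpha)."""
    return comb(n_studies, k) * alpha**k * (1 - alpha) ** (n_studies - k)

# First term of the binomial distribution: zero false positives in 60 studies.
p_all_negative = prob_k_false_positives(60, 0)
print(f"{p_all_negative:.3f}")  # 0.046
```

Passing k = 1, 2, ... gives the later terms, if you want the chance of seeing exactly one or two stray positive results instead.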
So if it were in fact true that 60 out of 60 studies failed to find a link between vaccines and autism at 95% confidence, this would prove, with 95% confidence, that studies in the literature are biased against finding a link between vaccines and autism.
In reality, you should adjust your literature survey for known biases of literature. Scientific literature has publication bias, so that positive results are more likely to be reported than negative results.
They also have a bias from errors. Many articles have some fatal flaw that makes their results meaningless. If the distribution of errors is random, I think—though I’m not sure—that we should assume this bias causes regression towards an equal likelihood of positive and negative results.
Given that both of these biases should result, in this case, in more positive results, having all 60 studies agree is even more incredible.
So I did a quick mini-review this morning, looking over all of the studies cited in 6 reviews on the results of studies on whether there is a connection between vaccines and autism:
National Academies Press (2004). Immunization safety review: Vaccines and autism.
National Academies Press (2011). Adverse effects of vaccines: Evidence and causality.
American Academy of Pediatrics (2013): Vaccine safety studies.
The current AAP webpage on vaccine safety studies.
The Immunization Action Coalition: Examine the evidence.
Taylor et al. (2014). Vaccines are not associated with autism: an evidence-based meta-analysis of case-control and cohort studies. Vaccine Jun 17;32(29):3623-9. Paywalled, but references given here.
I listed all of the studies that were judged usable in at least one of these reviews, removed duplicates, then went through them all and determined, either from the review article or from the study’s abstract, what it concluded. There were 39 studies used, and all 39 failed to find a connection between vaccines and autism. 4 studies were rejected as methodologically unsound by all reviews that considered them; 3 of the 4 found a connection.
(I was, as usual, irked that if a study failed to prove the existence of a link given various assumptions, it was usually cited as having shown that there was no link.)
I understand that even a single study indicating a connection would immediately be seized on by anti-vaccination activists. (I’ve even seen them manage to take a study that indicated no connection, copy a graph in that study that indicated no connection, and write an analysis claiming it proved a connection.) Out there in the real world, maybe it’s good to suppress any such studies. Maybe.
But here on LessWrong, where our job is not physical health, but mental practice, we shouldn’t kid ourselves about what the literature is doing. Our medical research methodologies are not good enough to produce 39 papers and have them all reach the right conclusion. The chances of this happening are only .95 ^ 39 = 0.13, even before taking into account publication and error bias.
Note: This does not apply in the same way to reviews that show a link between X and Y
If the scientific community felt compelled to revisit the question of whether gravity causes objects to fall, and conducted studies using a 95% confidence threshold comparing apples dropped on Earth to apples dropped in deep space, we would not expect 5% of the studies to conclude that gravity has no effect on apples. 95% confidence means that, even if there is no link, there’s a 5% chance the data you get will look as if there is a link. It does not mean that if there is a link, there’s a 5% chance the data will look as if there isn’t. (In fact, if you’re wondering how small studies and large studies can all have 95% confidence, it’s because, by convention, the extra power in large studies is spent on being able to detect smaller and smaller effects, not on higher and higher confidence that a detected effect is real. Being able to detect smaller and smaller effects means having a smaller and smaller chance that, if there is an effect, it will be too small for your study to detect. Having “95% confidence” tells you nothing about the chance that you’re able to detect a link if it exists. It might be 50%. It might be 90%. This is the information black hole that priors disappear into when you use frequentist statistics.)
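To make the distinction concrete, the sketch below computes the false-positive rate and the detection probability (power) separately for a one-sided z-test. The effect size and sample sizes are invented for illustration, and the known-variance normal model is an assumption, not something taken from the vaccine studies.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_one_sided_z(effect: float, n: int, alpha: float = 0.05) -> float:
    """Chance that a one-sided z-test (unit-variance data, sample size n)
    rejects the null when the true mean shift is `effect`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)   # threshold fixed by alpha alone
    shift = effect * n ** 0.5                # true effect in standard-error units
    return 1 - STD_NORMAL.cdf(z_crit - shift)

for n in (25, 100, 400):
    false_positive_rate = power_one_sided_z(0.0, n)  # no effect: stays at alpha
    detection_rate = power_one_sided_z(0.2, n)       # hypothetical real effect
    print(n, round(false_positive_rate, 3), round(detection_rate, 3))
```

The first column of rates stays at 5% whatever the sample size, while the second climbs with n. That is the sense in which "95% confidence" says nothing about the chance of missing a real link.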
Critiquing bias
One plausible mechanism is that people look harder for methodological flaws in papers they don’t like than in papers that they like. If we allowed all 43 of the papers, we’d have 3 / 43 finding a link, which would still be surprisingly low, but possible.
To test this, I looked at Magnuson 2007, “Aspartame: A Safety Evaluation Based on Current Use Levels, Regulations, and Toxicological and Epidemiological Studies” (Critical Reviews in Toxicology,37:629–727). This review was the primary—in fact, nearly the only—source cited by the most-recent FDA review panel to review the safety of aspartame. The paper doesn’t mention that its writing was commissioned by companies who sell aspartame. Googling their names revealed that at least 8 of the paper’s 10 authors worked for companies that sell aspartame, either at the time that they wrote it, or shortly afterwards.
I went to section 6.9, “Observations in humans”, and counted the number of words spent discussing possible methodological flaws in papers that indicated a link between aspartame and disease, versus the number of words spent discussing possible methodological flaws in papers that indicated no link. I counted only words suggesting problems with a study, not words describing its methodology.
224 words were spent critiquing 55 studies indicating no link, an average of 4.1 words per study. 1375 words were spent critiquing 24 studies indicating a link, an average of 57.3 words per study.
(432 of those 1375 words were spent on a long digression arguing that formaldehyde isn’t really carcinogenic, so that figure goes down to only 42.9 words per positive-result study if we exclude that. But that’s… so bizarre that I’m not going to exclude it.)
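For anyone who wants to redo the headline arithmetic, a minimal sketch using only the word and study counts reported above (the variable names are mine):

```python
# Word counts and study counts as reported in the post.
words_no_link, studies_no_link = 224, 55
words_link, studies_link = 1375, 24

per_no_link = words_no_link / studies_no_link   # critique words per no-link study
per_link = words_link / studies_link            # critique words per link study

print(f"{per_no_link:.1f} words per no-link study")   # 4.1
print(f"{per_link:.1f} words per link study")         # 57.3
print(f"ratio: {per_link / per_no_link:.1f}x")        # 14.1x
```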
OK so I got interested in this strong claim Phil is making and went to look at the original study he is critiquing so strenuously.
But there is no link to that original study or infographic or whatever.
I don’t think there is any value to a strenuous criticism of a study or result when there is no link to that study in the critique.
I tried Google and found this image based on this bibliography. It took me a bit to figure out that it doesn’t match Phil’s description.
As for Phil’s 60 studies, the fact that he gathered his 39 studies from 4 papers that he does link suggests that he was not able to find any actual list of 60 studies. It doesn’t matter whether the 39 studies come from the same bibliography or not. Either the 4 reviews are, on average, biased in their review of the literature, or the literature is itself subject to publication bias.
And of course, whenever Phils of this world encounter the example of results not being slightly too good to be true, they’re just as likely to write an LW post about that.
Boy, it’s a real pity that there’s no research into excess significance in which various authors do systematic samples of large numbers of papers to get field-wide generalizations and observations about whether this is a common phenomenon or not. As it stands, we have no idea whether Phil has cherry-picked a rare phenomenon or not.
Such a pity.
Well, I don’t see anyone writing about e.g. physics results not being too good to be true, or government-sponsored pharmaceutical studies not being too good to be true etc. Nor would it be particularly rare to obtain that sort of result anyway.
Well, more generally, people do apply that sort of reasoning when they are skeptical of improbable results. Most people’s reaction (especially on LW) to the neutrino FTL result, for example, was that the result was simply wrong, regardless of how many measurements had been taken.
I’m not really familiar with how significance-testing is used in physics, but at least under the six-sigma level of alpha, it would take an enormous number of studies of a null hypothesis before the lack of statistical-significance would become ‘too good to be true’.
Then maybe you should look instead of talking out of your ass. People talk about problems with clinical trials all the time, and pharmaceutical & medicine in general is the home stomping grounds for a lot of meta approaches like excess significance.
Physics is very diverse. There’s those neutrino detectors which detect and fail to detect rare events, for example.
Yes, and they don’t seem to talk much about non problems.
OK, so? Do they impose six-sigmas on the total result, subdivisions, or what?
Yes, because almost all clinical trials stink. Publication bias is pervasive, and the methodological problems are almost universal. When you read through, say, Cochrane meta-analyses or reviews, it’s normal to find that something like 90%+ of studies had to be discarded because they lacked such basic desiderata as ‘blinding’ or ‘randomization’ or simply didn’t specify important things like sample sizes or intent-to-treat. That people are willing to cite studies at all is ‘talking about non problems’.
Presumably? I checked the definition of presumably:
So you take this uncertain confidence level of 95% and find:
OK so you presumed 95% confidence level and showed that that confidence level is inconsistent with unanimity across 60 studies.
Assuming the studies are good, what confidence level would be consistent with unanimity?
Answer: .99^60 = 54%
So from this result we conclude either 1) there is a problem with at least some of the studies or 2) there is a problem with the presumption of 95% confidence level, but a 99% confidence level would work fine.
For this post to have positive value, the case for picking only conclusion 1 above, and not considering conclusion 2, needs to be made. If the 95% confidence level is in fact EXPLICIT in these studies, then that needs to be verified, and the waffle-word “presumably” needs to be removed.
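The same question can be run in reverse: how high would the per-study confidence level have to be before unanimity stops looking odd? A minimal sketch, assuming independent studies that all share one confidence level (the function names and the 50% target are my own choices):

```python
def prob_unanimous(confidence: float, n_studies: int = 60) -> float:
    """Chance that all n independent studies of a true null come up negative."""
    return confidence ** n_studies

def confidence_needed(target: float, n_studies: int = 60) -> float:
    """Per-study confidence at which unanimity has probability `target`."""
    return target ** (1 / n_studies)

print(f"{prob_unanimous(0.95):.3f}")     # 0.046
print(f"{prob_unanimous(0.99):.3f}")     # 0.547
print(f"{confidence_needed(0.5):.4f}")   # 0.9885
```

So under these assumptions, anything from roughly 98.9% confidence upward makes 60 out of 60 more likely than not.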
Is there any reason at all to think that these medical studies didn’t use 95%? The universal confidence level, used pretty much everywhere in medicine and psychology except in rare subfields like genomics, so universal that authors of papers typically don’t even bother to specify or justify the confidence level?
There’s all sorts of things one has to control for, e.g. parents’ age, that may inflate the error bars (if the error in imperfectly controlling for a confounder is accounted for), putting zero within the error bars. Without looking at all the studies one can’t really tell.
Some studies ought to also have a chance of making a superfluous finding that ‘vaccines prevent autism’, but apparently that was not observed either.
What does that have to do with whether the researchers followed the nigh-universal practice of setting alpha to 0.05?
Example: I am measuring radioactivity with a Geiger counter. I have statistical error (with the 95% confidence interval), but I also have systematic error (e.g. the Geiger counter’s sensitivity is ‘guaranteed’ to be within 5% of a specified value). If I am reporting an unusual finding, I’d want the result not to be explainable by the sum of statistical error and the bound on the systematic error. Bottom line is, generally there’s no guarantee that “95% confidence” findings will go the other way 5% of the time. It is perfectly OK to do something that may inadvertently boost the confidence.
I’d love to see a paper get published that justified the confidence level with “because if I wanted to do rigorous science I would have studied physics” or “because we only have enough jelly beans to run 30 studies, will only be given more jelly beans if we get a positive result and so need to be sure”.
Suppose there were 60 studies that showed no correlation between autism and vaccines at a 99% confidence level. Then it would not be particularly surprising that there were indeed 60 studies with that result.
Would you expect the authors to point out that their result was actually 99% confident even though the usual standard, which they were not explicitly claiming anyway, was 95%?
retracted
That part was just him noticing his confusion. The only way to figure out what the real confidence levels were would be to try and find the studies, which is exactly what he did.
I read his post twice and I still don’t see him having figured out the real confidence levels or claiming to have.
edit: besides, Phil’s own claims don’t even meet the 95% confidence, and god only knows out of how big of a pool he fished this bias example from, and how many instances of ‘a few studies find a link but most don’t’ he ignored until he came up with this.
I don’t think the “95% confidence” works that way. It’s a lower bound: you never try to publish anything with lower than 95% confidence (and if you do, your publication is likely to be rejected), but you don’t always need to have exactly 95% (2 sigma).
Hell, I play enough RPGs to know that rolling 1 or 20 in a d20 is frequent enough ;) 95% is quite low confidence, it’s really a minimum at which you can start working, but not something optimal.
I’m not sure exactly in medicine, but in physics it’s frequent to have studies at 3 sigma (99.7%) or higher. The detection of the Higgs boson by the LHC for example was done within 5 sigma (one chance in a million of being wrong).
Especially in a field with a high risk of data being abused by ill-intentioned people, such as the “vaccine and autism” link, it would really surprise me if everyone just happily kept to 95% confidence and didn’t aim for much higher confidence.
Careful! That’s a one chance in a million of a fluke occurring (given the null hypothesis). Probability of being wrong is P(~H1 | 5 sigma) rather than P(5 sigma | H0), and on the whole unmeasurable. :)
Okay. Be surprised. It appears that I’ve read hundreds of medical journal articles and you haven’t.
Medicine isn’t like physics. The data is incredibly messy. High sigma results are often unattainable even for things you know are true.
Was that “exactly 95% confidence” or “at least 95% confidence”?
Also, different studies have different statistical power, so it may not be OK to simply add up their evidence with equal weights.
p-values are supposed to be distributed uniformly from 0 to 1 conditional on the null hypothesis being true.
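A small simulation illustrates the claim. This is only a sketch under simple assumptions (pure-noise normal data with known unit variance, a one-sided z-test, made-up study counts), not a model of any real study:

```python
import random
from statistics import NormalDist, fmean

def simulate_null_p_values(n_studies=10_000, sample_size=30, seed=0):
    """One-sided p-values from z-tests on pure-noise data (the null is true)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        z = fmean(sample) * sample_size ** 0.5   # known unit variance assumed
        p_values.append(1 - std_normal.cdf(z))
    return p_values

p_values = simulate_null_p_values()
for cutoff in (0.05, 0.5, 0.95):
    share = sum(p < cutoff for p in p_values) / len(p_values)
    print(f"P(p < {cutoff}) ~ {share:.3f}")
```

Each printed share should land close to its cutoff, which is what a uniform distribution on [0, 1] looks like.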
No; it’s standard to set the threshold for your statistical test for 95% confidence. Studies with larger samples can detect smaller differences between groups with that same statistical power.
“Power” is a statistical term of art, and its technical meaning is neither (1 - alpha) nor 1 - p.
Oops; you’re right. Careless of me; fixed.
It’s times like this that I wish Doctor Seuss was a mathematician (or statistician in this case). If they were willing to make up new words, we’d be able to talk without accidentally using jargon that has technical meaning we didn’t intend.
I’m confused about how this works.
Suppose the standard were to use 80% confidence. Would it still be surprising to see 60 of 60 studies agree that A and B were not linked? Suppose the standard were to use 99% confidence. Would it still be surprising to see 60 of 60 studies agree that A and B were not linked?
Also, doesn’t the prior plausibility of the connection being tested matter for attempts to detect experimenter bias this way? E.g., for any given convention about confidence intervals, shouldn’t we be quicker to infer experimenter bias when a set of studies conclude (1) that there is no link between eating lithium batteries and suffering brain damage vs. when a set of studies conclude (2) that there is no link between eating carrots and suffering brain damage?
“95% confidence” means “I am testing whether X is linked to Y. I know that the data might randomly conspire against me to make it look as if X is linked to Y. I’m going to look for an effect so large that, if there is no link between X and Y, the data will conspire against me only 5% of the time to look as if there is. If I don’t see an effect at least that large, I’ll say that I failed to show a link between X and Y.”
If you went for 80% confidence instead, you’d be looking for an effect that wasn’t quite as big. You’d be able to detect smaller clinical effects—for instance, a drug that has a small but reliable effect—but if there were no effect, you’d be fooled by the data 20% of the time into thinking that there was.
It would if the papers claimed to find a connection. When they claim not to find a connection, I think not. Suppose people decided to test the hypothesis that stock market crashes are caused by the Earth’s distance from Mars. They would gather data on Earth’s distance from Mars, and on movements in the stock market, and look for a correlation.
If there is no relationship, there should be zero correlation, on average. That (approximately) means that half of all studies will show a negative correlation, and half will have positive correlation.
They need to pick a number, and say that if they find a positive correlation above that number, they’ve proven that Mars causes stock market crashes. And they pick that number by finding the correlation just exactly large enough that, if there is no relationship, it happens 5% of the time by chance.
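Here is a rough simulation of how such a cutoff could be found. Everything in it is hypothetical: the "Mars distance" and "market movement" series are just independent random noise, and 50 data points per study is an arbitrary choice.

```python
import random
from statistics import fmean

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def null_correlations(n_points=50, n_studies=4000, seed=1):
    """Correlations from studies in which the two series are truly unrelated."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_studies):
        mars = [rng.gauss(0, 1) for _ in range(n_points)]
        market = [rng.gauss(0, 1) for _ in range(n_points)]
        results.append(pearson_r(mars, market))
    return results

rs = sorted(null_correlations())
threshold = rs[len(rs) * 95 // 100]   # exceeded by roughly 5% of null studies
share_positive = sum(r > 0 for r in rs) / len(rs)
print(f"one-sided 5% threshold: r > {threshold:.3f}")
print(f"share of null studies with positive r: {share_positive:.2f}")
```

About half of the simulated null studies show a positive correlation, and the cutoff is simply the value that only about 5% of them exceed. In practice the cutoff comes from a formula rather than a simulation, but the idea is the same.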
If the proposition is very very unlikely, somebody might insist on a 99% confidence interval instead of a 95% confidence interval. That’s how prior plausibility would affect it. Adopting a standard of 95% confidence is really a way of saying we agree not to haggle over priors.
I think it is “only at most 5% of the time”.
No, we are choosing the effect size before we do the study. We choose it so that if the true effect is zero, we will have a false positive exactly 5% of the time.
How does this work for a binary quantity?
If your experiment tells you that [x > 45] with 99% confidence, you may in certain cases be able to confidently transform that to [x > 60] with 95% confidence.
For example, if your experiment tells you that the mass of the Q particle is 1.5034(42) with 99% confidence, maybe you can say instead that it’s 1.50344(2) with 95% confidence.
If your experiment happens to tell you that [particle Q exists] is true with 99% confidence, what kind of transformation can you apply to get 95% confidence instead? Discard some of your evidence? Add noise into your sensor readings?
Roll dice before reporting the answer?
We’re not talking about a binary quantity.
According to Wikipedia:
Quoting authorities without further commentary is a dick thing to do. I am going to spend more words speculating about the intention of the quote than are in the quote, let alone that you bothered to type.
I have no idea what you think is relevant about that passage. It says exactly what I said, except transformed from the effect size scale to the p-value scale. But somehow I doubt that’s why you posted it. The most common problem in the comments on this thread is that people confuse false positive rate with false negative rate, so my best guess is that you are making that mistake and thinking the passage supports that error (though I have no idea why you’re telling me). Another possibility, slightly more relevant to this subthread, is that you’re pointing out that some people use other p-values. But in medicine, they don’t. They almost always use 95%, though sometimes 90%.
My confusion is about “at least” vs. “exactly”. See my answer to Cyan.
You want size, not p-value. The difference is that size is a “pre-data” (or “design”) quantity, while the p-value is post-data, i.e., data-dependent.
Thanks.
So if I set size at 5%, collect the data, and run the test, and repeat the whole experiment with fresh data multiple times, should I expect that, if the null hypothesis is true, the test rejects exactly 5% of the time, or at most 5% of the time?
If the null hypothesis is simple (that is, if it picks out a single point in the hypothesis space), and the model assumptions are true blah blah blah, then the test (falsely) rejects the null with exactly 5% probability. If the null is composite (comprises a non-singleton subset of parameter space), and there is no nice reduction to a simple null via mathematical tricks like sufficiency or the availability of a pivot, then the test falsely rejects the null with at most 5% probability.
But that’s all very technical; somewhat less technically, almost always, a bootstrap procedure is available that obviates these questions and gets you to “exactly 5%”… asymptotically. Here “asymptotically” means “if the sample size is big enough”. This just throws the question onto “how big is big enough,” and that’s context-dependent. And all of this is about one million times less important than the question of how well each study addresses systematic biases, which is an issue of real, actual study design and implementation rather than mathematical statistical theory.
How does your choice of threshold (made beforehand) affect your actual data and the information about the actual phenomenon contained therein?
In your “critiquing bias” section you allege that 3⁄43 studies supporting a link is “still surprisingly low”. This is wrong; it is actually surprisingly high. If B ~ Binom(43, 0.05), then P(B > 2) ~= 0.36.*
*As calculated by the following Python code:
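The code itself did not survive in this copy. The following is a minimal reconstruction that reproduces the figure, not necessarily what was originally posted; the helper name is made up.

```python
from math import comb

def binom_tail_greater(n: int, p: float, k: int) -> float:
    """P(B > k) for B ~ Binomial(n, p), via the complement of the CDF."""
    cdf = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))
    return 1 - cdf

# Chance of 3 or more false positives among 43 studies at a 5% rate.
print(f"{binom_tail_greater(43, 0.05, 2):.2f}")  # 0.36
```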
I said “surprisingly low” because of publication & error bias.
Which way do you think publication bias on the issue goes, anyway?
I wrote a paragraph on that in the post. I predicted a publication bias in favor of positive results, assuming the community is not biased on the particular issue of vaccines & autism. This prediction is probably wrong, but that hypothesis (lack of bias) is what I was testing.
I don’t think this is likely, but one possible explanation is that vaccines prevent autism.
If that’s true, why didn’t one of the researchers publish a paper on that thesis? It should show up in the data they gathered.
Only if it’s statistically significant. It could be a small enough effect that they don’t notice unless they’re looking for it (if you’re going to publish a finding from either extreme, you’re supposed to use a two-tailed test, so they’d presumably want something stronger than p = 0.05), but large enough to keep them from accidentally noticing the opposite effect.
Or alternately, it’s a large effect but the rarity of autism and of non-vaccinated kids makes it hard to reach statistical-significance given sampling error. So let’s see, the suggestion here is that the reason so few studies threw up a false positive was that the true effect was the opposite of the alternative, vaccines reduce autism.
Autism is… what, 0.5% of the general population of kids these days? And unvaccinated kids are, according to a random Mother Jones article, ~1.8%.
So let’s imagine that vaccines halve the risk of autism down from the true 1.0% to the observed 0.5% (halving certainly seems like a ‘large’ effect to me), autism has the true base rate of 1.0% in unvaccinated, and the unvaccinated make up 1.8% of the population. If we randomly sampled the population in general, how much would we have to sample in order to detect a difference in autism rates between the vaccinated & unvaccinated?
The regular R function I’d use for this, power.prop.test, doesn’t work since it assumes balanced sample sizes, not 1.8% in one group and 98.2% in the other. I could write a simulation to do the power calculation for a prop.test, since the test itself handles imbalanced sample sizes, but then I googled and found someone had written something very similar for the Wilcoxon U-test, so hey, I’ll use the samplesize library instead; filling in the relevant values, we find that for a decent chance of detecting such a correlation of vaccination with reduced autism, it takes a total n=90k. I’m guessing that most studies don’t get near that.
Of course, a lot of that penalty is going towards picking up enough kids who are both autistic and unvaccinated, so one could do better by trying to preferentially sample either of those groups, but then one gets into thorny questions about whether one’s convenience samples are representative and biased in some way...
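For a rough cross-check that does not depend on the samplesize library, here is a sketch in Python using a textbook normal approximation for a two-sided two-proportion z-test with unequal groups. This is a different and cruder method than the Wilcoxon-based calculation above, so the numbers need not match it exactly; the function name is mine, and the rates are the hypothetical ones from this comment.

```python
from statistics import NormalDist

def approx_power_two_proportions(n_total, share_group1, rate1, rate2, alpha=0.05):
    """Normal-approximation power of a two-sided two-proportion z-test
    with unequal group sizes (the far rejection tail is ignored)."""
    n1 = n_total * share_group1
    n2 = n_total - n1
    pooled = (n1 * rate1 + n2 * rate2) / n_total
    se_null = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5   # under H0
    se_alt = (rate1 * (1 - rate1) / n1 + rate2 * (1 - rate2) / n2) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf((abs(rate1 - rate2) - z_crit * se_null) / se_alt)

# 1.8% unvaccinated with a 1.0% autism rate versus 0.5% among the vaccinated.
for n_total in (10_000, 30_000, 90_000):
    power = approx_power_two_proportions(n_total, 0.018, 0.010, 0.005)
    print(f"n = {n_total:>6}: power ~ {power:.2f}")
```

The tiny unvaccinated group dominates the standard error, which is why power stays poor until the total sample gets into the tens of thousands.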
As the original article says, if there was no effect, you’d expect a few studies to get p < 0.05 by chance. Similarly, if there was no effect, you’d expect a few studies to get p > 0.95 by chance, suggesting that vaccines prevent autism. If vaccines do prevent autism, then it would be even more likely to have p > 0.95.
Not all statistical analysis has to be preregistered. If the data showed a trend suggesting that vaccination might reduce autism, I’m sure the researchers would run a test for it.
If the study is underpowered to find an effect in that direction, it’s also likely to be underpowered to find an effect in the other direction.
Can someone with more statistical expertise run a test to see whether the studies are underpowered to pick up effects in either direction?
There is fairly extensive data (not published in the peer reviewed literature) that groups which are unvaccinated have far lower autism rates than the general public.
UPI Reporter Dan Olmsted went looking for the autistic Amish. In a community where he should have found 50 profound autistics, he found 3. The first was an adopted Chinese girl who’d had vaccinations rushed before she was shipped from China and more here on the way to the adoptive parents. The second had been normal until developing classic autism symptoms within hours of being vaccinated. The third there was no information about. http://www.putchildrenfirst.org/media/e.4.pdf
Olmsted continued his search for unvaccinated Amish with autism beyond that community, finding none for a long time, but eventually found a Doctor in Virginia who had treated 6 unvaccinated Amish people from various places with autism. 4 of them had very elevated levels of mercury.
A telephone survey commissioned by the nonprofit group Generation Rescue compared vaccinated with unvaccinated boys in nine counties of Oregon and California [15]. The survey included nearly 12,000 households with children ranging in ages from 4 to 17 years, including more than 17,000 boys among whom 991 were described as completely unvaccinated. In the 4 to 11 year bracket, the survey found that, compared with unvaccinated boys, vaccinated boys were 155% more likely to have a neurological disorder, 224% more likely to have ADHD, and 61% more likely to have autism. For the older boys in the 11-17 year bracket, the results were even more pronounced: 158% more likely to have a neurological disorder, 317% more likely to have ADHD, and 112% more likely to have autism. [15]
In addition to the Generation Rescue Survey, there are three autism-free oases in the United States. Most publicized are Amish communities, mainly studied in Ohio and Pennsylvania [16]. The Amish are unique in their living styles in largely self-sustaining communities. They grow their own food. Although they have no specific prohibitions against medical care, very rarely do they vaccinate their children. In local medical centers available to the Amish, most centers reported that they had never seen an Amish autistic child. The only Amish children that were seen as a rule were those with congenital disorders such as fragile X. The one autistic Amish child that was discovered during the surveys was taken to a medical office for an ear infection where the child was incidentally vaccinated, probably without the mother’s consent.
The second is the Florida-based medical practice of Dr. Jeff Bradstreet. While treating several thousand autistic children in his practice, Bradstreet has observed that “there is virtually no autism in home-schooling families who decline to vaccinate for religious reasons” [17]
The third, the “Homefirst Health Services” located in Chicago, has a virtual absence of autism among the several thousand patients that were delivered at home by the medical practice, and remained non-vaccinated according to the wishes of the parents [18].
Clusters of autistic children have also been found among parents with occupational exposures to chemicals prior to conception [19], and in children exposed prenatally to organochlorine pesticides [20].
excerpted from:
http://vactruth.com/2012/03/13/vaccines-human-animal-dna/
Reportedly the CDC has been surveying the vaccination status of the Amish for years, attempting to induce them to vaccinate (with some success I believe), and has consistently refused requests to include an autism question with their survey to gather data.
It’s probably worth noting that Seneff et al., http://www.mdpi.com/1099-4300/14/11/2265, who have identified one pathway by which vaccines might be causing autism, have also in other work argued that glyphosate may invoke the same pathway, and the same groups may also be avoiding glyphosate. http://people.csail.mit.edu/seneff/WAPF_Slides_2012/Offsite_Seneff_Handout.pdf
He went looking for autistics in a community mostly known for rejecting Science and Engineering? It ‘should’ be expected that the rate of autism is the same as in the general population? That’s… not what I would expect. Strong social penalties for technology use for many generations would be a rather effective way to cull autistic tendencies from a population.
I don’t reject the possibility there are other explanations for the observation that unvaccinated Amish have very low autism rates. I even offered one: that they also reject Glyphosate.
However, when it turns out that the rare cases of Amish with autism that are found mostly turn out to be vaccinated, or to have some other very specific and obvious cause that’s not present in the general population (high mercury), the case for vaccination being a cause becomes much much stronger.
And when you realize that other groups of unvaccinated also have low autism rates, the case becomes stronger.
And when you realize that injecting the aluminum into animal models causes behavioral deficits, and injecting vaccines into post-natal animals causes brain damage, in every study I’ve found, the case becomes stronger still.
And when you discover that the safety surveys don’t cite any empirical measurements whatsoever of the toxicity of injected aluminum in neo-nates, (or even injected aluminum in adults, for that matter), and don’t generally address the issue of aluminum at all, and don’t cite or rebut any of the many papers published in mainstream journals observing these things, or rebut or cite any of the half dozen or more epidemiological studies showing aluminum is highly correlated with autism, then I think you should conclude there is strong cognitive bias at work, if not worse.
The Amish vary greatly from one place to another. Here in Mercer County, they don’t grow much of their own food, and when they do, they can it. They do make their own milk, but they like fast food and packaged food. Storing ingredients without refrigeration, cooking fancy meals on a wood stove, and cleaning up after them with no hot running water, isn’t so simple.
Why are you responding to me? I just gave a possible explanation that I specifically said that I didn’t believe. You could post this in the main discussion to give credence to the hypothesis of the publishing bias explanation.
I could critique this if you want, although if you actually want to talk about whether or not vaccines cause autism I’d suggest posting in the open thread or starting your own post. This one is talking about publishing bias.
This seems like a big leap. 95% confidence means at least 95% confidence, right? So if I reject the “vaccines cause autism” hypothesis with p = 0.001, that makes me 95% confident and I publish?
There’s a 5% chance of having at least 95% confidence if there’s no correlation.
If there’s no correlation, p is a random number between zero and one. p = 0.001 would show that vaccines do cause autism. p = 0.999 would show that they prevent it.
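A rough simulation of that claim, as a sketch only: it assumes each study boils down to a one-sided z-test whose statistic is standard normal when there is no correlation. The function name and the study count are made up for illustration.

```python
import random
from statistics import NormalDist

def one_sided_p_values(n_studies, seed=0):
    """Simulate one-sided p-values for studies run when the null is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Assumption: under the null, each study's z-statistic is a standard normal draw.
    # Small p = "vaccines cause autism"; p near 1 = "vaccines prevent it".
    return [1.0 - std_normal.cdf(rng.gauss(0.0, 1.0)) for _ in range(n_studies)]

p_values = one_sided_p_values(100_000)
frac_harm = sum(p < 0.05 for p in p_values) / len(p_values)
frac_protect = sum(p > 0.95 for p in p_values) / len(p_values)
print(f"share with p < 0.05: {frac_harm:.3f}")
print(f"share with p > 0.95: {frac_protect:.3f}")
```

Both shares should come out near 5%, which is what "p is a random number between zero and one" implies.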
I question this assumption. The distribution of errors that people will make in public communication tracks closely what they can expect to get away with doing. Errors (and non-errors, for that matter) that would result in social sanction will be reviewed more closely before publication, when they are generated at all. If your new satellite powered surveillance technique reports that the Emperor has No Clothes you double check.
The addition of ‘noise’ to a process with publication bias will enhance the strength of that bias. It also potentially increases regression towards equal negative/positive results. It is not clear which of these opposing influences will be stronger in a given field.
I don’t think that it’s necessarily suspicious in that, a priori, I wouldn’t have a problem with 60 tests all being negative even though they’re all only 95% confident.
The reason being, depending on the nature of the test, the probability of a false negative might indeed be 5% while the probability of a false positive could be tiny. Suppose this is indeed the case and let’s consider the two cases that the true answer is either ‘positive’ or ‘negative’.
(A) if the true conclusion is ‘positive’, any test can yield a negative with 5% probability. (this test will be reported as a negative with 95% confidence, though one would expect most tests to yield the positive conclusion.)
(B) if the true conclusion is ‘negative’, any test that yields a negative will still be reported with the 95% confidence because of the possibility of case (A). Though if it is case (B), we should not expect any positive conclusion, even over 60 tests, because the false-positive rate is so low.
I have no idea if this lack of symmetry is the case for the set of MMR and autism studies. (It probably isn’t—so I apologize that I am probably accomplishing nothing but making it more difficult to argue what is likely a true intuition.)
But it is easy to think of an example where this asymmetry would apply: consider that you are searching for someone that you know well in a crowd, but you are not sure they are there. Consider a test to be looking for them over a 15 minute period, and you estimate that if they are there, you are likely to find them during that 15 minute period with 95% probability. Suppose they are there but you don’t find them in 15 minutes: that is a false negative with 5% probability. Suppose they are not there and you do not find them: you again say they are not there with 95% probability. But in this case where they are not there, even if you have 60 people looking for them over 15 minutes, no one will find them because the probability of a false positive is pretty much zero.
(I do see where you addressed false positives versus false negatives in several places, so this explanation was not for you specifically since I know you are familiar with this. But it is not so clear which is which in these studies from the top, and it is fleshing this out that will ultimately make the argument more difficult, but more water-tight.)
No, that 5% is the probability of false positive, not the probability of false negative. Phil has the number he needs and uses it correctly.
Which 5%?
No, “that” 5% is the probability from my cooked-up example, which was the probability of a false-negative.
You’re saying (and Phil says also in several places) that in his example the 5% is the probability of a false positive. I don’t disagree, a priori, but I would like to know, how do we know this? This is a necessary component of the full argument that seems to be missing so far.
Another way of asking my question, perhaps more clearly, is: how do we know if the 60 considered studies were testing the hypothesis that there was a link or the hypothesis that there was not a link?
There is an asymmetry that makes it implausible that the null hypothesis would be that there is an effect. The null hypothesis has to be a definite value. The null hypothesis can be zero, which is what we think it is, or it could be some specific value, like a 10% increase in autism. But the null hypothesis cannot be “there is some effect of unspecified magnitude.” There is no data that can disprove that hypothesis, because it includes effects arbitrarily close to zero. But that can be the positive hypothesis, because it is possible to disprove the complementary null hypothesis, namely zero.
Another more symmetric way of phrasing it is that we do the study and compute a confidence interval, that we are 95% confident that the effect size is in that interval. That step does not depend on the choice of hypothesis. But what do we do with this interval? We reject every hypothesis not in the interval. If zero is not in the interval, we reject it. If a 10% increase is not in the interval, we can reject that. But we cannot reject all nonzero effect sizes at once.
(I realize I’m confused about something and am thinking it through for a moment.)
I see. I was confused for a while, but in the hypothetical examples I was considering, a link between MMR and autism might be missed (a false negative with 5% probability) but isn’t going to be found unless it was there (low false positive). Then Vanviver explains, above, that the canonical null-hypothesis framework assumes that random chance will make it look like there is an effect with some probability, so it is the false positive rate you can tune with your sample size.
I marginally understand this. For example, I can’t really zoom out and see why you can’t define your test so that the false positive rate is low instead. That’s OK. I do understand your example and see that it is relevant for the null-hypothesis framework. (My background in statistics is not strong and I do not have much time to dedicate to this right now.)
I think the answer to this is “because they’re using NHST.” They say “we couldn’t detect an effect at the level that random chance would give us 5% of the time, thus we are rather confident there is no effect.” But that we don’t see our 5% false positives suggests that something about the system is odd.
OK, that sounds straightforward.
How does one know that the 60 studies are these (rather than the others, e.g., those that were designed to show an effect with 95% probability, but failed to do so and thus got a negative result)?
What would have happened to results that vaccines prevent autism?
They would have been highly cited academic papers and good for the researchers who made those findings.
Yeah. I was asking a rhetorical question, actually.
When it comes to studies of vaccine side effects, there is one thing that is very worrying. When a new vaccine enters the market, there is testing for side effects. Those studies actually do find side effects, and the Centers for Disease Control should be a trustworthy source for reporting them.
It turns out that different vaccines have quite different side effects. They didn’t find any nausea, vomiting, diarrhea, or abdominal pain as a side effect of the Hepatitis A vaccine, but 1 in 4 people who take the HPV vaccine Cervarix get them. Maybe different vaccines work extremely differently and therefore it doesn’t make any sense to generalize the risks of one vaccine to the next. Maybe the studies that list the side effects are also too poor.
Even more worrying, the side effect reporting system that regular doctors use while the drug is on the market completely fails to capture the side effects that would be expected. If I remember right, it finds more than two orders of magnitude fewer side effects than would be predicted based on the studies for bringing a drug to the market.
It is generally believed that only something on the order of 1% of side effects reported to Doctors are reported by Doctors into the system, which would explain your last comment.
Stupid mathematical nitpick:
Actually, it is more correct to say that .95 ^ 39 = 0.14.
If we calculate it out to a few more decimal places, we see that .95 ^ 39 is ~0.135275954. This is closer to 0.14 than to 0.13, and the mathematical convention is to round accordingly.
Did you save a list of the p-values reported in the 39 (or 43) studies you looked at? I wonder what I’d get if I aggregated them with Fisher’s method.
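For anyone wanting to try that aggregation, here is a minimal sketch of Fisher's method using only the standard library. It assumes the p-values are independent and strictly greater than zero. The function name and the example p-values are hypothetical and are not taken from any of the studies discussed.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p_i)
    is chi-square distributed with 2k degrees of freedom under the null.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # For even degrees of freedom 2k the chi-square survival function has a
    # closed form: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total

# Made-up p-values, purely for illustration.
example_p_values = [0.40, 0.62, 0.15, 0.88]
stat, combined = fisher_combined_p(example_p_values)
print(f"chi-square statistic: {stat:.3f}, combined p: {combined:.3f}")
```

With a single p-value the method returns that p-value unchanged, which is a quick sanity check on the closed form.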
Well, in light of how the modern scientific process produces a bias against contrary views, the activists’ seizing on any studies drawing contrary conclusions appears to be rational. To put it another way, if the process is strongly biased against studies that reach contrarian conclusions, any study reaching a contrarian conclusion that survives the process is much stronger evidence than a study that reaches the mainstream conclusion.
First thing, if you put something in your body, it has some effect, even if that effect is small. “No effect” results just rule out effects above different effect sizes (both positive and negative) with high probability, and there’s no point talking about “a link” like it’s some discrete thing (you sort of jump back and forth between getting this one right and wrong).
Second, different studies will rule out different effect sizes with 95% confidence—or to put it another way, at a given effect size, different studies will have different p-values, and so your probability exercise was pretty pointless because you didn’t compare the studies’ opinions about any particular effect size, just “whatever was 95%.”
Third, I’d bet a nickel the effect sizes ruled out at 95% in all of these studies are well below the point where it would become concerning (like, say, the effect of the parents being a year older). That is, these studies all likely rule out a concerning effect size with probability much better than 95%.
My probability exercise was not about effect size. It was about the probability of all studies agreeing by chance if there is in fact no link, and so the 95% confidence is what is relevant.
Again, not relevant to the point I’m making here.
Not relevant to the main point that you’re making, but relevant to your parenthetical:
Vaguely interesting, but a p=.13 result in pretty much any field gets a big fat “meh” from me. Sometimes p=.13 results happen. I’d want stronger evidence before I started to suspect bias.
Excess significance and publication bias in general are so common as to be the default; p=0.13 is pretty bad looking (with a single-tailed test, that’d be below the 0.10 threshold Ioannidis suggests for publication bias tests due to their low power to detect bias).
Simple statistics, but eye-opening. I wonder if gwern would be interested enough to do a similar analysis, or maybe he already has.
Goetz is re-inventing a meta-analytic wheel here (which is nothing to be ashamed of). It certainly is the case that a body of results can be too good to be true. To Goetz’s examples, I’ll add acupuncture, but wait, that’s not all! We can add everything to the list: “Do Certain Countries Produce Only Positive Results? A Systematic Review of Controlled Trials” is a fun** paper which finds
‘Excess significance’ is not a new concept (fun fact: people even use the phrase ‘too good to be true’ to summarize it, just like Goetz does) and is a valid sign of bias in whatever set of studies one is looking at, and as he says, you can treat it as a binomial to calculate the odds of n studies failing to hit their quota of 5% false positives and instead delivering 0% or whatever. But 5% here is just the lower bound, you can substantially improve by taking into account statistical power, this is how Schimmack’s ‘incredibility index’ basically works*. More recent is the p-curve approach, but I don’t understand that as well.
To some extent, you can also diagnose this problem in funnel plots: if studies-datapoints clump ‘too tightly’ within the cone of precision vs significance and you don’t see any small/low-power studies wandering over into the ‘bad’ area of point-estimates where random noise should be bouncing at least some of them, then there’s something funny going on with the data.
* I say a bit because Schimmack intends his II for use in psychology papers of the sort which report, say, 5 experiments testing a particular hypothesis, and mirabile dictu, all 5 support the authors’ theory.
Now, if we considered only false positives, the odds of all 5 not being false positives is 0.95^5 or 77.4% - so 5 positives isn’t especially damning, nothing like 60 papers all claiming positive results. But we can do better, by looking at the other kind of error.
Schimmack points out that you can look instead at the other side of the coin from alpha/false positives: statistical power, the odds of finding a statistically-significant result assuming the effect actually exists. Given that experiments usually have low power like 50%, that means half of the paper’s experiments should have ‘failed’ even if they were right, so now we ask instead, ‘since half the experiments should have failed even in the best case that we’re testing a true hypothesis, how likely are these results of all 5 succeeding?’ then the calculation is 0.5^5 or 3% - so their results are truly incredible!
(If I understand the logic of NHST correctly, 5% is merely the guaranteed lower bound of error, due to the choice of 0.05 for alpha. But unless every experiment is run with a billion subjects and has statistical power of 100%, the real percentage of ‘failed’ studies should be much higher, with the exact amount based on how bad the power is.)
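The arithmetic in this footnote can be checked with a few lines. This is only a sketch: it assumes the studies are independent, and the helper name is invented for illustration.

```python
def prob_all_agree(per_study_prob, n_studies):
    """Probability that n independent studies all land on the same outcome."""
    return per_study_prob ** n_studies

# Alpha side: chance that none of 5 experiments is a false positive at alpha = 0.05.
no_false_positives_5 = prob_all_agree(0.95, 5)
# Power side: chance that all 5 experiments succeed when each has 50% power.
all_succeed_power_50 = prob_all_agree(0.50, 5)
# The post's case: 60 studies at 95% confidence, none showing a false positive.
no_false_positives_60 = prob_all_agree(0.95, 60)

print(f"0.95^5  = {no_false_positives_5:.3f}")
print(f"0.50^5  = {all_succeed_power_50:.3f}")
print(f"0.95^60 = {no_false_positives_60:.3f}")
```

Swapping in a different assumed power shows how quickly "all experiments succeeded" becomes implausible as power drops.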
** Did I say ‘fun’? I actually meant, ‘incredibly depressing’ and ‘makes me fear for the future of science if so much cargo cult science can be done in non-Western countries which have the benefit of centuries of scientific work and philosophy and many of whose scientists trained in the West, and yet somehow, it seems that the spirit of science just didn’t get conveyed, and science there has been corrupted into a hollow mockery of itself, creating legions of witch-doctors who run “experiments” and write “papers” and do “statistics” none of which means anything’.
Science is not a magic bullet against bad incentives. I am more optimistic, we are getting a lot done despite bad incentives.
But none of the incentives seem particularly strong there. It’s not offensive to any state religion, it’s not objectionable to local landlords, it’s not a subversive creed espoused by revolutionaries who want to depose the emperor. The bad incentives here seem to be small bureaucratic ones along the line of it being easier to judge academics for promotion based on how many papers they publish. If genuine science can’t survive that and will degenerate into cargo cult science when hit by such weak incentives...
People respond strongly to this in the West also—“least publishable units”, etc.
This is almost mystical wording. There is bad science in the West, and good science in the East. I would venture to guess that the crappy state of science in e.g. China is just due to the weak institutions/high corruption levels in their society. If you think you can get away with dumping plastic in milk, a little data faking is the least of your problems. As that gets better, science will get better too.
And yet, at least clinical trials fail here, and we don’t have peer-review rings being busted or people throwing bales of money out the window as the police raid them for assisting academic fraud. (To name some recent Chinese examples.)
Again, what incentives? If science cannot survive some ‘weak institutions’ abroad, which don’t strike me as any worse than, say, the Gilded Age in America (and keep in mind the relative per capita GDPs of China now and, say, the golden age of German science before WWII), how long can one expect it to last?
It’s gesturing to society-wide factors of morality, values, and personality, yes, since it doesn’t seem to be related to more mundane factors like per capita GDP.
Japan is a case in point here. Almost as bad as China on the trial metric despite over a century of Western-style science and a generally uncorrupt society which went through its growing pains decades ago.
That explains China and Russia/USSR, it doesn’t explain Japan and Taiwan.
The study was looking at English texts, not Russian, Chinese, or Japanese texts.
edit: a study on foreign language bias in German speaking countries.
And that’s Germans, for whom it is piss easy to learn English (compared to Russians, Chinese, or Japanese).
Why did you omit the part where a third of the sample was published in both English and German, and hence weakens the bias? (That is comparable to the overlap for Chinese & English publications.)
There’s something that just didn’t get conveyed: English language. That paper, with its idiot finding, was looking at the studies downloaded from Medline and presumably published in English, or at least with an English abstract (the search was done for English terms and no translation efforts were mentioned).
As long as researchers retain freedom to either write their study up in English or not there’s going to be an additional publication-in-a-very-foreign-language bias.
With regards to acupuncture, one thing that didn’t happen is the Soviet Union being full of acupuncture centres and posters about the awesomeness of acupuncture everywhere on the walls, something that would have happened if there was indeed such a high prevalence of positive findings in locally available literature.
As a rule of thumb, I would say that any research published after the early 1990s in a language other than English is most likely crap.
Why do you think it changed, and in the early 1990s specifically? (The original study I posted only examined ’90s papers and so couldn’t show any time-series like that, so it can’t be why you think that.)
I suppose that before the 1990s respectable Soviet scientists published primarily in Russian.
Yes, but it’s not sufficient to explain the results. To use your German example, even a doubling of significance rates in vernacular vs English doesn’t give one ~100% success rate in evaluating treatments since their net success rate across the 3 categories is going to be something like 40%. Nor is publishing in English going to be a rare and special event, regardless of how hard English is to learn, because publishing in high-impact English-language journals is part of how Chinese universities are ranked and people are rewarded.
Uh huh. But acupuncture is not part of the Russian cultural heritage. What I do see instead is, to name one example (what with not being a Russian and familiar with the particular pathologies of Russian science), tons of bogus nootropics studies (they come up on /r/nootropics periodically as people discover yet another translated abstract on Pubmed of a sketchy substance cursorily tested in animals), because interest in human enhancement is part of Russian culture.
Unsurprisingly, pseudo-medicine and pseudo-science will vary by region—which is, after all, the point of comparing acupuncture studies in the West to studies in East Asia! (If there were millions of acupuncture fanatics in Russia and the UK and the USA just like in China/Korea/Japan, then what would we learn, exactly, from comparing studies?) We expect there to be regional differences and that the West will be less committed & more disinterested than East Asia, closer to the ground truth, and hence the difference gives us a lower bound on how big the biases are.
Publication in general doesn’t have to be rare and special, only the publications of negative results has to be uncommon. People just care less about publishing negative results and prefer to publish positive results; if there’s X amount of effort for publication in a foreign language, and the positive studies already use up all of the X, no X is left for negative results… There’s other issues, e.g. how many of those tests were re-testing simple, effective FDA-approved drugs and such?
Also, for the Soviet Union, there would be a certain political advantage in finding no efficacy of drugs that are expensive to manufacture or import. And one big aspect of Soviet backwardness was always the disbelief that something actually works.
Even assuming that the publications always found whatever the experimenter wanted to find, it wouldn’t explain that predominantly an effect is found. What of the chemical safety studies? There’s a very strong bias to fail to disprove the null hypothesis.
Yet your paper somehow found a ridiculously high positive rate for acupuncture. The way I think it would work, well, first thing first it’s very difficult to blind acupuncture studies and inadequately blinded experiments should find positive result from the placebo effect, secondarily, because that’s the case, nobody really cares about that effect, and thirdly, de-facto the system did not result in construction of acupuncture centres.
I haven’t really noticed nootropics being a big thing, and various rat maze studies were and are largely complete crap anyway. To the point that the impact of the experimenter’s gender was only discovered recently.
edit: also if we’re looking at Russia from 1991 to 1998, that was the time when scientists and other such government employees were literally not getting paid their wages. I remember that time, my parents were not paid for months at a time, they were reselling shampoo on the side to get some cash.
I realize that, and I’ve already pointed out why the difference in rates is not going to be that large & that your cite does not explain the excess significance in their sample.
Doesn’t matter that much. Power, usually quite low, sets the upper limit to how many of the results should have been positive even if we assume every single one was testing a known-efficacious drug (which hypothesis raises its own problems: how is that consistent with your claims about the language bias towards publishing cool new results?)
So? I don’t care why the Russian literature is biased, just that it is.
Yes, but toxicology studies being done by industry is not aimed at academic publication, and the ones aimed at academic publication have the usual incentives to find something and so are part of the overall problem.
Huh? The paper finds that acupuncture study rates vary by region: USA/Sweden/Germany 53%/59%/63%, China/Japan/Taiwan 100%, etc.
How much have you looked? There’s plenty of acupuncture centres in the USA despite a relatively low acupuncture success rate.
Does a fish notice water? But fine, maybe you don’t, feel free to supply your own example of Russian pseudoscience and traditional medicine. I doubt Russian science is a shining jewel of perfection with no faults given its 91% acupuncture success rate (admittedly on a small base).
Not sure that’s a good example, as Wikipedia seems to disagree about homebrew phage therapy not being applied: https://en.wikipedia.org/wiki/Phage_therapy#History
Anyway,
How do you see the unseen? Unless someone has done a large definitive RCT, how does one ever prove that a result was bogus? Nobody is ever going to take the time and resources to refute those shitty animal experiments with a much better experiment. Most scientific findings never get that sort of black-and-white refutation, they just get quietly forgotten and buried, and even the specialists don’t know about them. Most bad science doesn’t look like Lysenko. Or look at evidence-based medicine in the West: rubbish medicine doesn’t look like a crazy doc slicing open patients with a scalpel, it just looks like regular old medicine which ‘somehow’ turns up no benefit when rigorously tested and is quietly dropped from the medical textbooks.
To diagnose bad science, you need to look at overall metrics and indirect measures—like excess significance. Like 91% of acupuncture studies working.
Well, humans do notice air some of the time. (SCNR.)
If you want to persist in your mythical ideas regarding western civilization by postulating whatever you need and making shit up, there’s nothing I or anyone else can do about it.
Your study is making a more specific claim than mere bias in research, it’s claiming bias in one particular direction.
The point is that the SU was, mostly, using antibiotics (once production was set up, i.e. from some time after ww2).
Well, and there weren’t plenty in the Soviet Union despite the supposedly higher success rate.
If you don’t know correct rate you can’t tell which specific rate is erroneous. It’s not realistically possible to construct a blind study of acupuncture, so, unlike, say, homoeopathy, it is a very shitty measure of research errors.
I really doubt that 91% of Russian language acupuncture studies published in Soviet Union found a positive effect (I dunno about 1991-1998 Russia, it was fucked up beyond belief at that time), and I don’t know how many studies should have found a positive effect (followed by a note that more adequate blinding must be invented to study it properly).
And we know that whatever was the case, there was no Soviet abandonment of normal medicine in favour of acupuncture—the system somehow worked out ok in the end.
That’s not a reply to what I wrote.
Yes, that’s what a bias is. A systematic tendency in one direction. As opposed to random error.
And before that, they were using phages despite apparently pretty shaky evidence it was anything but a placebo. That said, pointing out the systematic bias of Russian science (among many other countries, and I’m fascinated, incidentally, how the only country you’re defending like this is… your own. No love for Korea?) does not commit me to the premise that phages do or do not work—you’re the one who brought them up as an example of how excellent Russian science is, not me.
How many are there now? Shouldn’t you have looked that up?
Difference in rates is prima facie evidence of bias, due to the disagreement. If someone says A and someone else says not-A, you don’t need to know what A actually is to observe the contradiction and know at least one party is wrong.
Yes it is.
And naturally, you have not looked for anything on the topic, you just doubt it.
Strawman. No country engages in ‘abandonment of normal medicine’ - if you go to China, do you only find acupuncturists? Of course not. The problem is that you find acupuncturists sucking up resources in dispensing expensive placebos and you find that the scientific community is not strong enough to resist the cultural & institutional pressures and find that acupuncture doesn’t work, resulting in real working medicine being intermeshed with pseudomedicine.
Fortunately, normal medicine (after tremendous investments in R&D and evidence-based medicine) currently works fairly well and I think it would take a long time for it to decay into something as overall bad as pre-modern Western medicine was; I also think some core concepts like germ theory are sufficiently simple & powerful that they can’t be lost, but that would be cold comfort in the hypothetical cargo cult scenario (‘good news: doctors still know what infections are and how to fight epidemics; bad news: everything else they do is so much witch-doctor mumbojumbo based on unproven new therapies, misinterpretations of old therapies which used to work, and traditional treatments like acupuncture’).
Unless A contains indexicals that point to different things in the two cases.
(Maybe Asian acupuncturists are better than European ones, or maybe East Asians respond better to acupuncture than Caucasians for some reason, or...)
( I’m not saying that this is likely, just that it’s possible.)
I was referring to your other comment.
That’s the one I know most about, obviously. I have no clue about what’s going on in China, Korea, or Japan.
Look, it doesn’t matter if phages work or don’t work! The treatment, in favour of which there would be strong bias, got replaced with another treatment, which would have been biased against. Something that wouldn’t have happened if science systematically failed to work in such an extreme and ridiculous manner. I keep forgetting that I really really need to spell out any conclusions when arguing with you. It’s like you’re arguing that a car is missing the wheels but I just drove here on it.
Besides, the 90%+ proportion of positive results is also the case in the west
(also, in the past we had stuff like lobotomy in the west)
So why do you think your defense would not apply equally well (or poorly) to them? What’s the phage of China?
Oh wow. What a convincing argument. ‘Look, some Russians once did this! Now they do that! No, it doesn’t matter if they were right or wrong before or after!’ Cool. So does that mean I get to point to every single change of medical treatment in the USA as evidence it’s just peachy there? ‘Look, some Americans once did lobotomy! Now they don’t! It doesn’t matter if lobotomies work or don’t work!’
You didn’t drive shit anywhere.
That’s on a different dataset, covering more recent time periods, which, as the abstract says, still shows serious problems in East Asia (compromised by relatively small sample: trying to show trends in ‘AS’ using 204 studies over 17 years isn’t terribly precise compared to the 2627 they have for the USA) with the latest data being 85% vs 100%. And 100% significance is a ceiling, so who knows how bad the East Asian research has actually gotten during the same time period Western numbers continue to deteriorate...
You are trying to pull a “everyone and me is against you” stunt against Gwern? Do you have any idea how dumbfoundingly absurd this would sound to most of those of the class “anyone else” who happens to see this exchange?
Ohh and to add. One big ‘thing’ in the Soviet Union was research in phage therapy, hoping to replace antibiotics with it, but somehow they didn’t end up replacing antibiotics with homebrew phage therapy, something that I’d expect to happen if they were simply finding what they wanted to find, and otherwise not doing science. To summarize, I see this allegation of some grave fault but I fail to see the consequences of this fault. Nor did they end up having all the workers take some ‘nootropics’ that don’t work, or anything likewise stupid.
Well, perhaps a bit too simple. Consider this. You set your confidence level at 95% and start throwing a coin. You observe 100 tails out of 100. You publish a report saying “the coin has tails on both sides at a 95% confidence level” because that’s what you chose during design. Then 99 other researchers repeat your experiment with the same coin, arriving at the same 95%-confidence conclusion. But you would expect to see about 5 reports claiming otherwise! The paradox is resolved when somebody comes up with a trick using a mirror to observe both sides of the coin at once, finally concluding that the coin is two-tailed with a 100% confidence.
What was the mistake?
I don’t know if the original post was changed, but it explicitly addresses this point:
The actual situation is described this way:
I have a coin which I claim is fair: that is, there is equal chance that it lands on heads and tails, and each flip is independent of every other flip.
But when we look at 60 trials of the coin flipped 5 times (that is, 300 total flips), we see that there are no trials in which either 0 heads were flipped or 5 heads were flipped. Every time, it’s 1 to 4 heads.
This is odd- for a fair coin, there’s a 6.25% chance that we would see 5 tails in a row or 5 heads in a row in a set of 5 flips. To not see that 60 times in a row has a probability of only 2.1%, which is rather unlikely! We can state with some confidence that this coin does not look fair; there is some structure to it that suggests the flips are not independent of each other.
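The two numbers in this analogy can be reproduced directly. A small sketch, assuming a fair coin and independent trials; the helper name is made up for illustration.

```python
def prob_all_same_face(n_flips):
    """Chance that a fair coin shows all heads or all tails in n flips."""
    return 2 * 0.5 ** n_flips

# One trial of 5 flips: all heads or all tails.
p_extreme = prob_all_same_face(5)
# 60 independent trials, none of which is all heads or all tails.
p_never_extreme = (1 - p_extreme) ** 60

print(f"P(extreme result in one trial) = {p_extreme:.4f}")
print(f"P(no extreme result in 60 trials) = {p_never_extreme:.4f}")
```

This is the same "first term of the binomial" calculation as in the post, with 6.25% in place of 5%.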
One mistake is treating 95% as the chance of the study indicating two-tailed coins, given that they were two-tailed coins. More likely it was meant as the chance of the study not indicating two-tailed coins, given that they were not two-tailed coins.
Try this:
You want to test if a coin is biased towards heads. You flip it 5 times, and consider 5 heads as a positive result, 4 heads or fewer as negative. You’re aiming for 95% confidence but have to get 31⁄32 = 96.875%. Treating 4 heads as a positive result wouldn’t work either, as that would get you less than 95% confidence.
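A rough sketch of why the confidence level can only take certain values with 5 flips. The helper name `false_positive_rate` is made up for illustration; it computes the chance that a fair coin produces at least the threshold number of heads.

```python
from math import comb

def false_positive_rate(n_flips, min_heads):
    """P(at least min_heads heads in n_flips) for a fair coin."""
    favourable = sum(comb(n_flips, k) for k in range(min_heads, n_flips + 1))
    return favourable / 2 ** n_flips

for threshold in (5, 4):
    alpha = false_positive_rate(5, threshold)
    print(f"positive if >= {threshold} heads: "
          f"false positive rate = {alpha:.5f}, confidence = {1 - alpha:.5f}")
```

Requiring 5 heads gives a confidence of 31/32; accepting 4 or more drops it to 26/32 = 81.25%, so there is no rule with 5 flips that lands exactly on 95%.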
This doesn’t seem like a good analogy to any real-world situation. The null hypothesis (“the coin really has two tails”) predicts the exact same outcome every time, so every experiment should get a p-value of 1, unless the null-hypothesis is false, in which case someone will eventually get a p-value of 0. This is a bit of a pathological case which bears little resemblance to real statistical studies.
While the situation admittedly is oversimplified, it does seem to have the advantage that anyone can replicate it exactly at a very moderate expense (a two-headed coin will also do, with a minimum amount of caution). In that respect it may actually be more relevant to real world than any vaccine/autism study.
Indeed, every experiment should get a pretty strong p-value (though never exactly 1), but what gets reported is not the actual p but whether it is above .95 (which is an arbitrary threshold proposed once by Fisher who never intended it to play the role it plays in science currently, but merely as a rule of thumb to see if a hypothesis is worth a follow-up at all.) But even the exact p-values refer to only one possible type of error, and the probability of the other is generally not (1-p), much less (1-alpha).
I don’t see a paradox. After 100 experiments one can conclude that either the confidence level was set too low, or the papers are all biased toward two-tailed coins. But which is it?
(1) is obvious, of course, in hindsight. However, changing your confidence level after the observation is generally advised against. But (2) seems to be confusing Type I and Type II error rates.
On another level, I suppose it can be said that of course they are all biased! But they are biased by the actual two-tailed coin, rather than by researchers’ prejudice against normal coins.
Neglecting all of the hypotheses which would result in the mirrored observation which do not involve the coin being two tailed. The mistake in your question is the “the”. The final overconfidence is the least of the mistakes in the story.
Mistakes more relevant to practical empiricism: Treating “>= 95%” as “= 95%” is a reasoning error, resulting in overtly wrong beliefs. Choosing to abandon all information apart from the single boolean is a (less serious) efficiency error. Listeners can still be subjectively-objectively ‘correct’, but they will be less informed.
Hence my question in another thread: Was that “exactly 95% confidence” or “at least 95% confidence”? However when researchers say “at a 95% confidence level” they typically mean “p < 0.05”, and reporting the actual p-values is often even explicitly discouraged (let’s not digress into whether it is justified).
Yet the mistake I had in mind (as opposed to other, less relevant, merely “a” mistakes) involves Type I and Type II error rates. Just because you are 95% (or more) confident of not making one type of error doesn’t guarantee you an automatic 5% chance of getting the other.
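To make that concrete for the two-tailed coin story, here is a minimal sketch. It assumes the decision rule is "report a two-tailed coin only if every flip comes up tails" and that the null hypothesis is a fair coin; the function name and the `p_tails` parameter are illustrative, not something from the thread.

```python
def error_rates(n_flips, p_tails):
    """Error rates for the assumed rule: report "two-tailed" iff all flips are tails.

    p_tails is the true per-flip probability of tails for the coin under study.
    """
    type_1 = 0.5 ** n_flips           # a fair coin nevertheless shows all tails
    type_2 = 1 - p_tails ** n_flips   # the coin under study shows at least one head
    return type_1, type_2

type_1, type_2 = error_rates(n_flips=100, p_tails=1.0)
expected_dissenting = 100 * type_2    # among 100 studies of a truly two-tailed coin
print(type_1, type_2, expected_dissenting)
```

With a genuinely two-tailed coin the Type II rate is zero, so no dissenting reports are expected, whatever the Type I rate was set to.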
Yes, that’s suspicious. Good instinct. I’m sure there’s some bias against publishing a marginally-significant result that’s got a low (outside the framework of the paper’s statistical model) prior. I’d bet some of the unlucky ones got file-drawered, and others (dishonestly or not) kept on collecting more data until the noise (I presume) was averaged down.
However, you might be missing that on an iso-P contour, false positives have diminishing effect size as sample size increases.
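One way to see the iso-P point, as a sketch under simplifying assumptions: a two-sided one-sample z-test with known unit variance, where the effect size is the sample mean in standard-deviation units. The function name `effect_size_at_p` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def effect_size_at_p(p_value, n):
    """Observed standardized effect that yields exactly p_value at sample size n."""
    z = NormalDist().inv_cdf(1 - p_value / 2)  # two-sided critical value
    return z / sqrt(n)

for n in (25, 100, 400, 1600):
    print(f"n = {n:5d}: effect size at p = 0.05 is {effect_size_at_p(0.05, n):.3f}")
```

Holding p fixed, quadrupling the sample size halves the effect size, so a false positive that just clears the threshold in a large study corresponds to a small apparent effect.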
For those who don’t know what a case control or cohort study is:
“...Essentially, for a cohort study, you start at the point of exposure, and then follow individuals to see who develops the outcome. In a retrospective cohort, you find information that has recorded participants’ prior exposures. In a case control study, you start with a group of cases (those with the disease) and controls (those without disease) and then measure exposures. These two designs are similar, but they differ on the starting point (outcome or exposure).” - AL, UoM
What? No. Just… no. You can’t say “Because P(‘result’|H0)=.05, P(~‘result’|H1)=.05”.
The actual problem is that the surveys routinely ignore all the articles showing vaccines causing problems. There is always a lot of attention to whether thimerosal causes autism, or whether MMR causes autism. There is never any examination of whether the aluminum in vaccines causes autism, whether getting any vaccine at all at a critical period of brain development causes autism, or whether getting too many vaccines too soon causes autism*, for all of which there are published peer-reviewed papers indicating a likelihood that they do. Nor is there any examination of whether the contaminants in vaccines cause autism (contaminants suggested in the literature include human DNA, which is suggested to cause autism, simian retroviruses, simian virus SV-40, and mycoplasmas), nor any comparison of vaccinated to fully unvaccinated individuals. I did a quick survey of the surveys cited above, and none of them considers aluminum. I also wrote a survey article on some of the literature suggesting early and adjuvanted vaccines cause damage, and none of the dozens of peer-reviewed papers I found suggesting damage is cited in the IOM’s numerous 800+ page surveys. My survey can be found at http://whyarethingsthisway.com/2014/03/08/example-1-pediatrician-belief-is-opposite-the-published-scientific-evidence-on-early-vaccine-safety/
If you read later posts in that blog you will find more criticism of the cognitive biases and outright omissions in the vaccine safety and effectiveness literature. Basically, there is extensive scientific evidence indicating that any vaccine that happens to come during critical periods of brain development may harm development, and that various cumulative effects of vaccines may cause problems. The safety surveys basically only compare patients who got one specific vaccine and many others to patients who got many others, so they are blind to the kinds of damage that seem likely to be occurring. Also, the only randomized placebo-controlled study I know of that injected saline or vaccine randomly into children and followed their health (not whether they got a specific disease) for more than a few months found the vaccine recipients (a flu vaccine) got 4 times as many respiratory illnesses as the recipients of the placebo.
*The paper of Stefano et al. that is sometimes cited as showing this, upon closer examination, effectively only compares patients who got DTP and other vaccines to patients who may have gotten DTaP, but did not get DTP, and got other vaccines. See the survey above for more discussion. I’ve never seen any other paper mentioned on the subject.
I would be very grateful to people who can supply citations to articles I haven’t yet found contradicting my conclusions, for example other randomized placebo-controlled studies following health for a prolonged period, especially of children.
When you start looking at specific elements, contaminants, or compounds instead of the vaccine as a whole, the number of possible comparisons increases exponentially, and it would be impossible to analyze them all in detail. Just look at all the correlation comparisons that were made using the LessWrong survey (it’s near the end of the post in main). Those were just the ones with extremely high significance (a more stringent requirement than many published papers). If a contaminant in vaccines is the cause, for it to show up in a study focusing specifically on the contaminant, but not in a study of the vaccine itself, would indicate only a small correlation or a flaw in one of the studies, the latter being much more likely.
But from an outsider’s perspective, look at it this way: Who is more likely to be biased, medically trained doctors who went through 8 years of school and several more of residency learning about this stuff to reach their present position, or the “scientist” on that blog you linked who never even states his/her specialty, and others like him/her? From my perspective, you’ve got a huge uphill battle to convince me.
Well, read what I wrote. It’s demonstrable that the IOM and the other review boards are ignoring whole literatures, just by searching their PDF for all of the results I’ve cited. They have no defense on aluminum; they never even study the subject, and both animal results and epidemiology show it’s damaging. And it is being introduced into neonates in amounts hundreds of times greater than what they get from diet in the first six months, with the injections bypassing about 6 or 7 filters evolution has created to keep it out. They also have no coherent articles whatsoever studying the issue of whether too many vaccines too early are causing problems. Every single study they ever do compares individuals getting large numbers of vaccines to other individuals getting large numbers of vaccines, and they ignore all the epidemiology that compares people getting many vaccines to people getting more vaccines, like regressions on vaccine compliance across states or regressions on infant mortality by number of vaccines in national series, all of which show damage. The only randomized placebo test of a vaccine I’m aware of that followed the health of the children, rather than whether they got a specific disease, found recipients of the vaccine got 4 times as many respiratory ailments as recipients of the placebo. Comparisons of vaccinated to unvaccinated, for example in the third world where you get decent statistics, also report the vaccines are raising mortality dramatically. Every animal study of challenging the immune systems of post-natal animals with injections reports it’s damaging to immune system and/or brain development. All of this the surveys simply ignore, rather than rebutting it or providing any evidence on the other side.
If you don’t make the effort to look at the literature when I’ve laid it out for you, you are going to avoid discovering the plain fact that medicine has become a cargo cult science, and they don’t actually know what they are doing. The notion that committees of doctors and/or government officials make better than random decisions about health care practices, or understand the import of scientific literatures, is fanciful, but also empirically disproved. This may impact your health and wallet in a big way if I’m right, and I am, so it’s worth paying a little attention to. It’s also representative of an even wider phenomenon about the world: a lot of what you think is real is actually crowd think delusions.