Goetz is re-inventing a meta-analytic wheel here (which is nothing to be ashamed of). It certainly is the case that a body of results can be too good to be true. To Goetz’s examples, I’ll add acupuncture, but wait, that’s not all! We can add everything to the list: “Do Certain Countries Produce Only Positive Results? A Systematic Review of Controlled Trials” is a fun** paper which finds
In studies that examined interventions other than acupuncture [‘all papers classed as “randomized controlled trial” or “controlled clinical trial”’], 405 of 1100 abstracts met the inclusion criteria. Of trials published in England, 75% gave the test treatment as superior to control. The results for China, Japan, Russia/USSR, and Taiwan were 99%, 89%, 97%, and 95%, respectively. No trial published in China or Russia/USSR found a test treatment to be ineffective.
‘Excess significance’ is not a new concept (fun fact: people even use the phrase ‘too good to be true’ to summarize it, just like Goetz does) and is a valid sign of bias in whatever set of studies one is looking at, and as he says, you can treat it as a binomial to calculate the odds of n studies failing to hit their quota of 5% false positives and instead delivering 0% or whatever. But 5% here is just the lower bound; you can substantially improve on it by taking into account statistical power, which is basically how Schimmack’s ‘incredibility index’ works*. More recent is the p-curve approach, but I don’t understand that as well.
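To make the binomial point concrete, here is a rough sketch of the calculation (my own illustration, not anyone’s published code): given n published studies, an alpha of 0.05, and an assumed average power, how surprising is it that k of them came out statistically significant? The 60-study, 50%-power numbers are purely illustrative assumptions.

```python
# Rough sketch of the excess-significance calculation described above; the particular
# numbers (60 studies, 50% power) are illustrative assumptions, not from any dataset.
from scipy.stats import binom

def excess_significance(n, k, power, alpha=0.05):
    """P(at least k significant results out of n) under two scenarios."""
    p_if_all_null = binom.sf(k - 1, n, alpha)   # every tested effect is actually null
    p_if_all_real = binom.sf(k - 1, n, power)   # every effect is real, detected with the given power
    return p_if_all_null, p_if_all_real

# 60 trials, all 60 reporting a positive result, assuming ~50% average power:
print(excess_significance(n=60, k=60, power=0.5))
# -> (~1e-78, ~9e-19): implausible either way, which is the 'too good to be true' signal.
```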
To some extent, you can also diagnose this problem in funnel plots: if study data-points clump ‘too tightly’ within the cone of precision vs significance and you don’t see any small/low-power studies wandering over into the ‘bad’ area of point-estimates where random noise should be bouncing at least some of them, then there’s something funny going on with the data.
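And a minimal simulation sketch of that funnel-plot symptom (again my own illustration; the effect size, sample sizes, and strength of publication bias are all arbitrary assumptions): with strong publication bias, the small noisy studies that should scatter into the non-significant region simply never show up.

```python
# Minimal funnel-plot-style simulation; all parameters are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.2                                  # assumed true standardized effect
n_per_group = rng.integers(10, 200, size=500)      # per-group sample sizes
se = np.sqrt(2 / n_per_group)                      # rough SE of a standardized mean difference
estimate = rng.normal(true_effect, se)             # each study's observed effect
significant = np.abs(estimate / se) > 1.96

# Publication bias: non-significant studies get published only 10% of the time.
published = significant | (rng.random(len(se)) < 0.10)

small = n_per_group < 50
print("small studies simulated:                 ", int(small.sum()))
print("small studies that were significant:     ", int((small & significant).sum()))
print("small published studies, non-significant:", int((small & published & ~significant).sum()))
# In the published set, the low-power studies that should wander into the 'bad' area are mostly gone.
```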
* I say ‘basically’ because Schimmack intends his II for use in psychology papers of the sort which report, say, 5 experiments testing a particular hypothesis, and mirabile dictu, all 5 support the authors’ theory.
Now, if we considered only false positives, the odds that none of the 5 results is a false positive are 0.95^5, or 77.4% - so 5 positives isn’t especially damning, nothing like 60 papers all claiming positive results. But we can do better, by looking at the other kind of error.
Schimmack points out that you can look instead at the other side of the coin from alpha/false positives: statistical power, the odds of finding a statistically-significant result assuming the effect actually exists. Experiments usually have low power, like 50%, which means half of the paper’s experiments should have ‘failed’ even if the hypothesis were right. So now we ask instead, ‘since half the experiments should have failed even in the best case where we’re testing a true hypothesis, how likely is it that all 5 succeeded?’ The calculation is 0.5^5, or ~3% - so their results are truly incredible!
(If I understand the logic of NHST correctly, 5% is merely the guaranteed lower bound of error, due to the choice of 0.05 for alpha. But unless every experiment is run with a billion subjects and has statistical power of 100%, the real percentage of ‘failed’ studies should be much higher, with the exact amount based on how bad the power is.)
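Reproducing the arithmetic above (with the 50% power figure being the stated assumption):

```python
# 5 experiments in one paper, all reported as positive.
alpha, power, k = 0.05, 0.50, 5
print(f"alpha-only bound:   {(1 - alpha) ** k:.1%}")  # 0.95^5 ~ 77.4%: not very damning
print(f"power-based chance: {power ** k:.1%}")        # 0.5^5  ~  3.1%: 'incredible'
```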
** Did I say ‘fun’? I actually meant, ‘incredibly depressing’ and ‘makes me fear for the future of science if so much cargo cult science can be done in non-Western countries which have the benefit of centuries of scientific work and philosophy and many of whose scientists trained in the West, and yet somehow, it seems that the spirit of science just didn’t get conveyed, and science there has been corrupted into a hollow mockery of itself, creating legions of witch-doctors who run “experiments” and write “papers” and do “statistics” none of which means anything’.
Science is not a magic bullet against bad incentives. I am more optimistic; we are getting a lot done despite bad incentives.
But none of the incentives seem particularly strong there. It’s not offensive to any state religion, it’s not objectionable to local landlords, it’s not a subversive creed espoused by revolutionaries who want to depose the emperor. The bad incentives here seem to be small bureaucratic ones along the lines of it being easier to judge academics for promotion based on how many papers they publish. If genuine science can’t survive that and will degenerate into cargo cult science when hit by such weak incentives...
But none of the incentives seem particularly strong there.
The bad incentives here seem to be small bureaucratic ones along the lines of it being easier to judge academics for promotion based on how many papers they publish.
People respond strongly to this in the West also—“least publishable units”, etc.
it seems that the spirit of science just didn’t get conveyed
This is almost mystical wording. There is bad science in the West, and good science in the East. I would venture to guess that the crappy state of science in e.g. China is just due to the weak institutions/high corruption levels in their society. If you think you can get away with dumping plastic in milk, a little data faking is the least of your problems. As that gets better, science will get better too.
People respond strongly to this in the West also—“least publishable units”, etc.
And yet, at least clinical trials fail here, and we don’t have peer-review rings being busted or people throwing bales of money out the window as the police raid them for assisting academic fraud. (To name some recent Chinese examples.)
I would venture to guess that the crappy state of science in e.g. China is just due to the weak institutions/high corruption levels in their society.
Again, what incentives? If science cannot survive some ‘weak institutions’ abroad, which don’t strike me as any worse than, say, the Gilded Age in America (and keep in mind the relative per capita GDPs of China now and, say, the golden age of German science before WWII), how long can one expect it to last?
This is almost mystical wording.
It’s gesturing to society-wide factors of morality, values, and personality, yes, since it doesn’t seem to be related to more mundane factors like per capita GDP.
As that gets better, science will get better too.
Japan is a case in point here. Almost as bad as China on the trial metric despite over a century of Western-style science and a generally uncorrupt society which went through its growing pains decades ago.
I would venture to guess that the crappy state of science in e.g. China is just due to the weak institutions/high corruption levels in their society. If you think you can get away with dumping plastic in milk, a little data faking is the least of your problems.
That explains China and Russia/USSR, it doesn’t explain Japan and Taiwan.
The study was looking at English texts, not Russian, Chinese, or Japanese texts.
edit: a study on foreign language bias in German speaking countries:
Only 35% of German-language articles, compared with 62% of English-language articles, reported significant (p < 0.05) differences in the main endpoint between study and control groups (p = 0.002 by McNemar’s test)
And that’s Germans, for whom it is piss easy to learn English (compared to Russians, Chinese, or Japanese).
Why did you omit the part where a third of the sample was published in both English and German, and hence weakens the bias? (That is comparable to the overlap for Chinese & English publications.)
There’s something that just didn’t get conveyed: the English language. That paper, with its idiot finding, was looking at studies downloaded from Medline and presumably published in English, or at least with an English abstract (the search was done for English terms and no translation efforts were mentioned).
As long as researchers retain freedom to either write their study up in English or not there’s going to be an additional publication-in-a-very-foreign-language bias.
With regards to acupuncture, one thing that didn’t happen is the Soviet Union being full of acupuncture centres and posters about the awesomeness of acupuncture everywhere on the walls, something that would have happened if there was indeed such a high prevalence of positive findings in the locally available literature.
As long as researchers retain freedom to either write their study up in English or not there’s going to be an additional publication-in-a-very-foreign-language bias.
As a rule of thumb, I would say that any research published after the early 1990s in a language other than English is most likely crap.
Why do you think it changed, and in the early 1990s specifically? (The original study I posted only examined ’90s papers and so couldn’t show any time-series like that, so it can’t be why you think that.)
I suppose that before the 1990s respectable Soviet scientists published primarily in Russian.
As long as researchers retain freedom to either write their study up in English or not there’s going to be an additional publication-in-a-very-foreign-language bias.
Yes, but it’s not sufficient to explain the results. To use your German example, even a doubling of significance rates in vernacular vs English doesn’t give one ~100% success rate in evaluating treatments since their net success rate across the 3 categories is going to be something like 40%. Nor is publishing in English going to be a rare and special event, regardless of how hard English is to learn, because publishing in high-impact English-language journals is part of how Chinese universities are ranked and people are rewarded.
With regards to acupuncture, one thing that didn’t happen is the Soviet Union being full of acupuncture centres and posters about the awesomeness of acupuncture everywhere on the walls
Uh huh. But acupuncture is not part of the Russian cultural heritage. What I do see instead is, to name one example (what with not being a Russian familiar with the particular pathologies of Russian science), tons of bogus nootropics studies (they come up on /r/nootropics periodically as people discover yet another translated abstract on Pubmed of a sketchy substance cursorily tested in animals), because interest in human enhancement is part of Russian culture.
Unsurprisingly, pseudo-medicine and pseudo-science will vary by region—which is, after all, the point of comparing acupuncture studies in the West to studies in East Asia! (If there were millions of acupuncture fanatics in Russia and the UK and the USA just like in China/Korea/Japan, then what would we learn, exactly, from comparing studies?) We expect there to be regional differences and that the West will be less committed & more disinterested than East Asia, closer to the ground truth, and hence the difference gives us a lower bound on how big the biases are.
Nor is publishing in English going to be a rare and special event
Publication in general doesn’t have to be rare and special, only the publication of negative results has to be uncommon. People just care less about publishing negative results and prefer to publish positive results; if there’s X amount of effort available for publication in a foreign language, and the positive studies already use up all of the X, no X is left for negative results… There are other issues, e.g. how many of those tests were re-testing simple, effective FDA-approved drugs and such?
Also, for the Soviet Union, there would be a certain political advantage in finding no efficacy of drugs that are expensive to manufacture or import. And one big aspect of Soviet backwardness was always the disbelief that something actually works.
Even assuming that the publications always found whatever the experimenter wanted to find, it wouldn’t explain why predominantly an effect is found. What of the chemical safety studies? There’s a very strong bias to fail to disprove the null hypothesis.
Unsurprisingly, pseudo-medicine and pseudo-science will vary by region
Yet your paper somehow found a ridiculously high positive rate for acupuncture. The way I think it would work: first, it’s very difficult to blind acupuncture studies, and inadequately blinded experiments should find a positive result from the placebo effect; secondly, because that’s the case, nobody really cares about that effect; and thirdly, de facto the system did not result in the construction of acupuncture centres.
I haven’t really noticed nootropics being a big thing, and various rat maze studies were and are largely complete crap anyway, to the point that the impact of the experimenter’s gender was only discovered recently.
edit: also, if we’re looking at Russia from 1991 to 1998, that was the time when scientists and other such government employees were literally not getting paid their wages. I remember that time: my parents were not paid for months at a time and were reselling shampoo on the side to get some cash.
Publication in general doesn’t have to be rare and special, only the publication of negative results has to be uncommon.
I realize that, and I’ve already pointed out why the difference in rates is not going to be that large & that your cite does not explain the excess significance in their sample.
There are other issues, e.g. how many of those tests were re-testing simple, effective FDA-approved drugs and such?
Doesn’t matter that much. Power, usually quite low, sets the upper limit to how many of the results should have been positive even if we assume every single one was testing a known-efficacious drug (which hypothesis raises its own problems: how is that consistent with your claims about the language bias towards publishing cool new results?)
Also, for the Soviet Union, there would be a certain political advantage in finding no efficacy of drugs that are expensive to manufacture or import.
So? I don’t care why the Russian literature is biased, just that it is.
What of the chemical safety studies? There’s a very strong bias to fail to disprove the null hypothesis.
Yes, but toxicology studies done by industry are not aimed at academic publication, and the ones aimed at academic publication have the usual incentives to find something and so are part of the overall problem.
Yet your paper somehow found a ridiculously high positive rate for acupuncture. The way I think it would work: first, it’s very difficult to blind acupuncture studies, and inadequately blinded experiments should find a positive result from the placebo effect;
Huh? The paper finds that acupuncture positive-result rates vary by region: USA/Sweden/Germany 53%/59%/63%, China/Japan/Taiwan 100%, etc.
secondly, because that’s the case, nobody really cares about that effect; and thirdly, de facto the system did not result in the construction of acupuncture centres.
How much have you looked? There’s plenty of acupuncture centres in the USA despite a relatively low acupuncture success rate.
I haven’t really noticed nootropics being a big thing
Does a fish notice water? But fine, maybe you don’t; feel free to supply your own example of Russian pseudoscience and traditional medicine. I doubt Russian science is a shining jewel of perfection with no faults, given its 91% acupuncture success rate (admittedly on a small base).
Well, humans do notice air some of the time. (SCNR.)
but somehow they didn’t end up replacing antibiotics with homebrew phage therapy
When antibiotics were discovered in 1941 and marketed widely in the U.S. and Europe, Western scientists mostly lost interest in further use and study of phage therapy for some time.[12] Isolated from Western advances in antibiotic production in the 1940s, Russian scientists continued to develop already successful phage therapy to treat the wounds of soldiers in field hospitals. During World War II, the Soviet Union used bacteriophages to treat many soldiers infected with various bacterial diseases e.g. dysentery and gangrene. Russian researchers continued to develop and to refine their treatments and to publish their research and results. However, due to the scientific barriers of the Cold War, this knowledge was not translated and did not proliferate across the world.
Anyway,
To summarize, I see this allegation of some grave fault but I fail to see the consequences of this fault.
How do you see the unseen? Unless someone has done a large definitive RCT, how does one ever prove that a result was bogus? Nobody is ever going to take the time and resources to refute those shitty animal experiments with a much better experiment. Most scientific findings never get that sort of black-and-white refutation; they just get quietly forgotten and buried, and even the specialists don’t know about them. Most bad science doesn’t look like Lysenko. Or look at evidence-based medicine in the West: rubbish medicine doesn’t look like a crazy doc slicing open patients with a scalpel, it just looks like regular old medicine which ‘somehow’ turns up no benefit when rigorously tested and is quietly dropped from the medical textbooks.
To diagnose bad science, you need to look at overall metrics and indirect measures—like excess significance. Like 91% of acupuncture studies working.
Doesn’t matter that much. Power, usually quite low...
If you want to persist in your mythical ideas regarding western civilization by postulating what ever you need and making shit up, there’s nothing I or anyone else can do about it.
So? I don’t care why the Russian literature is biased, just that it is.
Your study is making a more specific claim than mere bias in research, it’s claiming bias in one particular direction.
Not sure that’s a good example, as Wikipedia seems to disagree about homebrew phage therapy not being applied: https://en.wikipedia.org/wiki/Phage_therapy#History
The point is that the SU was, mostly, using antibiotics (once production was set up, i.e. from some time after ww2).
There’s plenty of acupuncture centres in the USA despite a relatively low acupuncture success rate.
Well, and there weren’t plenty in the Soviet Union despite a supposedly higher success rate.
Huh? The paper finds that acupuncture positive-result rates vary by region: USA/Sweden/Germany 53%/59%/63%, China/Japan/Taiwan 100%, etc.
If you don’t know the correct rate you can’t tell which specific rate is erroneous. It’s not realistically possible to construct a blind study of acupuncture, so, unlike, say, homoeopathy, it is a very shitty measure of research errors.
To diagnose bad science, you need to look at overall metrics and indirect measures—like excess significance. Like 91% of acupuncture studies working.
I really doubt that 91% of Russian-language acupuncture studies published in the Soviet Union found a positive effect (I dunno about 1991-1998 Russia, it was fucked up beyond belief at that time), and I don’t know how many studies should have found a positive effect (followed by a note that more adequate blinding must be invented to study it properly).
And we know that whatever was the case, there was no Soviet abandonment of normal medicine in favour of acupuncture—the system somehow worked out ok in the end.
If you want to persist in your mythical ideas regarding western civilization by postulating what ever you need and making shit up, there’s nothing I or anyone else can do about it.
That’s not a reply to what I wrote.
I was referring to your other comment.
Your study is making a more specific claim than mere bias in research, it’s claiming bias in one particular direction.
Yes, that’s what a bias is. A systematic tendency in one direction. As opposed to random error.
The point is that the SU was, mostly, using antibiotics (once production was set up, i.e. from some time after ww2).
And before that, they were using phages despite apparently pretty shaky evidence it was anything but a placebo. That said, pointing out the systematic bias of Russian science (among many other countries, and I’m fascinated, incidentally, how the only country you’re defending like this is… your own. No love for Korea?) does not commit me to the premise that phages do or do not work—you’re the one who brought them up as an example of how excellent Russian science is, not me.
Well, and there weren’t plenty in the Soviet Union despite a supposedly higher success rate.
How many are there now? Shouldn’t you have looked that up?
If you don’t know the correct rate you can’t tell which specific rate is erroneous.
Difference in rates is prima facie evidence of bias, due to the disagreement. If someone says A and someone else says not-A, you don’t need to know what A actually is to observe the contradiction and know at least one party is wrong.
It’s not realistically possible to construct a blind study of acupuncture
Yes it is.
I really doubt that 91% of Russian-language acupuncture studies published in the Soviet Union found a positive effect (I dunno about 1991-1998 Russia, it was fucked up beyond belief at that time), and I don’t know how many studies should have found a positive effect (followed by a note that more adequate blinding must be invented to study it properly).
And naturally, you have not looked for anything on the topic, you just doubt it.
And we know that whatever was the case, there was no Soviet abandonment of normal medicine in favour of acupuncture—the system somehow worked out ok in the end.
Strawman. No country engages in ‘abandonment of normal medicine’ - if you go to China, do you only find acupuncturists? Of course not. The problem is that you find acupuncturists sucking up resources dispensing expensive placebos, and you find that the scientific community is not strong enough to resist the cultural & institutional pressures and find that acupuncture doesn’t work, resulting in real working medicine being intermeshed with pseudomedicine.
Fortunately, normal medicine (after tremendous investments in R&D and evidence-based medicine) currently works fairly well and I think it would take a long time for it to decay into something as overall bad as pre-modern Western medicine was; I also think some core concepts like germ theory are sufficiently simple & powerful that they can’t be lost, but that would be cold comfort in the hypothetical cargo cult scenario (‘good news: doctors still know what infections are and how to fight epidemics; bad news: everything else they do is so much witch-doctor mumbo-jumbo based on unproven new therapies, misinterpretations of old therapies which used to work, and traditional treatments like acupuncture’).
Difference in rates is prima facie evidence of bias, due to the disagreement. If someone says A and someone else says not-A, you don’t need to know what A actually is to observe the contradiction and know at least one party is wrong.
Unless A contains indexicals that point to different things in the two cases.
(Maybe Asian acupuncturists are better than European ones, or maybe East Asians respond better to acupuncture than Caucasians for some reason, or...)
(I’m not saying that this is likely, just that it’s possible.)
among many other countries, and I’m fascinated, incidentally, how the only country you’re defending like this is… your own.
That’s the one I know most about, obviously. I have no clue about what’s going on in China, Korea, or Japan.
does not commit me to the premise that phages do or do not work
Look, it doesn’t matter if phages work or don’t work! The treatment, in favour of which there would be strong bias, got replaced with another treatment, which there would have been bias against. Something that wouldn’t have happened if science systematically failed to work in such an extreme and ridiculous manner. I keep forgetting that I really really need to spell out any conclusions when arguing with you. It’s like you’re arguing that a car is missing the wheels but I just drove here on it.
That’s the one I know most about, obviously. I have no clue about what’s going on in China, Korea, or Japan.
So why do you think your defense would not apply equally well (or poorly) to them? What’s the phage of China?
Look, it doesn’t matter if phages work or don’t work! The treatment, in favour of which there would be strong bias, got replaced with another treatment, which there would have been bias against. Something that wouldn’t have happened if science systematically failed to work in such an extreme and ridiculous manner.
Oh wow. What a convincing argument. ‘Look, some Russians once did this! Now they do that! No, it doesn’t matter if they were right or wrong before or after!’ Cool. So does that mean I get to point to every single change of medical treatment in the USA as evidence it’s just peachy there? ‘Look, some Americans once did lobotomy! Now they don’t! It doesn’t matter if lobotomies work or don’t work!’
I keep forgetting that I really really need to spell out any conclusions when arguing with you. It’s like you’re arguing that a car is missing the wheels but I just drove here on it.
You didn’t drive shit anywhere.
Besides, the 90%+ proportion of positive results is also the case in the west (also, in the past we had stuff like lobotomy in the west)
That’s on a different dataset, covering more recent time periods, which, as the abstract says, still shows serious problems in East Asia (compromised by a relatively small sample: trying to show trends in ‘AS’ using 204 studies over 17 years isn’t terribly precise compared to the 2627 they have for the USA), with the latest data being 85% vs 100%. And 100% significance is a ceiling, so who knows how bad the East Asian research has actually gotten during the same time period in which Western numbers continue to deteriorate...
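As a back-of-the-envelope illustration of the precision point (a rough sketch of my own; it simply treats an 85% positive-result rate and the two sample sizes quoted above as given, and uses a normal-approximation interval):

```python
# Rough 95% confidence intervals for a positive-result rate of 85% at the two sample sizes.
import math

def approx_ci(p, n, z=1.96):
    se = math.sqrt(p * (1 - p) / n)
    return round(p - z * se, 3), round(p + z * se, 3)

print(approx_ci(0.85, 204))    # ~ (0.801, 0.899): a 204-study subsample is fairly noisy
print(approx_ci(0.85, 2627))   # ~ (0.836, 0.864): the 2627-study sample pins it down
```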
If you want to persist in your mythical ideas regarding western civilization by postulating what ever you need and making shit up, there’s nothing I or anyone else can do about it.
You are trying to pull an “everyone and me is against you” stunt against Gwern? Do you have any idea how dumbfoundingly absurd this would sound to most of those of the class “anyone else” who happen to see this exchange?
Ohh and to add. One big ‘thing’ in the Soviet Union was research in phage therapy, hoping to replace antibiotics with it, but somehow they didn’t end up replacing antibiotics with homebrew phage therapy, something that I’d expect to happen if they were simply finding what they wanted to find, and otherwise not doing science. To summarize, I see this allegation of some grave fault but I fail to see the consequences of this fault. Nor did they end up having all the workers take some ‘nootropics’ that don’t work, or anything likewise stupid.
Well, perhaps a bit too simple. Consider this. You set your confidence level at 95% and start flipping a coin. You observe 100 tails out of 100. You publish a report saying “the coin has tails on both sides at a 95% confidence level” because that’s what you chose during design. Then 99 other researchers repeat your experiment with the same coin, arriving at the same 95%-confidence conclusion. But you would expect to see about 5 reports claiming otherwise! The paradox is resolved when somebody comes up with a trick using a mirror to observe both sides of the coin at once, finally concluding that the coin is two-tailed with 100% confidence. What was the mistake?
I don’t know if the original post was changed, but it explicitly addresses this point:
The actual situation is described this way:
I have a coin which I claim is fair: that is, there is equal chance that it lands on heads and tails, and each flip is independent of every other flip.
But when we look at 60 trials of the coin flipped 5 times (that is, 300 total flips), we see that there are no trials in which either 0 heads were flipped or 5 heads were flipped. Every time, it’s 1 to 4 heads.
This is odd - for a fair coin, there’s a 6.25% chance that we would see 5 tails in a row or 5 heads in a row in a set of 5 flips. To not see that 60 times in a row has a probability of only 2.1%, which is rather unlikely! We can state with some confidence that this coin does not look fair; there is some structure to it that suggests the flips are not independent of each other.
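A quick check of the arithmetic in the quoted passage (a trivial sketch; it only restates the numbers already given):

```python
# Chance a fair coin gives all heads or all tails in 5 flips, and the chance of
# never seeing that across 60 independent sets of 5 flips.
p_extreme = 2 * 0.5 ** 5            # 0.0625, i.e. 6.25%
p_never_in_60 = (1 - p_extreme) ** 60
print(p_extreme, round(p_never_in_60, 3))   # 0.0625, ~0.021
```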
One mistake is treating 95% as the chance of the study indicating two-tailed coins, given that they were two-tailed coins. More likely it was meant as the chance of the study not indicating two-tailed coins, given that they were not two-tailed coins.
Try this:
You want to test if a coin is biased towards heads. You flip it 5 times, and consider 5 heads as a positive result, 4 heads or fewer as negative. You’re aiming for 95% confidence but actually get 31/32 = 96.875%. Treating 4 heads as a positive result wouldn’t work either, as that would get you less than 95% confidence.
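A small sketch of the discreteness problem this points at: with only 5 flips of a fair coin, no cutoff gives exactly 95% confidence.

```python
# Type I error rates for the two possible cutoffs when flipping a fair coin 5 times.
from math import comb

n = 5
p_all_5 = comb(n, 5) / 2 ** n                  # require all 5 heads: 1/32
p_4_or_5 = (comb(n, 4) + comb(n, 5)) / 2 ** n  # count 4 or 5 heads as positive: 6/32
print(1 - p_all_5, 1 - p_4_or_5)               # 0.96875 vs 0.8125: overshoots or undershoots 95%
```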
This doesn’t seem like a good analogy to any real-world situation. The null hypothesis (“the coin really has two tails”) predicts the exact same outcome every time, so every experiment should get a p-value of 1, unless the null-hypothesis is false, in which case someone will eventually get a p-value of 0. This is a bit of a pathological case which bears little resemblance to real statistical studies.
While the situation admittedly is oversimplified, it does seem to have the advantage that anyone can replicate it exactly at a very moderate expense (a two-headed coin will also do, with a minimum amount of caution). In that respect it may actually be more relevant to real world than any vaccine/autism study.
Indeed, every experiment should get a pretty large p-value (though never exactly 1), but what gets reported is not the actual p but whether it is below .05 (which is an arbitrary threshold proposed once by Fisher, who never intended it to play the role it plays in science currently, but merely as a rule of thumb to see if a hypothesis is worth a follow-up at all). But even the exact p-values refer to only one possible type of error, and the probability of the other is generally not (1-p), much less (1-alpha).
I don’t see a paradox. After 100 experiments one can conclude that either (1) the confidence level was set too low, or (2) the papers are all biased toward two-tailed coins. But which is it?
(1) is obvious, of course—in hindsight. However, changing your confidence level after the observation is generally advised against. But (2) seems to be confusing Type I and Type II error rates.
On another level, I suppose it can be said that of course they are all biased! But, by the actual two-tailed coin rather than researchers’ prejudice against normal coins.
Neglecting all of the hypotheses which would result in the mirrored observation but do not involve the coin being two-tailed. The mistake in your question is the “the”. The final overconfidence is the least of the mistakes in the story.
Mistakes more relevant to practical empiricism: Treating “>= 95%” as “= 95%” is a reasoning error, resulting in overtly wrong beliefs. Choosing to abandon all information apart from the single boolean is a (less serious) efficiency error. Listeners can still be subjectively-objectively ‘correct’, but they will be less informed.
Hence my question in another thread: Was that “exactly 95% confidence” or “at least 95% confidence”? However, when researchers say “at a 95% confidence level” they typically mean “p < 0.05”, and reporting the actual p-values is often even explicitly discouraged (let’s not digress into whether it is justified).
Yet the mistake I had in mind (as opposed to other, less relevant, merely “a” mistakes) involves Type I and Type II error rates. Just because you are 95% (or more) confident of not making one type of error doesn’t guarantee you an automatic 5% chance of getting the other.
Simple statistics, but eye-opening. I wonder if gwern would be interested enough to do a similar analysis, or maybe he already has.