Yeah, that was my reaction too: regardless of intentions, the scientific method is, in the “soft” sciences, frequently not arriving at the truth.
The follow-up question should of course be: how can we fix it? Or, more pragmatically: how can you identify whether a study’s conclusion can be trusted? I’ve seen calls to improve all the things that are broken right now: reduce p-hacking and publication bias, aim for lower p-values, spread better knowledge of statistics, do more robustness checks, and so on. This post adds to the list of things that must be fixed before studies are reliable.
But one thing I’ve wondered is: what about focusing more on studies that find large effects? There are two advantages: (i) it’s harder to miss a large effect, making the conclusion more reliable and easier to reproduce, and (ii) if the effect is small, it doesn’t matter as much anyway. For example, I trust the research on the planning fallacy more because the effect is so pronounced. And I’m much more interested to know about things that are very carcinogenic than about things that are just barely carcinogenic enough to be detected.
So, has someone written the book “Top 20 Largest Effects Found in [social science / medicine / etc.]”? I would buy it in a heartbeat.
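To put a rough number on point (i), here is a minimal simulation sketch, entirely my own illustration with made-up sample sizes and effect sizes: with the same number of subjects, a large effect is detected (and hence reproduced) far more often than a small one.

```python
# Illustrative only: same sample size, small vs. large true effect,
# count how often a standard t-test "detects" the effect at p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30          # subjects per group (hypothetical)
trials = 5000   # simulated replications of the study

for true_effect in (0.2, 0.8):  # effect size in standard deviations
    detected = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_effect, 1.0, n)
        _, p = stats.ttest_ind(treated, control)
        detected += p < 0.05
    print(f"effect of {true_effect} sd: detected in {detected / trials:.0%} of studies")
```

With these invented numbers the small effect is missed most of the time, which is the replication problem in miniature.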
> I’ve seen calls to improve all the things that are broken right now: <list>
I think this is a flaw in and of itself. There are many, many ways to go wrong, and the entire standard list (p-hacking, selective reporting, multiple stopping criteria, you name it) should be interpreted more as symptoms than as causes of a scientific crisis.
The crux of the whole scientific approach is that you empirically separate hypothetical universes. You do this by making your universe-hypotheses spit out predictions, and then checking those predictions against observation. It seems to me that by and large this process is ignored or even completely absent when we start asking difficult soft-science questions. And to clarify: I don’t particularly blame any researcher, institute, publisher or peer reviewer. I think the task at hand is so inhumanly difficult that collectively we are not up to it, and instead we create some semblance of science and call it a day.
From a distanced perspective, I would like my entire scientific process to look like reverse-engineering a big black box labeled ‘universe’. It has input buttons and output channels. Our paradigms postulate correlations between input settings and outputs, and then an individual hypothesis makes a claim about the input settings. We track forward what outputs would be caused by each possible input setting, observe reality, and update with Bayesian odds ratios.
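As a toy sketch of that picture (my own illustration, not anything from the post): two hypothetical “input settings” each predict a distribution over an observable output, and the observation updates our odds between them by a likelihood ratio.

```python
# Toy Bayesian update between two hypothetical "input settings" of the
# black box; all numbers are invented for illustration.
from scipy import stats

# Hypothesis A: the intervention lowers the output; Hypothesis B: it does nothing.
predict_A = stats.norm(loc=18.0, scale=3.0)  # predicted output under A
predict_B = stats.norm(loc=20.0, scale=3.0)  # predicted output under B

observed = 18.5
prior_odds = 1.0  # odds of A over B before seeing the data

likelihood_ratio = predict_A.pdf(observed) / predict_B.pdf(observed)
posterior_odds = prior_odds * likelihood_ratio
print(f"likelihood ratio {likelihood_ratio:.2f} -> posterior odds {posterior_odds:.2f}")
# A ratio close to 1 means the observation barely separates the hypotheses,
# which is exactly the failure mode discussed next.
```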
The problem is frequently that the data we are relying on is influenced by an absolutely gargantuan number of factors; take the teenage pregnancy rate from the OP as an example. I have no trouble believing that statewide schooling laws have some impact on it, but possibly so do above-average summer weather, people’s religious background, the ratio of boys to girls in a community, economic (in)stability, recent natural disasters and many more factors. So, having observed the teenage pregnancy rates, inferring the impact of the statewide schooling laws is a nigh-impossible task. Even trying to put this into words, my mind immediately translated it to “what fraction of the state-by-state variance in teenage pregnancy rates can be attributed to this factor, and what fraction to other factors”, but even that is already an oversimplification. Why are we comparing states at a fixed time, instead of tracking states over time, or even taking each state-time snapshot as an individual dataset? And why would a linear correlation model be accurate? Who says we can split the multi-factor model into additive components (as the talk of fractions implies)?
The point I am failing to make is that in this case it is not at all clear what difference in the pregnancy rates we would observe if the statewide schooling laws had a decidedly negative, slightly negative, slightly positive or decidedly positive impact, as opposed to one or several of the other factors dominating the observed effects. And without that causal connection we can never infer the impact of these laws from the observed data. This is not a matter of p-hacking or biased science or anything of the sort: the approach doesn’t have the (information-theoretic) power to discern the answer we are looking for in the first place, i.e. to single out the true hypothesis from among the false ones.
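A toy simulation may make this concrete (entirely my own illustration; the 50-state setup and every number are invented): the cross-state gap we can actually measure is the law’s true effect plus an unknown shift contributed by all the other factors, and the data alone give no way to separate the two.

```python
# Toy simulation with made-up numbers: whatever the law's true effect is,
# the measured cross-state gap also contains an unknown shift from all the
# other factors, which can be as large as the effect itself.
import numpy as np

rng = np.random.default_rng(1)
n_states = 50
has_law = rng.random(n_states) < 0.5             # which states passed the law

# Hypothetical dominating factors (religion, economy, weather, ...)
other_factors = rng.normal(0.0, 5.0, n_states)   # sd of 5 per 1000 teens
shift = other_factors[has_law].mean() - other_factors[~has_law].mean()
print(f"shift contributed by other factors: {shift:+.1f} per 1000")

for true_effect in (-2.0, -0.5, +0.5, +2.0):     # law's effect, per 1000 teens
    rates = 20.0 + true_effect * has_law + other_factors
    measured_gap = rates[has_law].mean() - rates[~has_law].mean()
    print(f"true effect {true_effect:+.1f} -> measured gap {measured_gap:+.1f}")
```

Several quite different true effects are compatible with the same measured gap, because nothing in the observed rates tells us the size of the shift.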
As for your pragmatic question of how to tell whether a study is to be trusted: I’d recommend asking experts in your field first, and only listening to cynics second. If you insist on asking me, my method is to evaluate whether it seems plausible that, assuming the paper’s conclusion holds, it would show up as the effect the paper reports. At the same time I try to think of several other explanations for the same data. If either of these attempts gives a resounding result, I tend to chuck the study in the bin. This approach is fraught with confirmation bias (“it seems implausible to me because my view of the world suggests you shouldn’t be able to measure an effect like this”), but I don’t have a better model of the world to consult than my model of the world.