One reason is that experiments are designed to be repeatable, so we repeat them, or variations of them. Observational studies are really just sophisticated analyses of something that has already happened, so one can more successfully appeal to “this time/context is different.”
Another reason is that observational studies often rely on proprietary/secret data. Publicly-available datasets are less susceptible to this excuse, but even those are sometimes paired with proprietary data to develop the research.
I was literally in a seminar this week where Person A was explaining their analysis of proprietary data and found a substitution effect. Person B said they had their own paper with a different proprietary dataset and found a complementarity effect. Without fraud, with perfectly reproducible code, and with utter robustness to alternative specifications, these effects could still coexist. But good luck testing robustness beyond the robustness tests done by the original author(s), good luck testing if the analysis is reproducible, and good luck testing whether any of the data is fraudulent...without the data.
I’m not even advocating for open data, just explaining that the trade-secret and “uniqueness” differential makes observational studies less conducive to that kind of scrutiny. For what it’s worth, reviewers/authors tend to demand/provide more specification robustness tests for observational than experimental data. Some of that has to do with the judgment calls inherent in trying to handle endogeneity issues in observational research (experiments can appeal to random assignment) and with covering bases with respect to those “so many more [researcher] degrees of freedom.”
A third reason is simply that the crisis struck effects tested with low-powered studies. Because publication passed through a statistical significance filter, the published effects were more likely to be overestimates of the true effects. The large N of observational studies mitigates this source of overestimation but raises the risk of finding practically trivial effects (which is why you’ll sometimes see justifications for how the effect size is actually practically meaningful).
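To make that filter concrete, here is a minimal simulation sketch (mine, not from the original discussion): a two-group comparison with an assumed true standardized effect of 0.2, where only results with p < .05 are “published.” The sample sizes, effect size, and variable names are illustrative assumptions, not anything from the studies mentioned above.

```python
# Sketch of the statistical significance filter: when power is low, the
# estimates that clear p < .05 are, on average, inflated relative to the
# true effect; with large N the inflation mostly disappears.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
TRUE_D = 0.2     # assumed true standardized mean difference (illustrative)
N_SIMS = 5_000   # simulated studies per sample size

for n_per_group in (20, 200, 2000):  # small experiment vs. large observational N
    sig_estimates = []
    for _ in range(N_SIMS):
        treat = rng.normal(TRUE_D, 1.0, n_per_group)
        control = rng.normal(0.0, 1.0, n_per_group)
        _, p = stats.ttest_ind(treat, control)
        if p < 0.05:  # the publication filter: keep only "significant" results
            sig_estimates.append(treat.mean() - control.mean())
    print(
        f"n/group={n_per_group:>4}: "
        f"power ~ {len(sig_estimates) / N_SIMS:.2f}, "
        f"mean significant estimate ~ {np.mean(sig_estimates):.2f} "
        f"(true effect {TRUE_D})"
    )
```

Under these assumptions, the small-sample runs pass the filter rarely and report effects several times larger than the truth, while the large-N runs recover roughly the true 0.2, which is the trade-off described above: less overestimation, but more pressure to argue that a small effect is practically meaningful.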