My claims are really just for CS, idk how much they apply to the social sciences, but the post gives me no reason to think they aren’t true for the social sciences as well.
Just stop citing bad research, I shouldn’t need to tell you this, jesus christ what the fuck is wrong with you people.
This doesn’t work unless it’s common knowledge that the research is bad, since reviewers are looking for reasons to reject and “you didn’t cite this related work” is a classic one (and your paper might be reviewed by the author of the bad work). When I was early in my PhD, I had a paper rejected where it sounded like a major contributing factor was not citing a paper that I specifically thought was not related but the reviewer thought was.
Read the papers you cite. Or at least make your grad students do it for you. It doesn’t need to be exhaustive: the abstract, a quick look at the descriptive stats, a good look at the table with the main regression results, and then a skim of the conclusions. Maybe a glance at the methodology if they’re doing something unusual. It won’t take more than a couple of minutes. And you owe it not only to SCIENCE!, but also to yourself: the ability to discriminate between what is real and what is not is rather useful if you want to produce good research.
I think the point of this recommendation is to get people to stop citing bad research. I doubt it will make a difference since as argued above the cause isn’t “we can’t tell which research is bad” but “despite knowing what’s bad we have to cite it anyway”.
When doing peer review, reject claims that are likely to be false. The base replication rate for studies with p>.001 is below 50%. When reviewing a paper whose central claim has a p-value above that, you should recommend against publication unless the paper is exceptional (good methodology, high prior likelihood, etc.). If we’re going to have publication bias, at least let that be a bias for true positives. Remember to subtract another 10 percentage points for interaction effects. You don’t need to be complicit in the publication of false claims.
I have issues with this, but they aren’t related to me knowing more about academia than the author, so I’ll skip it. (And it’s more like, I’m uncertain about how good an idea this would be.)
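(As an aside, for readers wondering how a nominally significant result can still replicate less than half the time, a toy positive-predictive-value calculation shows the mechanism. This is just my own illustration, not the author’s math, and the prior and power numbers below are assumptions I picked for the sake of the example.)

```python
# Toy calculation: P(effect is real | p < alpha) under a simple
# two-hypothesis model, i.e. an Ioannidis-style positive predictive value.
# The prior (fraction of tested hypotheses that are true) and the power
# are made-up assumptions for illustration, not figures from the post.

def positive_predictive_value(prior: float, power: float, alpha: float) -> float:
    true_positives = prior * power          # real effects that reach significance
    false_positives = (1 - prior) * alpha   # null effects that reach significance
    return true_positives / (true_positives + false_positives)

for alpha in (0.05, 0.005, 0.001):
    ppv = positive_predictive_value(prior=0.1, power=0.5, alpha=alpha)
    print(f"alpha = {alpha}: P(true | significant) ~= {ppv:.2f}")
```

With those (made-up) numbers, a significant result at alpha = .05 is roughly a coin flip, while tightening the threshold pushes the post-study probability up substantially, which is presumably the intuition behind the recommendation.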
Stop assuming good faith. I’m not saying every academic interaction should be hostile and adversarial, but the good guys are behaving like dodos right now and the predators are running wild.
The evidence in the post suggesting that people aren’t acting in good faith is roughly “if you know statistics then it’s obvious that the papers you’re writing won’t replicate”. My guess is that many social scientists don’t know statistics and/or don’t apply it intuitively, so I don’t see a reason to reject the (a priori more plausible to me) hypothesis that most people are acting in okay-to-good faith.
I don’t really understand the author’s model here, but my guess is that they are assuming that academics primarily think about “here’s the dataset and here are the analysis results and here are the conclusions”. I can’t speak to social science, but when I’m trying to figure out some complicated thing (e.g. “why does my algorithm work in setting X but not setting Y”) I spend most of my time staring at data, generating hypotheses, making predictions with them, etc., which is very, very conducive to the garden of forking paths that the author dismisses out of hand.
EDIT: Added some discussion of the other recommendations below, though I know much less about them, and here I’m just relying more on my own intuition than on my knowledge about academia:
Earmark 60% of funding for registered reports (i.e. accepted for publication based on the preregistered design only, not results). For some types of work this isn’t feasible, but for ¾ of the papers I skimmed it’s possible. In one fell swoop, p-hacking and publication bias would be virtually eliminated.
I’d be shocked if 3⁄4 of social science papers could have been preregistered. My guess is that what happens is that researchers collect data, do a bunch of analyses, figure out some hypotheses, and only then write the paper.
Possibly the suggestion here is that all this exploratory work should be done first, then a study should be preregistered, and then the results are reported. My weak guess is that this wouldn’t actually help replicability very much—my understanding is that researchers are often able to replicate their own results, even when others can’t. (Which makes sense! If I try to describe to a CHAI intern an algorithm they should try running, I often have the experience that they do something differently than I was expecting. Ideally in social science results would be robust to small variations, but in practice they aren’t, and I wouldn’t strongly expect preregistration to help, though plausibly it would.)
An NSF/NIH inquisition that makes sure the published studies match the pre-registration (there’s so much “”“”QRP“”“” in this area you wouldn’t believe). The SEC has the power to ban people from the financial industry—let’s extend that model to academia.
My general qualms about preregistration apply here too, but if we assume that we’re going to have a preregistration model, then this seems good to me.
Earmark 10% of funding for replications. When the majority of publications are registered reports, replications will be far less valuable than they are today. However, intelligently targeted replications still need to happen.
This seems good to me (though idk if 10% is the right number, I could see both higher and lower).
Increase sample sizes and lower the significance threshold to .005. This one needs to be targeted: studies of small effects probably need to quadruple their sample sizes in order to get their power to reasonable levels. The median study would only need 2x or so. Lowering alpha is generally preferable to increasing power. “But Alvaro, doesn’t that mean that fewer grants would be funded?” Yes.
Personally, I don’t like the idea of significance thresholds and required sample sizes. I like having quantitative data because it informs my intuitions; I can’t just specify a hard decision rule based on how some quantitative data will play out.
Even if this were implemented, I wouldn’t predict much effect on reproducibility: I expect that the papers we get would have even more contingent effects that only the original researchers can reproduce, via them traversing the garden of forking paths even more. Here’s an example with p-values of .002 and .006.
Andrew Gelman makes a similar case.
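(For what it’s worth, here is a rough sketch of the arithmetic behind sample-size claims like the ones quoted above. The effect sizes d are placeholders I chose for illustration, not numbers from the post, and the formula is the standard normal approximation for a two-sided, two-sample comparison at 80% power.)

```python
# Rough sketch: approximate n per group needed for a two-sided two-sample
# comparison at a given effect size (Cohen's d), alpha, and power, using the
# normal approximation. The d values below are illustrative placeholders.
from math import ceil
from scipy.stats import norm

def n_per_group(d: float, alpha: float, power: float = 0.8) -> int:
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

for d in (0.5, 0.25):            # halving the effect size...
    for alpha in (0.05, 0.005):  # ...and tightening the threshold
        print(f"d = {d}, alpha = {alpha}: n per group ~= {n_per_group(d, alpha)}")
```

Since n scales as 1/d², halving the effect size quadruples the required sample, and dropping alpha from .05 to .005 costs roughly another 1.7x at this power, which is in the ballpark of the multipliers the author quotes.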
Ignore citation counts. Given that citations are unrelated to (easily-predictable) replicability, let alone any subtler quality aspects, their use as an evaluative tool should stop immediately.
I am very on board with citation counts being terrible, but what should be used instead? If you evaluate based on predicted replicability, you incentivize research that says obvious things, e.g. “rain is correlated with wet sidewalks”.
I suspect that you probably could build a better and still cost-efficient evaluation tool, but it’s not obvious how.
Open data, enforced by the NSF/NIH. There are problems with privacy but I would be tempted to go as far as possible with this. Open data helps detect fraud. And let’s have everyone share their code, too—anything that makes replication/reproduction easier is a step in the right direction.
Seems good, though I’d want to first understand what purpose IRBs serve (you’d have to severely roll back IRBs for open data to become a norm).
Financial incentives for universities and journals to police fraud. It’s not easy to structure this well because on the one hand you want to incentivize them to minimize the frauds published, but on the other hand you want to maximize the frauds being caught. Beware Goodhart’s law!
I approve of the goal “minimize fraud”. This recommendation is too vague for me to comment on the strategy.
Why not do away with the journal system altogether? The NSF could run its own centralized, open website; grants would require publication there. Journals are objectively not doing their job as gatekeepers of quality or truth, so what even is a journal? A combination of taxonomy and reputation. The former is better solved by a simple tag system, and the latter is actually misleading. Peer review is unpaid work anyway, it could continue as is. Attach a replication prediction market (with the estimated probability displayed in gargantuan neon-red font right next to the paper title) and you’re golden. Without the crutch of “high ranked journals” maybe we could move to better ways of evaluating scientific output. No more editors refusing to publish replications. You can’t shift the incentives: academics want to publish in “high-impact” journals, and journals want to selectively publish “high-impact” research. So just make it impossible. Plus as a bonus side-effect this would finally sink Elsevier.
This seems to assume that the NSF would be more competent than journals for some reason. I don’t think the problem is with journals per se, I think the problem is with peer review, so if the NSF continues to use peer review as the author suggests, I don’t expect this to fix anything.
The author also suggests using a replication prediction market; as I mentioned above you don’t want to optimize just for replicability. Possibly you could have replication + some method of incentivizing novelty / importance. The author does note this issue elsewhere but just says “it’s a solvable problem”. I am not so optimistic. I feel like similar a priori reasoning could have led to the author saying “reproducibility will be a solvable problem”.
some method of incentivizing novelty / importance
Citation count clearly isn’t a good measure of accuracy, but it’s likely a good measure of importance in a field. So we could run some kind of expected value calculation where the usefulness of a paper is measured by P(result is true) * (# of citations) - P(result is false) * (# of citations) = (# of citations) * [P(result is true) - P(result is false)].
Edit: where the probabilities are approximated by replication markets. I think this function gives us what we actually want, so optimizing institutions to maximize it seems like a good idea.
Edit: This doesn’t actually represent what we want, since journals could just push everyone to cite the same well-replicated study to maximize its citation count. It’s still a reasonable approximation: a good measurement of what we want, but not a great goal, so we shouldn’t optimize institutions to maximize it.
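To make the proposed score concrete, here is a minimal sketch, with made-up citation counts and replication-market probabilities:

```python
# Minimal sketch of the score above: usefulness = citations * [P(true) - P(false)],
# with P(true) approximated by a replication market. All numbers are made up.

def usefulness(citations: int, p_replicate: float) -> float:
    return citations * (p_replicate - (1 - p_replicate))

papers = {
    "widely cited, unlikely to replicate": (500, 0.30),
    "moderately cited, likely to replicate": (80, 0.90),
    "obvious result, certain to replicate": (5, 0.99),
}
for name, (cites, p) in sorted(papers.items(), key=lambda kv: -usefulness(*kv[1])):
    print(f"{usefulness(cites, p):>8.1f}  {name}")
```

Under these made-up numbers, the heavily cited but probably-false result scores negative, and the obvious-but-true result scores low.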
some method of incentivizing novelty / importance
LessWrong upvote count.
Slightly more seriously: Propagation through the academic segments of centerless curation networks. The author might be anticipating continued advances in social media technology, conventions of use, and uptake. Uptake and improvements in conventions of use, at least, seem to be visibly occurring. Advances in technology seem less assured, but I will do what I can.
This problem seems to me to have the flavor of Moloch and/or inadequate equilibria. Your criticisms have two parts: the pre-edit part, based on your personal experience, in which you state why the personal actions the author recommends are not actually possible because of the inadequate equilibria (i.e. because of academic incentives); and the criticism of the author’s proposed non-personal actions, which you say is based just on intuition.
I think the author would be unsurprised that the personal actions are not reasonable. They have already said this problem requires government intervention, basically to resolve the incentive problem. But maybe at the margin you can take some of the actions that the author refers to in the personal actions. If a paper is on the cusp of “needing to be cited” but you think it won’t replicate, take that into account! Or if reviewing a paper, at least take into account the probability of replication in your decision.
I think you are maybe reading the author’s claim to “stop assuming good faith” too literally. In the subsequent sentence they are basically refining that to the idea that most people are acting in good faith, but are not competent enough for good faith to be a useful assumption, which seems reasonable to me.
If a paper is on the cusp of “needing to be cited” but you think it won’t replicate, take that into account! Or if reviewing a paper, at least take into account the probability of replication in your decision.
Why do you think people don’t already do this?
In general, if you want to make a recommendation on the margin, you have to talk about what the current margin is.
I think you are maybe reading the author’s claim to “stop assuming good faith” too literally. In the subsequent sentence they are basically refining that to the idea that most people are acting in good faith, but are not competent enough for good faith to be a useful assumption
Huh? The sentence I see is:
I’m not saying every academic interaction should be hostile and adversarial, but the good guys are behaving like dodos right now and the predators are running wild.
“the predators are running wild” does not mean “most people are acting in good faith, but are not competent enough for good faith to be a useful assumption”.
They have to do it to some extent, otherwise replicability would be literally uncorrelated with publishability, which probably isn’t the case. But because of the outcomes, we can see that people aren’t doing it enough at the margin, so encouraging people to move as far in that direction as they can seems like a useful reminder.
There are two models here. One is that everyone is a homo economicus when citing papers, so no amount of persuasion is going to adjust people’s citations: they are already making the optimal tradeoff between their personal interests and society’s interests. The other is that people are subject to biases and blind spots, or just haven’t even really considered whether they have the OPTION of not citing something that is questionable, in which case reminding them is a useful affordance.
I’m trying to be charitable to the author here, to recover useful advice. They didn’t say things in the way I’m saying them. But they may have been pointing in a useful direction, and I’m trying to steelman that.
“the predators are running wild” does not mean “most people are acting in good faith, but are not competent enough for good faith to be a useful assumption”.
Even upon careful rereading of that sentence, I disagree. But to parse this out based on this little sentence is too pointless for me. Like I said, I’m trying to focus on finding useful substance, not nitpicking the author, or you!
Which specific parts did you have in mind?