This also reminds me of Tal Yarkoni’s paper on what he calls the generalizability crisis in psychology. The crisis is that psychological experiments measure something very specific, which then gets treated as though it corresponds to something much more general. Psychologists assume the specific thing measures the general thing, and Yarkoni argues that they’re not measuring what they think they’re measuring.
One of his examples is the study of verbal overshadowing. This is a claimed phenomenon where, if you have to verbally describe what a face looks like, you will be worse at actually recognizing that face later on. The hypothesis is that producing the verbal description causes you to remember the verbal description, while remembering the actual face less well—but the verbal description inevitably contains less detail. This has been generalized to the broader claim that producing verbal descriptions of experiences impairs our later recollection of them.
Yarkoni discusses an effort to replicate one of the original experiments:
Alogna and colleagues (2014) conducted a large-scale “registered replication report” (RRR; Simons, Holcombe, & Spellman, 2014) involving 31 sites and over 2,000 participants. The study sought to replicate an influential experiment by Schooler and Engstler-Schooler (1990) in which the original authors showed that participants who were asked to verbally describe the appearance of a perpetrator caught committing a crime on video showed poorer recognition of the perpetrator following a delay than did participants assigned to a control task (naming as many countries and capitals as they could). Schooler & Engstler-Schooler (1990) dubbed this the verbal overshadowing effect. In both the original and replication experiments, only a single video, containing a single perpetrator, was presented at encoding, and only a single set of foil items was used at test. Alogna et al. successfully replicated the original result in one of two tested conditions, and concluded that their findings revealed “a robust verbal overshadowing effect” in that condition.
Let us assume for the sake of argument that there is a genuine and robust causal relationship between the manipulation and outcome employed in the Alogna et al study. I submit that there would still be essentially no support for the authors’ assertion that they found a “robust” verbal overshadowing effect, because the experimental design and statistical model used in the study simply cannot support such a generalization. The strict conclusion we are entitled to draw, given the limitations of the experimental design inherited from Schooler and Engstler-Schooler (1990), is that there is at least one particular video containing one particular face that, when followed by one particular lineup of faces, is more difficult for participants to identify if they previously verbally described the appearance of the target face than if they were asked to name countries and capitals. [...]
On any reasonable interpretation of the construct of verbal overshadowing, the corresponding universe of intended generalization should clearly also include most of the operationalizations that would result from randomly sampling various combinations of these factors (e.g., one would expect it to still count as verbal overshadowing if Alogna et al. had used live actors to enact the crime scene, instead of showing a video). Once we accept this assumption, however, the critical question researchers should immediately ask themselves is: are there other psychological processes besides verbal overshadowing that could plausibly be influenced by random variation in any of these uninteresting factors, independently of the hypothesized psychological processes of interest? A moment or two of consideration should suffice to convince one that the answer is a resounding yes. It is not hard to think of dozens of explanations unrelated to verbal overshadowing that could explain the causal effect of a given manipulation on a given outcome in any single operationalization.
This verbal overshadowing example is by no means unusual. The same concerns apply equally to the broader psychology literature containing tens or hundreds of thousands of studies that routinely adopt similar practices. In most of psychology, it is standard operating procedure for researchers employing just one experimental task, between-subject manipulation, experimenters, testing room, research site, etc., to behave as though an extremely narrow operationalization is an acceptable proxy for a much broader universe of admissible observations. It is instructive—and somewhat fascinating from a sociological perspective—to observe that while no psychometrician worth their salt would ever recommend a default strategy of measuring complex psychological constructs using a single unvalidated item, the majority of psychology studies do precisely that with respect to multiple key design factors. The modal approach is to stop at a perfunctory demonstration of face validity—that is, to conclude that if a particular operationalization seems like it has something to do with the construct of interest, then it is an acceptable stand-in for that construct. Any measurement-level findings are then uncritically generalized to the construct level, leading researchers to conclude that they’ve learned something useful about broader phenomena like verbal overshadowing, working memory, ego depletion, etc., when in fact such sweeping generalizations typically obtain little support from the reported empirical studies.
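The design issue is easy to see with a quick simulation. If the true effect varies from one video/face/lineup combination to the next, a single-stimulus experiment estimates the effect for that one stimulus, not the construct-level effect. Here is a minimal sketch in Python (the average effect and the amount of stimulus-to-stimulus variation are numbers I made up purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

mean_effect = 0.10    # assumed average effect of describing the face (in SD units)
stimulus_sd = 0.30    # assumed variation of that effect across stimuli
n_per_group = 100     # participants per condition in a single-stimulus design

# True effect for the one particular video/face/lineup used in the study
effect_this_stimulus = rng.normal(mean_effect, stimulus_sd)

control = rng.normal(0.0, 1.0, n_per_group)
verbal = rng.normal(-effect_this_stimulus, 1.0, n_per_group)

t, p = stats.ttest_ind(control, verbal)
print(f"effect for this stimulus: {effect_this_stimulus:.2f} SD, p = {p:.3f}")

# A significant result here supports a claim about this particular stimulus;
# another draw of effect_this_stimulus could easily be near zero or negative,
# so by itself it says little about "verbal overshadowing" in general.

As I recall, Yarkoni’s suggested remedy is to sample many stimuli, tasks, sites, and so on, and model them as random effects, so that the uncertainty from this kind of variation shows up in the construct-level estimate.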
Interesting. This resonates with, and yet maybe stands in tension with, complaints that social psychology fails to do enough exact replications. I remember a criticism of social psychology being that researchers would test a generalization like priming in too many different ways, and people were suspicious about whether any of the effects would stand up to replication.
I’d love to see a description of what this field should be doing. There’s a sweet spot between putting too much weight on one experimental approach and doing too little exact replication. How does a field identify that sweet spot, and how can it coordinate to carry out experiments in the sweet spot?
Yeah, I was thinking this same thing. I feel like in the social sciences I’m more concerned about researchers testing for too many things and increasing the probability of false positives than about testing too few things and maybe not fully understanding a result.
I feel like it really comes down to how much power a study has. When you have tons of data, like a big tech company might, or the results are really straightforward, like in some of the hard sciences, I think this is a great approach. When the effects of a treatment are subtler and sample size is more limited, as is often the case in the social sciences, I would be wary of recommending that people test everything they can think of.
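For what it’s worth, here is a quick back-of-the-envelope simulation of that trade-off. With a subtle effect and a small sample, power for the one real effect is low, while testing lots of extra outcomes makes some spurious “finding” quite likely. All the numbers (effect size, sample size, number of extra outcomes) are invented just to make the point:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group = 30       # limited sample size
true_effect = 0.3      # subtle true effect (in SD units) on one real outcome
n_null_outcomes = 19   # extra outcomes with no true effect
n_sims = 2000

real_hits = 0
any_false_positive = 0
for _ in range(n_sims):
    # the one outcome with a real (but subtle) effect
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(true_effect, 1, n_per_group)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        real_hits += 1
    # many null outcomes tested on the same small sample
    null_ps = [stats.ttest_ind(rng.normal(0, 1, n_per_group),
                               rng.normal(0, 1, n_per_group)).pvalue
               for _ in range(n_null_outcomes)]
    if min(null_ps) < 0.05:
        any_false_positive += 1

print(f"power for the real effect: {real_hits / n_sims:.2f}")
print(f"chance of at least one false positive among the extras: {any_false_positive / n_sims:.2f}")

With these made-up numbers you get power around 0.2 for the real effect and roughly a 60% chance of at least one spurious result among the extra tests, which is why I’d want the extra testing paired with preregistration or multiplicity corrections.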
I feel like in the social sciences I’m more concerned about researchers testing for too many things and increasing the probability of false positives than about testing too few things and maybe not fully understanding a result.
I’d say that’s more a problem of selective reporting.
Great post!