Say you get a large group of people and split them into three groups. The first group gets this question:
What is the length of the Mississippi river?
The second group gets these questions:
Is the Mississippi river greater or less than 70 miles long?
What is the length of the Mississippi river?
The third group gets these:
Is the Mississippi river greater or less than 2000 miles long?
What is the length of the Mississippi river?
When Jacowitz and Kahneman tried this in 1995, they found that the groups had median estimates of 800 miles, 300 miles, and 1500 miles, respectively.
In theory at least, the initial greater / less questions provide no information, and so shouldn’t change the final estimates people give. But they do change them, a lot.
This effect is known as anchoring. It’s widely discussed, but the replication crisis has made me cautious about trusting anything that hasn’t been replicated. So I tried to find anchoring studies with replications.
Anchoring for estimation
A seminal paper from 1995 is Measures of Anchoring in Estimation Tasks by Jacowitz and Kahneman. They created a list of 15 quantities to estimate and recruited 156 U.C. Berkeley students from an introductory psychology course. There were three questionnaires: one, used for calibration, simply asked for an estimate of every quantity. The other two first asked whether each quantity was more or less than a certain anchor and then asked for an estimate of the quantity.
The results were that anchoring had a massive effect:
Here, the left column shows the median among people who weren’t given an anchor, the next two columns show what numbers were used as low/high anchors, and the last two columns show the median responses among people given low/high anchors.
But does it replicate?
The Many Labs project is an effort by a team of psychologists around the world. In their first paper (Klein et al., 2014), they tried to replicate 13 different well-known effects, including anchoring. They had a large international sample of 5,000 people and gave them four estimation problems.
There was one small difference from the original study: rather than being asked whether (say) the population of Chicago was greater or less than the anchor, participants were simply told the true relationship (e.g., that the population is more than 200k, or less than 5 million).
And what are the results? You have to dig to find the raw data, but I was able to put it together from https://osf.io/tjx3f/ and https://osf.io/y36m8/.
Quantity | Low anchor | High anchor | Mean with low anchor | Mean with high anchor |
---|---|---|---|---|
Babies born per day in US | >100 | <50k | 3.2k | 26.7k |
Population of Chicago | >200k | <5m | 1.03m | 3.00m |
Height of Mt. Everest | >2k ft | <45.5k ft | 11.8k | 34.5k |
San Francisco to New York City distance | >1.5k miles | <6k miles | 2.85k | 3.99k |
The effect is real. It replicates.
Effect size mystery
There is a mystery, though. In the Many Labs paper, almost all the replications find smaller effect sizes than the original papers, but anchoring stands out as having a much larger effect size:
In the additional material on anchoring, they speculate about what might be causing the effect to be larger. This is all pretty strange, because here are the raw numbers in the original paper compared to those in the replication:
Quantity | Estimate with low anchor | Estimate with high anchor |
---|---|---|
Babies born per day US (original) | 1k | 40k |
Babies born per day US (replication) | 3.2k | 26.7k |
Population of Chicago (original) | 600k | 5.05m |
Population of Chicago (replication) | 1.03m | 3m |
Height of Mt. Everest (original) | 8k | 42.55k |
Height of Mt. Everest (replication) | 11.8k | 34.5k |
SF to NYC distance (original) | 2.6k | 4k |
SF to NYC distance (replication) | 2.85k | 3.99k |
In every case, the effect is smaller, not larger.
What’s going on? After puzzling over this, I think it comes down to the definition of an “effect size”. It’s totally possible to have a smaller effect (in the normal English-language sense) whilst having a larger effect size (in the statistical sense).
One possibility is that it’s an artifact of how the effect size is calculated. They report, “The Anchoring original effect size is a mean point-biserial correlation computed across 15 different questions in a test-retest design, whereas the present replication adopted a between-subjects design with random assignments.”
Another possibility is that it comes down to the definition. Remember, the effect size is a technical term: The difference in means divided by the standard deviation of the full population. The above table shows that the difference in means is smaller in the replication. So why is the effect size larger? Well, it could be that even though the difference of means is smaller, the standard deviation in the replication is much smaller, meaning the overall ratio is larger.
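To make that second possibility concrete, here’s a toy calculation for the babies-per-day question. The means come from the tables above, but the standard deviations are numbers I invented purely to show how the ratio can flip:

```python
# Cohen's-d-style effect size: difference in means divided by a standard
# deviation. The means are from the tables above; the standard deviations
# are made up for illustration only.

def effect_size(mean_low, mean_high, sd):
    return (mean_high - mean_low) / sd

# Hypothetical "original": a big gap between groups, but very noisy answers.
print(effect_size(1_000, 40_000, sd=60_000))   # -> 0.65

# Hypothetical "replication": a smaller gap, but much less spread.
print(effect_size(3_200, 26_700, sd=20_000))   # -> 1.175
```

The raw difference shrinks, but the standardized effect size grows.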
I can’t confirm which of these explanations is right. This document seems to include standard deviations for the replication, but the original study doesn’t seem to list them, so I’m not sure.
In any case, I don’t think it changes the takeaways:
Anchoring for estimation tasks is real, and robust, and has large effects.
Be careful when looking at summary statistics. The further something is from raw data, the easier it is to misinterpret.
Bayesian view
Anchoring is real, but someone determined to see the entire world in Bayesian terms might chafe at the term “bias”. This ostensibly-not-me person might argue:
“You say anchoring is a ‘cognitive bias’ because being asked ‘Is the Mississippi river greater or less than 70 miles long?’ is a question, and thus contains no information, and thus shouldn’t change anything. But are you sure about that?”
Let me give an analogy. In a recent homework assignment, I asked students to find the minima of a bunch of mathematical functions, and at the end gave this clarification:
“In some cases, the argmin might not be unique. If so, you can just give any minimum.”
Some students asked me, “I’m finding that all the minima are unique. What am I doing wrong?” I tried to wriggle out of this with, “Well, I never promised there had to be some function without a unique minimum, so that’s not necessarily a problem…” but my students weren’t fooled. My supposedly zero-information hint did in fact contain information, and they knew it.
Or say you’re abducted by aliens. Upon waking up in their ship, they say, “Quick question, Earthling, do you think we first built the Hivemind Drive on our homeworld more or less than 40 million years ago?” After you answer, they then ask, “OK, now please estimate the true age of the Drive.” Being aware of anchoring, wouldn’t you still take the 40 million number into account?
So sure, in this experiment as performed, the anchors made estimates worse. But that only happened because the anchors were extremely loose. If the experiment were repeated with anchors that were closer to the true values, they would surely make the estimates better. It’s not at all clear that being “biased” is a problem in general.
Arguably, what’s happening is simply that people had unrealistic expectations about how the experimenters would choose the anchor values. And can we blame them? The true height of Mt. Everest is 29k feet. Isn’t it a little perverse to ask if it is greater than 2k feet? In Bayesian terms, anchors that misleading may simply have been very improbable under people’s priors about how anchors would be selected.
Here’s another clue that people might be doing Bayesian updating: The anchoring effect is larger for quantities people know less about. Anchoring has a huge effect when estimating the number of babies born per day (which most people have no idea about) but a small effect on the distance from San Francisco to New York (where people have some information). That’s exactly what Bayesian inference would predict: If your prior is more concentrated, the likelihood will have less influence on your posterior.
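Here’s a minimal sketch of that intuition, assuming (purely for illustration) a Gaussian prior and treating the anchor as a noisy signal; every number below is made up:

```python
# Minimal Bayesian sketch: the posterior mean is a precision-weighted
# average of the prior mean and the anchor "signal". A more concentrated
# prior gives the anchor less pull. All numbers are illustrative.

def posterior_mean(prior_mean, prior_sd, anchor, anchor_sd):
    w_prior = 1 / prior_sd**2
    w_anchor = 1 / anchor_sd**2
    return (w_prior * prior_mean + w_anchor * anchor) / (w_prior + w_anchor)

# Vague prior (babies born per day): the anchor drags the estimate a lot.
print(posterior_mean(prior_mean=10_000, prior_sd=20_000,
                     anchor=50_000, anchor_sd=10_000))   # -> 42000.0

# Tight prior (SF-to-NYC distance): the same anchor barely moves it.
print(posterior_mean(prior_mean=3_000, prior_sd=500,
                     anchor=6_000, anchor_sd=10_000))    # -> ~3007.5
```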
I’m sure this has been suggested many times in the past, but to my surprise, I wasn’t able to find any specific reference. The closest thing I came across was a much more complex argument that anchoring stems from bounded rationality or limited computational resources.
Irrelevant anchors
Some people have suggested that even irrelevant numbers could act as anchors and change what numbers people choose. The most well-known example is a 2003 paper by Ariely, Loewenstein, and Prelec. They took 55 MBA students and showed them descriptions of six different products—a cordless keyboard, a bottle of wine, etc. Students were asked two questions:
To get the product, would you pay X dollars? (Where X was the last two digits of the student’s social security number.)
What is the maximum amount you would pay for the product?
Ideally, to make sure people were honest, you’d have them actually pay for the products. For whatever reason, that wasn’t done. Instead, there was a more complex mechanism in which, depending on chance and the choices people made, they might end up getting the product. This mechanism created an incentive to be honest about how much they valued things.
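For intuition, here’s a sketch of one standard incentive-compatible design, a BDM-style “random price” mechanism. I’m not claiming this is exactly the procedure Ariely et al. used, but it shows why such mechanisms reward honest valuations:

```python
import random

# BDM-style mechanism (illustrative, not necessarily the paper's exact
# procedure): a price is drawn at random, and you get the product only if
# your stated maximum is at least that price, paying the random price
# rather than your stated maximum.

def bdm_round(stated_max, true_value, price_range=(0, 100)):
    """Return the participant's surplus from one round."""
    price = random.uniform(*price_range)
    if stated_max >= price:
        return true_value - price  # bought at the random price
    return 0.0                     # no purchase

# Reporting anything other than your true value can only hurt on average:
# overstate it and you sometimes pay more than the product is worth to you;
# understate it and you sometimes miss a purchase you would have wanted.
```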
It’s unclear exactly how the experimenters got the social security numbers. Most likely subjects just wrote them down before the first question.
This experiment found a strong relationship between social security numbers and the numbers students gave for the second question.
The effect was huge and statistically significant for every product.
But does it replicate?
Some replications look good. In 2010, Bergman et al. tried to replicate this study using a very similar design. They are very clear about how their anchoring procedure works: students first write down the last two digits of their social security number (in Swedish krona) and then state whether they would pay that much. The results were fairly similar to the original study’s.
In 2017, Li et al. tried to replicate this result by having college students at Nankai University (near Beijing) write the last two digits of their phone number as an anchor, so anchors ranged from ¥0 to ¥99 (around $15). The purpose of this experiment was to test if the strength of the anchoring effect could be modulated by tDCS (applying a current to the head via electrodes, yes really). But ignoring that test, here is what they found for the 30 people in the control group—or, more precisely, the 30 people who had just undergone a sham version of tDCS with no actual stimulation:
Although the p-values are all above 0.05, this still looks like a positive relationship for chocolate (Ferrero Rocher) and maybe the novel (One Hundred Years of Solitude) but not for the wine. That could be related to the fact that the true value of the wine (¥150) was more than the largest possible anchor value of ¥99.
But seriously, does it replicate?
Other replications don’t look so good. In 2012, Fudenberg, Levine, and Maniadis ran one using 79 students from UCLA. One difference was that, rather than using social security numbers, the replication rolled two dice (in front of the participants) to create anchor values. A second difference was that participants stated prices at which they would sell the product, rather than prices at which they would buy it. (With another complex mechanism.) They found nothing.
Another experiment was done by Ioannidis and Offerman in 2020. This had three phases. First, people were “given” a bottle of wine (on their screen) and were asked if they would sell the wine for some value between €0 and €9.9, chosen randomly by rolling two dice. Second, people participated in an auction for the wine with other participants. Finally, they were asked what value they would accept in lieu of the wine.
Their goal was to test if the auction phase would reduce the anchoring effect. Their results were that—umm—there was no anchoring effect, and so the market didn’t have anything to reduce.
Here’s how I score the replications:
Bergman et al. (2010): successful
Li et al. (2017): half-successful
Fudenberg et al. (2012): failed
Ioannidis and Offerman (2020): failed
What to make of this? One theory is that it depends on if people were choosing values to “buy” or “sell” the good at. (Or rather, mechanistic proxies for buying and selling.)
Another theory, suggested by Ioannidis and Offerman, is that it depends on how the anchor is chosen. The original study and the successful replication both used social security numbers. The half-successful replication used phone numbers instead. Both of the failed replications used dice, clearly demonstrated in front of the participants to be random.
Why would this matter? Well, Chapman and Johnson (1999; Experiment 3) did a related experiment with anchors and social security numbers. They found a correlation, but then also took the brave step of asking the participants what they thought was going on. Around 34% of participants said they thought the anchor values were informative and that the experimenters wanted them to be influenced by the anchor values! So perhaps anchoring on irrelevant information has an effect because participants, for some reason, think it is relevant.
I’m not sure if I believe this. For one thing, how could social security numbers be informative? It doesn’t make sense. For another, 34% of people doesn’t seem large enough to produce the reported effects. And third, when Chapman and Johnson broke things down, they found that among people who said anchor values were informative, the correlation was four times weaker than among those who didn’t say that!
Takeaways
So do people really anchor on irrelevant information?
When I first came across Fudenberg et al.’s failed replication, I assumed the effect wasn’t real. But if that were true, how can we explain one and a half successful replications? It’s very puzzling.
If the effect is real, it’s fragile. Either it depends on people mistakenly thinking the irrelevant information is relevant, or on people wanting to please the experimenter, or on whether people are “buying” or “selling” something, or on some other unclear detail. Very small changes to how the experiment is done (like using dice instead of social security numbers) seem to make the effect vanish.
As things currently stand, I don’t think there’s any real-world scenario in which we can be confident that people will anchor on irrelevant information (other than, perhaps, exactly repeating Ariely et al.’s experiment). Hopefully future work can clarify this.
I was once in a forecasting workshop with ~11 other people where we were given a sheet of paper that asked us to estimate the probability of a future event. We all knew about Fermi modeling, and we were allowed to use the internet to look up relevant facts, so I really felt I was giving my very best guess. But after we turned the papers in, the people running the workshop revealed that there was a number written in the top right corner of each sheet of paper (like where a page number would usually be), and half of us had gotten one number (relatively big, like 50 or something), and the other half a different number (smaller, like 2). I don’t remember the specifics but the effect size (using the term colloquially) was ridiculously huge, where the average of the first group’s guesses was something like an order of magnitude more than the average of the second group’s.
I have been completely baffled by this experience ever since and have no idea what to do with it. When I told my husband about it he was like, “wtf, all 12 of you should be fired.” Maybe he’s right? It just seems impossible??? And yet, this is a workshop they’d run many times, and clearly an outcome they were expecting, and I don’t have reason to believe they lied.
Fuck, I’m so confused. Can anyone make sense of this experience?
A 2008 paper found anchoring effects from these kinds of “incidental environmental anchors”, but then a replication of one of its studies with a much larger sample size found no effect (see “9. Influence of incidental anchors on judgment (Critcher & Gilovich, 2008, Study 2)”).
So that at least says something about why the people running your forecasting workshop thought this would have an effect, and provides some entry points into the published research which someone could look into in more depth, but it still leaves it surprising/confusing that there was such a large difference.
I suggest renaming the “Incidental anchoring” section to something else, such as “irrelevant anchors” or “transparently random anchors”, since the term “incidental anchoring” is used to refer to something else.
Also, one of the classic 1970s Kahneman & Tversky anchoring studies used an (apparently) random wheel of fortune to generate a transparently irrelevant anchor value—the one on African countries in the UN. When this came up on LW previously, it turned out that Andrew Gelman used it as an in-class demo and (said that he) generally found effects in the predicted direction (though instead of spinning a viscerally random wheel, they just handed each student a piece of paper that included the sentences “We chose (by computer) a random number between 0 and 100. The number selected and assigned to you is X = ___”).
I made the change to “irrelevant anchors”, thanks.
What’s funny is that these kinds of small changes (a piece of paper vs. spinning a wheel) might be responsible for whether the effects appear or not, at least if you take the published research at face value.