The Dunning-Kruger of disproving Dunning-Kruger

In an online discussion elsewhere today, someone linked this article, which in turn linked the paper Gignac & Zajenkowski, The Dunning-Kruger effect is (mostly) a statistical artefact: Valid approaches to testing the hypothesis with individual differences data (PDF) (ironically hosted on @gwern’s site).
And I just don’t understand what they were thinking.
Let’s look at their methodology real quick in section 2.2 (emphasis added):
2.2.1. Subjectively assessed intelligence Participants assessed their own intelligence on a scale ranging from 1 to 25 (see Zajenkowski, Stolarski, Maciantowicz, Malesza, & Witowska, 2016). Five groups of five columns were labelled as very low, low, average, high or very high, respectively (see Fig. S1). Participants’ SAIQ was indexed with the marked column counting from the first to the left; thus, the scores ranged from 1 to 25. Prior to providing a response to the scale, the following instruction was presented: “People differ with respect to their intelligence and can have a low, average or high level. Using the following scale, please indicate where you can be placed compared to other people. Please mark an X in the appropriate box corresponding to your level of intelligence.” In order to place the 25-point scale SAIQ scores onto a scale more comparable to a conventional IQ score (i.e., M = 100; SD = 15), we transformed the scores such that values of 1, 2, 3, 4, 5… 21, 22, 23, 24, 25 were recoded to 40, 45, 50, 55, 60… 140, 145, 150, 155, 160. As the transformation was entirely linear, the results derived from the raw scale SAI scores and the recoded scale SAI scores were the same.
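The recode itself is trivial; here’s a minimal sketch (the function name is mine):

```python
# Minimal sketch of the paper's recode: box 1..25 -> "SAIQ" 40..160 in steps of 5.
def saiq_from_box(box: int) -> int:
    """Linearly recode a 1-25 self-assessment box to the paper's SAIQ scale."""
    return 40 + 5 * (box - 1)  # 1 -> 40, 13 -> 100, 25 -> 160

print(saiq_from_box(1), saiq_from_box(13), saiq_from_box(25))  # 40 100 160
```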
Any alarm bells yet? Let’s look at how they measured actual results:
2.2.2. Objectively assessed intelligence Participants completed the Advanced Progressive Matrices (APM; Raven, Court, & Raven, 1994). The APM is a non-verbal intelligence test which consists of items that include a matrix of figural patterns with a missing piece. The goal is to discover the rules that govern the matrix and to apply them to the response options. The APM is considered to be less affected by culture and/or education (Raven et al., 1994). It is known as good, but not perfect, indicator of general intellectual functioning (Carroll, 1993; Gignac, 2015). We used the age-based norms published in Raven et al. (1994, p. 55) to convert the raw APM scores into percentile scores. We then converted the percentile scores into z-scores with the IDF.NORMAL function in SPSS. Then, we converted the z-scores into IQ scores by multiplying them by 15 and adding 100. Although the norms were relatively old, we considered them essentially valid, given evidence that the Flynn effect had slowed down considerably by 1980 to 1990 and may have even reversed to a small degree since the early 1990s (Woodley of Menie et al., 2018).
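For reference, here’s that same pipeline sketched in Python, with scipy.stats.norm.ppf standing in for SPSS’s IDF.NORMAL (the function name is mine):

```python
from scipy.stats import norm

def iq_from_percentile(pct: float) -> float:
    """Percentile (0-100) -> z-score -> IQ, per the paper's description."""
    z = norm.ppf(pct / 100)  # Python analogue of SPSS's IDF.NORMAL
    return 100 + 15 * z

print(round(iq_from_percentile(50)))  # 100
print(round(iq_from_percentile(99)))  # 135
```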
An example of the self-assessment question was included in the paper’s supplemental materials, which I couldn’t access behind the paywall, but the paper they reference does include a great example of the scoring sheet in its appendix, which I’m including here:
So we have what appears to be a linear self-assessment scale broken into 25 segments. If I were a participant filling this out, knowing I’ve consistently scored around the 96th-98th percentile on standardized tests, I’d have selected the top segment, which looks like it corresponds to a self-assessment of being in the top 4% of test takers.
Behind the scenes, they would then have taken that assessment and scaled it to an IQ score of 160, at the 99.99th percentile (no, I don’t think that highly of myself). Even if I had been conservative with my self-assessment and gone with what looks like the 92nd-96th percentile box, I would have been assigned an expected score of 155, at roughly the 99.98th percentile.
Now let’s say I took the test, exceeded my expectation of landing in the 96th-98th percentile, and ended up at the 99th percentile according to the age-based norms in Raven et al. Where would my actual score have landed? They would have taken the 99th percentile, converted it to a z-score, multiplied that by 15, and added 100: somewhere around 135.
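These numbers are easy to verify under the normal-curve conversions described in section 2.2:

```python
from scipy.stats import norm

# Top box (25) recodes to SAIQ 160; where does 160 sit on a normal IQ curve?
print(100 * norm.cdf((160 - 100) / 15))  # ~99.997th percentile

# The next box down (24) recodes to 155:
print(100 * norm.cdf((155 - 100) / 15))  # ~99.99th percentile

# Meanwhile, an objective result at the 99th percentile becomes:
print(100 + 15 * norm.ppf(0.99))         # ~134.9, i.e. roughly 135
```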
And guess what? Those are exactly the results (SAIQ of 160 and objective IQ of 135) they got at their top end, described at the start of section 3:
Consequently, parametric statistical analyses were considered appropriate. The SAIQ scores (range: 85/160; inter-quartile range: 115/135) and the objective IQ scores (range: 65/135; inter-quartile range: 96/109) were also representative of a wide spectrum of ability, suggesting the sample was not disproportionately sampled from one end of the distribution in the population. The SAIQ mean (M = 123.76; SD = 14.19) was statistically significantly larger than the objective IQ mean (M = 101.70; SD = 11.63), t(928) = 43.02, p < .001, Cohen’s d = 1.71. Thus, on average, people estimated their IQ to be higher than that verified by their IQ measured objectively, as hypothesized.
“As hypothesized” indeed.
So at the low end of their scoring: a mark in box 6 of 25 (the first box of the “low” section) reads, on a linear basis, as roughly the 20th to 24th percentile, yet it gets recoded to an IQ of 65. And their actual low objective score of 65 corresponds to the 1st percentile.
Let’s take a look at their quadrant graph:
And now let’s convert these back into linear self-assessment and percentile results (a short script reproducing the conversions follows the list):
Low quadrant: the subjective ~120 becomes a 17 out of 25, i.e. the 64th to 68th percentile on a linear reading. The actual result of ~84 sits at around the 14th percentile.
Medium-low quadrant: the subjective ~125 becomes an 18 out of 25, around the 68th to 72nd percentile. The actual result of ~95 becomes the 37th percentile.
Medium-high quadrant: the subjective ~125 likewise becomes an 18/25, i.e. the 68th to 72nd percentile on a linear self-assessment scale. The actual result of ~105 is at the 63rd percentile.
And the high quadrant: the subjective ~130 becomes a 19/25, which would be the 72nd to 76th percentile on a linear scale. The actual result of ~115 is at the 84th percentile.
(Note: inverting the paper’s recode means box = (SAIQ - 40) / 5 + 1, so 120 maps to box 17, 125 to box 18, and 130 to box 19.)
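Here’s a short script reproducing those four conversions (the quadrant means are eyeballed from the figure, and the helper names are mine):

```python
from scipy.stats import norm

def box_from_saiq(saiq: float) -> float:
    """Invert the paper's recode: SAIQ 40-160 back to the 1-25 box."""
    return (saiq - 40) / 5 + 1

def percentile_from_iq(iq: float) -> float:
    return 100 * norm.cdf((iq - 100) / 15)

# (quadrant, mean SAIQ, mean objective IQ), eyeballed from the paper's figure
quadrants = [("low", 120, 84), ("medium-low", 125, 95),
             ("medium-high", 125, 105), ("high", 130, 115)]

for name, saiq, iq in quadrants:
    box = box_from_saiq(saiq)
    # a respondent reading the scale linearly would place box k in the
    # 4*(k-1) to 4*k percentile range
    print(f"{name}: box {box:.0f}/25 (linear read: {4 * (box - 1):.0f}-"
          f"{4 * box:.0f}th pct) vs actual {percentile_from_iq(iq):.0f}th pct")
```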
Suddenly, after normalizing the scores back onto a linear scale of relative percentiles, the classic Dunning-Kruger pattern re-emerges: a wide gap at the low quadrant and a reversed aggregate self-assessment at the high end.
I absolutely appreciate the work that has been done to make the case that the original Dunning-Kruger effect shrinks depending on the statistical modeling, and that a better-than-average effect plus regression toward the mean could be what’s really going on.
But if you’re going to write a paper making that case, it might be a good idea not to mix scoring methods in a way that introduces a better-than-average effect of your own into the subjective assessments at the top quadrant. (Introducing an additional top-weighted better-than-average effect rather undermines the whole ‘homoscedastic’ counter-result.)
Also, if you are measuring something that’s been replicated many times and your data fails to replicate it for one of the groups, even in the very graph where it’s supposed to appear, it’s probably worth double-checking before rushing to press.
This was completely unnecessary. All they had to do was keep the self-assessment scores on their raw 1-25 point basis and divide the percentiles from the age-based Raven et al. tables by 4 to put them on the same scale. They’d be starting with a capped linear scale and ending with a capped linear scale, and the two would have matched up very cleanly. There was no need to convert both linear distributions to a normal distribution for comparison, and the choices made in doing so seem (at least to my eyes) to undermine the entire point of the paper.
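A minimal sketch of that alternative (the function name is mine):

```python
# Sketch of the simpler comparison: leave self-assessments on their raw
# 1-25 scale and map Raven percentiles onto the same capped linear scale.
def objective_box(raven_percentile: float) -> float:
    return raven_percentile / 4  # 0-100 percentile -> 0-25

print(objective_box(50))  # 12.5 -- sits right next to the middle box (13)
print(objective_box(99))  # 24.75 -- comparable to a top-box (25) self-rating
```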
Anyways, in a quick search I didn’t see this criticism pop up so I figured I’d rant about it a bit.
And in the spirit of this site, I more than welcome anyone pointing out where I may be wrong in seeing this as a poor design choice (I love few things more than being proven wrong)!
But given my past experience designing market research, I had the Looney Tunes eyes-out-of-body experience when I dug into this. I found myself looking at the 25-point scale and its associated copy, knowing that what would have seemed to the average respondent to be a linear scale, with no visual indicators or textual clues suggesting otherwise, was secretly being converted to a normal distribution curve, such that the entire “Very High” segment corresponded to only the top ~0.5% of the population rather than the top 20%.
Rant over.