First, you might be interested in tests like the Wonderlic, which are not transformed to a normal variable, and instead use raw scores. [As a side note, the original IQ test was not normalized; it was a quotient! And so the name continues to be a bit wrong to this day.]
Second, when we have variables like height, there are obvious units to use (centimeters). Looking at raw height distributions makes sense. When we discover that the raw height distribution (split by sex) is a bell curve, that tells us something about how height works.
When we look at intelligence, or results on intelligence tests, there aren’t obvious units to use. You can report raw scores (i.e. the number of questions answered correctly), but for the results to be comparable the questions have to stay the same (the Wonderlic has multiple forms, and differences between the forms do lead to differences in measured test scores). For a normalized test, you normalize each version separately, which lets you vary the questions between versions while staying robust to that variation (useful as an anti-cheating measure).
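To make "normalize each version separately" concrete, here is a minimal sketch of a rank-based inverse-normal transform onto a mean-100, SD-15 scale. The method, the constants, and the simulated raw scores are illustrative assumptions, not any particular test's published procedure.

```python
# Minimal sketch of per-form normalization via a rank-based inverse-normal
# transform. The mean-100 / SD-15 scale and the simulated raw scores are
# illustrative assumptions, not a specific test's actual procedure.
import numpy as np
from scipy.stats import norm, rankdata

def normalize_form(raw_scores, mean=100.0, sd=15.0):
    """Map raw scores from one test form onto a normal scale.

    Each form is normalized separately, so a given normalized score means
    'this percentile of the norming sample', regardless of which questions
    that particular form happened to contain.
    """
    raw_scores = np.asarray(raw_scores, dtype=float)
    n = len(raw_scores)
    # Percentile of each examinee within this form (ties share the average rank).
    percentiles = rankdata(raw_scores) / (n + 1)
    # Inverse normal CDF turns percentiles into z-scores, then rescale.
    return mean + sd * norm.ppf(percentiles)

# Two forms with different questions (and so different raw-score distributions)
# land on the same comparable scale after normalization.
form_a = normalize_form(np.random.binomial(50, 0.6, size=1000))
form_b = normalize_form(np.random.binomial(50, 0.4, size=1000))
```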
But ‘raw score’ just pushes the problem back a step. Why the 50 questions of the Wonderlic? Why not different questions? Replace the ten hardest questions with easier ones, and the distribution looks different. Replace the ten easiest questions with harder ones, and the distribution looks different. And for any pair of tests, we need to construct a translation table between them, so we can know what a 32 on the Wonderlic corresponds to on the ASVAB.
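One standard way to build such a translation table is equipercentile equating: a score on test A maps to the test-B score sitting at the same percentile of its norming sample. Here is a rough sketch using made-up norming samples; the score ranges, sample sizes, and the example score of 32 are purely illustrative.

```python
# Sketch of a score translation table via equipercentile equating.
# Norming samples and score scales here are invented for illustration.
import numpy as np

def equate(score_a, sample_a, sample_b):
    """Find the test-B score at the same percentile as score_a on test A."""
    percentile = np.mean(np.asarray(sample_a) <= score_a)
    return np.quantile(sample_b, percentile)

# Illustrative norming samples for two tests with different score scales.
test_a_sample = np.random.binomial(50, 0.5, size=5000)    # 0-50 scale
test_b_sample = np.random.binomial(100, 0.7, size=5000)   # 0-100 scale

print(equate(32, test_a_sample, test_b_sample))
```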
Using a normal distribution sidesteps a lot of this. If your test is bad in some way (like, say, 5% of the population maxing out the score on a subtest), then your resulting normal distribution will be a little wonky, but all sufficiently expressive tests can be directly compared. Because we think there’s this general factor of intelligence, this also means tests are more robust to inclusion or removal of subtests than one might naively expect. (If you remove ‘classics’ from your curriculum, the people who would have scored well on classics tests will still be noticeable on average, because they’re the people who score well on the other tests. This is an empirical claim; the world didn’t have to be this way.)
“Sure,” you reply, “but this is true of any translation.” We could have said intelligence is uniformly distributed between 0 and 100 and used percentile rank (easier to compute and understand than a normal distribution!) instead. We could have thought the polygenic model was multiplicative instead of additive, and used a lognormal distribution instead. (For example, the impact of normally distributed intelligence scores on income seems multiplicative, but if we had lognormally distributed intelligence scores it would be linear instead.) It also matters whether you get the splitting right—fitting a normal distribution to height without splitting by sex first gives you a worse fit.
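Here is a small sketch of that point: starting from the same percentile ranks, you can report a uniform 0-100 score, a normal score, or a lognormal score, and the choice of scale determines whether a multiplicative effect on income looks multiplicative or linear. All the numbers below are illustrative assumptions.

```python
# Sketch of the 'any translation works' point: the same ranks can be reported
# as uniform percentiles, as normal scores, or as lognormal scores. The raw
# score model, the 0-100 percentile scale, and the income numbers are all
# illustrative assumptions.
import numpy as np
from scipy.stats import norm, rankdata

raw = np.random.binomial(50, 0.55, size=1000)   # raw scores on some test
pct = rankdata(raw) / (len(raw) + 1)            # percentile ranks in (0, 1)

percentile_score = 100 * pct                    # uniform 0-100 scale
z = norm.ppf(pct)                               # standard normal scores
normal_score = 100 + 15 * z                     # the usual IQ-style scale

# Lognormal scale: what you might use if you thought the polygenic model were
# multiplicative rather than additive. If each extra point on the normal scale
# multiplies income by a constant factor r, income is exactly linear in this
# lognormal score.
r = 1.05                                        # illustrative multiplier per z-unit
lognormal_score = r ** z
income = 50_000 * lognormal_score               # linear in the lognormal score
```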
So in conclusion, for basically as long as we’ve had intelligence testing there have been normalized and non-normalized tests, and today the normalized tests are more popular. From my read, this is mostly for reasons of convenience, and partly because we expect the underlying distribution to be normal. We don’t do everything we could with normalization, and people aren’t looking for Gaussian mixture models in a way that might make sense.
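For what it's worth, here is roughly what "looking for Gaussian mixture models" would mean in practice, using the height example from above; the component parameters are made up for illustration, and the comparison is just average log-likelihood under each model.

```python
# Sketch: fit a two-component Gaussian mixture to a pooled distribution (like
# height unsplit by sex) and compare it to a single normal fit. All numbers
# are invented for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
heights = np.concatenate([
    rng.normal(162, 6.5, 5000),   # illustrative 'female' component
    rng.normal(176, 7.0, 5000),   # illustrative 'male' component
]).reshape(-1, 1)

single = GaussianMixture(n_components=1).fit(heights)
mixture = GaussianMixture(n_components=2).fit(heights)

# A higher average log-likelihood for the 2-component model is the 'worse fit
# without splitting' point in quantitative form.
print(single.score(heights), mixture.score(heights))
```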