differences (between the sexes) in engineering aptitude in the general population says nothing about differences in engineering skill among people who have already been hired as engineers
I think Bayes would disagree a little :-) If your prior says blue weasels are generally better at C++ than red weasels, then a red weasel’s high test score is more likely to be a random fluke than a blue weasel’s equally high score.
ETA: it seems Robin made a similar point a while ago and got crucified for it because he didn’t use weasels!
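To make the "random fluke" point concrete, here is a minimal sketch of the Bayes calculation for a pass/fail test. Every number in it is a made-up assumption for illustration (the base rates of skilled weasels per color, and the test's hit and false-alarm rates); nothing here comes from real data.

```python
def p_skilled_given_pass(base_rate, p_pass_skilled=0.9, p_pass_unskilled=0.1):
    """Posterior probability that a weasel is truly skilled, given that it passed.

    base_rate        -- prior fraction of skilled weasels of this color (assumed)
    p_pass_skilled   -- chance a skilled weasel passes the test (assumed)
    p_pass_unskilled -- chance an unskilled weasel passes by fluke (assumed)
    """
    true_pos = base_rate * p_pass_skilled
    false_pos = (1.0 - base_rate) * p_pass_unskilled
    return true_pos / (true_pos + false_pos)

# Hypothetical priors: 30% of blue and 10% of red weasels are skilled at C++.
blue = p_skilled_given_pass(0.30)
red = p_skilled_given_pass(0.10)

print(f"P(skilled | pass, blue) = {blue:.3f}")
print(f"P(skilled | pass, red)  = {red:.3f}")
```

The test treats both colors identically, yet under these assumed priors the same passing result leaves a lower posterior for the red weasel, because a larger share of red passes are flukes.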
The problem seems even worse than that. Suppose I can somehow magically determine the actual C++ ability of any weasel, and hire the first ten I come across that are above some threshold. Someone who doesn’t have my magical ability would still (rationally) expect that the average skill among red weasels that I hire is lower than the average skill among blue weasels that I hire. (And I would expect this myself before I started the hiring process.) The same holds if I decide to gather some fixed number of candidates and hire the top 10%.
One way Perplexed could be right is if I have the magical ability (or a near perfect test), and I decide to hire only weasels whose C++ ability is exactly X (no higher and no lower), but that seems rather unrealistic. What other situations could produce the result that Perplexed claimed?
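The threshold argument can be checked with the closed-form mean of a truncated normal distribution. This is only a sketch under assumptions that are not in the thread: skill is normally distributed in each color with the same spread, blue weasels have a slightly higher mean, and the (perfect) test hires everyone above one shared cutoff. All numbers are hypothetical.

```python
from statistics import NormalDist

def mean_above_threshold(mu, sigma, threshold):
    """E[X | X > threshold] for X ~ Normal(mu, sigma), via the inverse Mills ratio."""
    std = NormalDist()  # standard normal
    a = (threshold - mu) / sigma
    return mu + sigma * std.pdf(a) / (1.0 - std.cdf(a))

# Hypothetical populations: same spread, blue mean 5 points higher.
THRESHOLD = 130.0
blue_hired = mean_above_threshold(mu=105.0, sigma=15.0, threshold=THRESHOLD)
red_hired = mean_above_threshold(mu=100.0, sigma=15.0, threshold=THRESHOLD)

print(f"mean skill of hired blue weasels: {blue_hired:.2f}")
print(f"mean skill of hired red weasels:  {red_hired:.2f}")
```

Even with a perfect measurement, the hired blue weasels come out ahead on average under these assumptions, because the cutoff says nothing about how far above it each hire sits. The gap shrinks as the selection gets tighter, and vanishes in the "exactly X" case described above.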
Ouch. I wish I had read this before dismissing cousin_it’s citation of Robin’s point.
Ok, my nothing was unjustified. Though I will point out that if I hire for a second-tier engineering organization which pays no higher than it has to, then the blue weasels that I hire will probably not be much better than the red weasels. All the blue super-weasels will get jobs elsewhere. In fact, it will be found that they are not weasels at all, but rather martens or minks.
Depends on the distribution of C++ ability. Suppose C++ weasels are a mix of normal weasels and geniuses, that geniuses have far higher ability than normal weasels, and that across both groups blue weasels are on average better by a constant considerably smaller than that difference. Your test could leave you with mostly genius red weasels and a mix of normal and genius blue weasels such that the average ability of red weasels who pass is higher.
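A tiny worked instance of this mixture scenario, with numbers invented purely for illustration: each color is 95% normal weasels and 5% geniuses, blue weasels get a constant bonus of 1 point, and the test cutoff happens to fall between the normal red and normal blue skill levels.

```python
def mean_skill_of_passers(groups, threshold):
    """Average skill among those at or above the threshold.

    groups -- list of (population_share, skill) pairs for one color
    """
    passed = [(share, skill) for share, skill in groups if skill >= threshold]
    total_share = sum(share for share, _ in passed)
    return sum(share * skill for share, skill in passed) / total_share

# Hypothetical skill levels: normal = 10, genius = 100, blue bonus = 1.
BLUE_BONUS = 1.0
red = [(0.95, 10.0), (0.05, 100.0)]
blue = [(0.95, 10.0 + BLUE_BONUS), (0.05, 100.0 + BLUE_BONUS)]
threshold = 10.5  # assumed cutoff between normal red and normal blue

red_mean = mean_skill_of_passers(red, threshold)    # only red geniuses pass
blue_mean = mean_skill_of_passers(blue, threshold)  # every blue weasel passes

print(f"red passers:  {red_mean:.1f}")
print(f"blue passers: {blue_mean:.1f}")
```

Here blue weasels are better in every subgroup, yet the red weasels who pass are all geniuses while the blue passers are mostly normal, so the red average among hires is far higher. The result depends entirely on where the cutoff sits relative to the two clusters.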
Alternatively, if far fewer red weasels learn C++, and the red weasels who do are selected for aptitude, the average aptitude of red weasels who learn C++ could be higher than that of blue weasels.
True, under the assumption that the weasels are selected only by the threshold test. Actually, since time immemorial red weasels program in Fortran and C++ is thought to be a blue weasel domain. Therefore few red weasels actually plan to be hired as C++ programmers; only those who are extraordinarily apt apply for such a job. As a coincidence, among the weasels who apply the average C++ ability is significantly higher within the red subset.
Note that this holds even if the skill distribution of red and blue weasels is exactly the same, but red weasels are rarer (or, say, red weasels that qualify are rarer, but the ability distribution among the red weasels that qualify is exactly the same as for the blue weasels). (Or, you could just apply this to the class of weasels named John.)
Thanks! You could produce Perplexed’s claimed outcome by fiat: use your magic detector to hire weasels so that they fit the desired distribution :-) Or you could set the threshold higher for red weasels and get the same result. Both options seem unsatisfactory...
I agree with Robin’s point. But completing 4 years of Engineering school and then getting hired is a bit different than scoring high on a single test. I stand by my italicized nothing as mild hyperbole. Milder, in fact, than “crucified”.
How likely is this if the test involves writing programs that work?
If the test is not susceptible to flukes, then my argument doesn’t work. That said, flukes aren’t necessarily extreme outliers. The red weasel you hired is more likely to be a slightly-below-standard performer that was having an unusually lucid day when you interviewed it.
On the other hand, Wei’s argument works even if the test has no flukes. Here’s one way to reformulate it: your binary decision to hire or reject a weasel is not informed by gradations of skill above the cutoff point. If blue weasels are more likely than red ones to hit the extreme high notes of software design (that weren’t on the test because then the test would reject pretty much everyone), you’ll see that inequality among the weasels you hire too.
If you successfully design your C++ hiring criteria to be colorblind — to not notice the color of weasels, but only to notice how good they are at C++ — then performance on the hiring criteria will shadow weasel color as an indicator of C++ ability.
You might end up with 99 blue weasels hired for every one red weasel hired; but you will have successfully filtered out all the red weasels that are bad at C++, just as you successfully filtered out all the blue weasels that are bad at C++. After all, only a tiny fraction of blue weasels meet your C++ hiring criteria, too.
So at that point, you should actually trust your hiring criteria and compensate weasels with no regard for their color.
(It’s true that red weasels are likely to, at one or two times in their career, need a couple of months off for frenzied weasel dancing. But it’s also true that blue weasels are more likely to get trodden on by a cow and need medical leave, because they’re less careful when walking through pastures on the way to work.)
But if you then additionally learn the color, you can make further conclusions which the test failed to deliver because of the color blindness.
Maybe. It depends on the distributions over programming ability that the test and color, respectively, provide. [ETA1: I should have written, “the test and the conjunction of test and color, respectively...”. The point is that, conditioned on test results, color could be independent of ability.] [ETA2: Though, if your “further conclusions” was meant to include things beyond what the test tests for, but which correlate with color, then you’re definitely right.]
The test’s being colorblind doesn’t mean that its results don’t correlate with color in the population of subjects. It means that, were you to fix a test subject and vary its color while holding everything else constant, its test results wouldn’t correlate with the color change.
Learning the color means you can make further predictions about the general distribution of ability over the general populace—not over the populace you have already selected/hired.
I have no problem with being wrong… but I do like to know why :)
You didn’t give a reason for your wrong claim, so it’s hard to guess why you held it.
Maybe this will help: only if the test is infinitely long (produces an infinite amount of evidence as to the actual skill of the tested subject) will the prior evidence be completely irrelevant.
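One way to see why only an infinitely long test erases the prior is the standard normal-normal update. The sketch below assumes things the thread does not specify: a normal prior on skill per color, independent normal noise on each test question, and invented numbers throughout.

```python
def posterior_mean(prior_mean, prior_sd, observed_mean, noise_sd, n_questions):
    """Posterior mean of skill after n noisy questions (conjugate normal model)."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n_questions / noise_sd ** 2
    weighted = prior_precision * prior_mean + data_precision * observed_mean
    return weighted / (prior_precision + data_precision)

# Hypothetical priors: blue ~ N(12, 2), red ~ N(10, 2); both weasels average 14 on the test.
gaps = []
for n in (1, 10, 100, 10_000):
    blue = posterior_mean(12.0, 2.0, 14.0, 3.0, n)
    red = posterior_mean(10.0, 2.0, 14.0, 3.0, n)
    gaps.append(blue - red)
    print(f"n={n:>6}: blue={blue:.3f} red={red:.3f} gap={blue - red:.4f}")
```

With identical observed scores, the estimated gap between the two weasels shrinks as the test gets longer but stays above zero for every finite length.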
Ok, but I had the sense that, once you’ve already hired, based on skill, learning the colour will no longer give you any help in determining the skills of the people you have already hired… but will only give an indication of what percentage of each colour in the general population has the level of skill you hired-for.
Instead of thinking “I perfectly measured his skill level; it’s 14”, think “I obtained X bits of evidence that his skill is between 13 and 15”.
Um—I’m not sure how this relates to what I said… can you please expand/clarify? :)
What I mean is: once you learn the colour, you can reason backwards that “oh, given we have X people with a skill roughly between 13 and 15… 90% of them are blue… this must imply that in the general population, blue weasels are more likely than red weasels to score roughly between 13 and 15 on skill tests at a ratio of roughly 9 to 1”
I don’t know that you can prove much else based on just that data alone.
My comment kinda assumed that hiring criteria meeting your strict standard of colorblindness are unexpectedly hard to design. Let’s say all red weasels and most blue ones suck at C++, but some blue weasels completely rule. Also, once a month every weasel (blues and reds equally) unpredictably goes into a code frenzy for 182 minutes and temporarily becomes exactly as good as a blue one that rules. Your standardized test will mostly admit blue weasels that rule, but sometimes you’ll get a random-colored weasel that sucks. If you’re colorblind, you have no hope of weeding out the random suckers. But if you’re color-aware, you can weed out half of them. Of course it also works if a tiny minority of red weasels can code instead of none.
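The code-frenzy example can be worked through numerically. The sketch below adds assumptions the comment leaves open: a 30-day month, a test short enough that a weasel is either in a frenzy or not for its whole duration, equal numbers of red and blue applicants, no red weasel that rules, and a hypothetical 1% of blue weasels that rule.

```python
FRENZY_MINUTES = 182
MINUTES_PER_MONTH = 30 * 24 * 60             # assumed 30-day month
p_frenzy = FRENZY_MINUTES / MINUTES_PER_MONTH

p_blue = 0.5                                 # assumed share of blue applicants
p_rules_given_blue = 0.01                    # hypothetical share of blue weasels that rule

# Joint probabilities of (color, kind, passes the test).
pass_blue_ruler = p_blue * p_rules_given_blue                   # rulers always pass
pass_blue_sucker = p_blue * (1 - p_rules_given_blue) * p_frenzy
pass_red_sucker = (1 - p_blue) * p_frenzy                       # reds pass only in a frenzy

p_sucks_given_pass_blue = pass_blue_sucker / (pass_blue_ruler + pass_blue_sucker)
p_sucks_given_pass_red = 1.0                 # by assumption, no red weasel rules
red_share_of_suckers = pass_red_sucker / (pass_red_sucker + pass_blue_sucker)

print(f"P(sucks | pass, blue) = {p_sucks_given_pass_blue:.3f}")
print(f"P(sucks | pass, red)  = {p_sucks_given_pass_red:.3f}")
print(f"share of passing suckers that are red = {red_share_of_suckers:.3f}")
```

Under these assumptions roughly half of the suckers who slip through are red, which is where "weed out half of them" comes from: a color-aware hirer can discard them at no cost, while the blue suckers stay indistinguishable from blue rulers.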
The problems begin when the ability distribution of red/blue weasels changes, and the hiring committee is still using restrictions based on the old distribution. E.g. red weasel ability has been steadily increasing, but the old hiring criteria still say “don’t hire red weasels as they have no technical ability to speak of!”
But yes, I agree: it’s all difficult because it’s hard to create a test that is as accurate as actually working with a person in real life. That explains the popularity of those awful “three month probation periods”.
+1 for this amusing and surprisingly accurate description of pregnancy and early post-natal care :)