So the better a woman does, the less you believe she can actually do it.
Not quite. (Saving assumptions for the end of the comment.) If a female got a 499 on the Math SAT, then my estimate of her real score is centered on 499. If she scores a 532, then my estimate is centered on 530; a 600, 593; an 800, 780. A 20 point penalty is bigger than a 7 point penalty, but 780 is bigger than 593. So if by “it” you mean “math,” that’s not the right way to look at it; if by “it” you mean “that particular score,” then yes.
Note that this should also be done to male scores, with the appropriate means and standard deviations. (The std difference was smaller than I remembered it being, so the mean effect will probably dominate.) Males scoring 499, 532, 600, and 800 would be estimated as actually getting 501, 532, 596, and 784. So at the 800 level, the relative penalty for being female would only be 4 points, not the 20 it first appears to be.
Note that I’m pretending that the score is from 2012, the SAT is normally distributed with the means and variances reported here, the standard error of measurement is 30, and I’m multiplying Gaussian distributions as discussed here. The 2nd and 3rd assumptions are good near the middle but weak at the ends; the calculation done at 800 is almost certainly incorrect, because we can’t tell the difference between a 3 sigma and a 4 sigma mathematician, both of whom would most likely score 800; we could correct for that by integrating, but that’s too much work for a brief explanation. Note also that the truncation of the normal distribution by having a max and min score probably underestimates the underlying standard deviations, and so the effect would probably be more pronounced with a better test.
Another way to think about this is that a 2.25 sigma male mathematician will score 800, but a 2.66 sigma female mathematician is necessary to score 800, and >2.25 sigmas are 12 out of a thousand, whereas >2.66 sigmas are 4 out of a thousand.
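For concreteness, here’s a minimal sketch of that calculation. The means of 499 and 532 are from above; the standard deviations of 112 and 119 and the measurement-error std of 30 are my guesses at the reported figures, chosen so the outputs match the numbers quoted:

```python
import math

def posterior_mean(score, mean, sd, sem=30):
    """Shrink an observed score toward the group mean. Multiplying a
    N(mean, sd^2) population prior by a N(score, sem^2) measurement
    likelihood gives a posterior mean of mean + r * (score - mean),
    where r = sd^2 / (sd^2 + sem^2)."""
    r = sd ** 2 / (sd ** 2 + sem ** 2)
    return mean + r * (score - mean)

# Assumed 2012 Math SAT parameters: (mean, std). The stds are guesses.
FEMALE = (499, 112)
MALE = (532, 119)

for score in (499, 532, 600, 800):
    print(score, round(posterior_mean(score, *FEMALE)),
          round(posterior_mean(score, *MALE)))

def upper_tail(z):
    """P(Z > z) for a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Tail probabilities for the 2.25 and 2.66 sigma thresholds above.
print(round(1000 * upper_tail(2.25)))  # about 12 in a thousand
print(round(1000 * upper_tail(2.66)))  # about 4 in a thousand
```

Given those assumed standard deviations, this reproduces the 530/593/780 and 501/596/784 estimates above.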
At what point do you update your prior about what women can do?
This isn’t necessary if the prior comes from data that includes the individual in question, and is practically unnecessary in cases where the individual doesn’t appreciably change the distribution. Enough females take the SAT that one more female scorer won’t move the mean or std enough to be noticeable at the precision that they report it.
In the writing example, where we’re dealing with a long tail, it’s not clear how to deal with the sampling issues. You’d probably make an estimate for the current individual under consideration using just the historical data as your prior, and then incorporate them into the historical data for the next individual under consideration, but you might include them before doing the estimation. I’m sure there’s a statistician who’s thought about this much longer and more rigorously than I have.
Can you see how this sort of thing, applied through a whole educational career, would tend to discourage learning and accomplishment?
Even if it’s true (at least until transhumanism really gets going) that the best mathematicians will always be men, it’s not as though second rank mathematicians are useless.
Can you see how this sort of thing, applied through a whole educational career, would tend to discourage learning and accomplishment?
Yes. In general, I recommend that people try to do the best they can with themselves, and not feel guilty about relative performance unless that guilt is motivating for them. If gatekeepers want to use this sort of effect in their reasoning, they should make it quantitative, rather than a verbal justification for a bias.
It’s not clear how desirable accurate expectations of future success are. To use startups as an example, 10% of startups succeed, but founders seem to put their chance of success at over 90%, and this overconfidence may be better than more realistic expectations and fewer startups. For clever women, though, there seems to be a significant amount of pressure to go into STEM fields, followed by high rates of burnout and transfer away from STEM work. What rate of burnout would be strong evidence for overencouragement? I’m not sure.
Yes. In general, I recommend that people try to do the best they can with themselves, and not feel guilty about relative performance unless that guilt is motivating for them.
Having to deal with biased gatekeepers isn’t the same thing as feeling guilty about relative ability, even if some of the same internal strategies would help with both.
If gatekeepers want to use this sort of effect in their reasoning, they should make it quantitative, rather than a verbal justification for a bias.
Having to deal with biased gatekeepers isn’t the same thing as feeling guilty about relative ability
Agreed; that phrase was more appropriate in an earlier draft of the comment, and became less appropriate when I deleted other parts which mused about how much people should expect themselves to regress towards the population mean. They have a lot of private information about themselves, but it’s not clear to me that they have good information about the rest of the population, and so it seems easier to judge one’s absolute than one’s relative competence.
On the topic of dealing with biased gatekeepers, it seems self-defeating to use the presence of obstacles as a discouraging rather than an encouraging factor, conditioned on the opportunity being worth pursuing. Since the probability of success is an input to the calculation of whether or not an opportunity is worth pursuing, though, it’s not clear when and how much accuracy in expectations is desirable.
How likely is this?
I don’t know enough about the population of gatekeepers to comment on the likelihood of finding it in the field, but I am confident in it as a prescription.
What rate of burnout would be strong evidence for overencouragement?
Burnout might be related to factors other than not being able to do the work well enough. It could be a matter of hostile work environment.
From what I’ve read, women are apt to do more housework and childcare than their spouses, so there might be a matter of total work hours—or that one might be balanced out by men taking jobs with longer commutes.
I find it interesting that you cite evidence that is exactly what traditionalist theories of gender would predict, without even mentioning them as a possible explanation.
Can you see how this sort of thing, applied through a whole educational career, would tend to discourage learning and accomplishment?
As this sort of thing becomes more common, it will be necessary to take into account, when making these calculations, the fact that others are also doing this.
Even if it’s true (at least until transhumanism really gets going)
And once transhumanism gets going it will be the case that the best mathematicians will be the people who received intelligence upgrade “Euler” as children. My point is that if you’re hoping for transhumanism because it will solve problems with inequality of ability, you should be careful what you wish for.
It seems to me that, given people are already sexist, and given that telling someone their group has a lower average directly lowers their performance, such a re-weighting should never ever be used.
Note that I’m pretending that the score is from 2012, the SAT is normally distributed with the means and variances reported here, the test-retest variability has a std of 30, and I’m multiplying Gaussian distributions as discussed here. The 2nd and 3rd assumptions are good near the middle but weak at the ends; the calculation done at 800 is almost certainly incorrect, because we can’t tell the difference between a 3 sigma and a 4 sigma mathematician, both of whom would most likely score 800; we could correct for that by integrating, but that’s too much work for a brief explanation. Note also that the truncation of the normal distribution by having a max and min score probably underestimates the underlying standard deviations, and so the effect would probably be more pronounced with a better test.
I’m not sure you’re using the right numbers for the variability. The material I’m finding online indicates that ’30 points with 67% confidence’ is not the meaningful number; the meaningful number is the r correlation between two administrations of the SAT, and the percent of regression is 100*(1-r). The 2011 SAT test-retest reliabilities are all around 0.9 (the math section is 0.91-0.93), so that’s 10%.
Using your female math mean of 499, a female score of 800 would be regressed to 800 − ((800 − 499) × 0.1) = 769.9. Using your male math mean of 532, a male score of 800 would regress down to 800 − ((800 − 532) × 0.1) = 773.2.
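As a sketch in code (the 0.9 reliability and the group means are taken from the comments above):

```python
def regress_to_mean(score, mean, r=0.9):
    """Pull an observed score back toward the group mean by the
    fraction (1 - r), where r is the test-retest reliability."""
    return score - (score - mean) * (1 - r)

print(regress_to_mean(800, 499))  # female: about 769.9
print(regress_to_mean(800, 532))  # male: about 773.2
```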
Hmm. You’re right that test-retest reliability typically refers to a correlation coefficient, and I was using the standard error of measurement. I’ll edit the grandparent to use the correct terms.
I’m not sure I agree with your method because it seems odd to me that the standard deviation doesn’t impact the magnitude of the regression to the mean effect. It seems like you could calculate the test-retest reliability coefficient from the population mean, population std, and standard measurement error std, and there might be different reliability coefficients for male and female test-takers, and then that’d probably be the simpler way to calculate it.
I’m not sure I agree with your method because it seems odd to me that the standard deviation doesn’t impact the magnitude of the regression to the mean effect.
Well, it delivers reasonable numbers, it seems to me that one ought to employ reliability somehow, it’s supported by the two links I gave, and it makes sense to me: standard deviation doesn’t come into it because we’ve already singled out a specific datapoint. We’re not asking how many test-takers will hit 800 (where the standard deviation would be very important) but, given that a test-taker has hit 800, how far will they fall back?
Now that I’ve run through the math, I agree with your method. Supposing the measurement error is independent of score (which can’t be true because of the bounds, and in general probably isn’t true), we can calculate the reliability coefficient as (pop var)/(pop var + measurement var) = .93 for women and .94 for men. The resulting formulas are exactly the same, and the difference between the numbers I calculated and the numbers you calculated comes from our differing estimates of the reliability coefficient.
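A quick numerical check of that equivalence (the 112 population std is an illustrative guess; the 30 is the standard error of measurement discussed upthread):

```python
def gaussian_product_mean(score, mean, sd, sem):
    """Posterior mean from multiplying a N(mean, sd^2) prior with a
    N(score, sem^2) likelihood: a precision-weighted average."""
    return (score / sem ** 2 + mean / sd ** 2) / (1 / sem ** 2 + 1 / sd ** 2)

def reliability_shrinkage(score, mean, sd, sem):
    """Linear shrinkage using r = pop_var / (pop_var + measurement_var)."""
    r = sd ** 2 / (sd ** 2 + sem ** 2)
    return mean + r * (score - mean)

# The two formulas agree up to floating-point error.
for score in (499, 600, 800):
    a = gaussian_product_mean(score, 499, 112, 30)
    b = reliability_shrinkage(score, 499, 112, 30)
    assert abs(a - b) < 1e-6
```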
In general, the reliability coefficient doesn’t take into account extra distributional knowledge. If you knew that scores were power-law distributed in the population but the test error were normally distributed, for example, then you would want to calculate the posterior the long way: with the population data as your prior distribution and the measurement distribution as your likelihood, and the posterior is the renormalized product of the two. I don’t think that using a linear correction based on the reliability coefficient would get that right, but I haven’t worked it out to show the difference.
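Here’s a rough numerical sketch of “the long way.” The power-law exponent, score range, and measurement std are all made up for illustration; the point is just that the posterior mean comes from gridding the prior times the likelihood, not from a fixed linear shrinkage.

```python
import math

def power_law_posterior_mean(observed, sem=30.0, exponent=3.0, lo=200, hi=2000):
    """Posterior mean of the true score on a discrete grid, with a
    prior proportional to x^-exponent and a N(observed, sem^2) likelihood."""
    grid = range(lo, hi + 1)
    weights = [x ** -exponent * math.exp(-(observed - x) ** 2 / (2 * sem ** 2))
               for x in grid]
    total = sum(weights)
    return sum(x * w for x, w in zip(grid, weights)) / total

# The decreasing prior pulls the estimate below the observed score.
print(power_law_posterior_mean(800))
```

With a normal prior instead, this same grid calculation would recover the linear shrinkage given by the reliability formula.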
That makes sense, but I think the SAT is constructed, like IQ tests, to be normally rather than power-law distributed, so in this case we get away with a linear correction like reliability.
Thanks for the details.
I’m less and less surprised to see interesting comments like this at 0 karma.
I took your “apt” at first to mean “more able to”!
I just threw in the bit about transhumanism for completeness.
Needing to get the implants in childhood is probably an early phase—I’m expecting that more and better plasticity for adults will also get developed.
Well, unconstrained self-modification can have even more unpleasant results.