Vaniver comments on LW Women: LW Online

Vaniver 16 Feb 2013 0:37 UTC
13 points
I don’t think I’ve seen that on LW, but I also haven’t looked for it.

The version of the argument I’m familiar with boils down to ‘regression to the mean.’ Because tests provide imperfect estimates of the true ability, our final posterior is a combination of the prior (i.e. population ability distribution) and the new evidence.

Suppose someone scores 600 on a test whose mean is 500, and the test scores and underlying ability are normally distributed. Our prior belief that someone’s true ability is 590 is higher than our prior belief that their true ability is 600, which is higher than our prior belief that their true ability is 610, because the normal distribution is decreasing as you move away from the mean. If the test was off by 10, then it’s more likely to overestimate than underestimate. That is, our posterior is that it’s more likely that their real ability is 590 than 610. (Assuming it’s as easy to be positively lucky as negatively lucky, which is questionable.)

The same happens in the reverse direction: abnormally low scores are more likely to underestimate than overestimate the true ability (again, assuming it’s equally easy for luck to push up and down). Depending on the precision of the test, the end effect is probably small, but the size of the effect increases the more extreme the results are.

On math scores in particular, both the male mean and the male standard deviation are higher than the female mean and female standard deviation. The difference in standard deviations is discussed much less than the difference in means, but it turns out to be very important when calculating this effect. Thus, the chance that a female got an 800 on the Math SAT due to luck is higher than the chance that a male got an 800 on the Math SAT due to luck. Of course, the true ability necessary to get an 800 by luck is rather high, but could still be below some meaningful cutoff, and like Nancy points out, getting more evidence should make the posterior better reflect the true ability.
What links here?
- Eugine_Nier's comment on Don’t Get Offended by katydee (9 Mar 2013 6:44 UTC; 2 points)
- Eugine_Nier's comment on Don’t Get Offended by katydee (9 Mar 2013 20:05 UTC; 2 points)
- NancyLebovitz 16 Feb 2013 4:30 UTC
  6 points
  Parent
  So the better a woman does, the less you believe she can actually do it. At what point do you update your prior about what women can do?
  
  This is reminding me of How to Suppress Women’s Writing.
  - Vaniver 16 Feb 2013 4:56 UTC
    26 points
    Parent
    
    So the better a woman does, the less you believe she can actually do it.
    
    Not quite. (Saving assumptions for the end of the comment.) If a female got a 499 on the Math SAT, then my estimate of her real score is centered on 499. If she scores a 532, then my estimate is centered on 530; a 600, 593; an 800, 780. A 20 point penalty is bigger than a 7 point penalty, but 780 is bigger than 593, so if by “it” you mean “math” that’s not the right way to look at it, but if by “it” you mean “that particular score” then yes.
    
    Note that this should also be done to male scores, with the appropriate means and standard deviations. (The std difference was smaller than I remembered it being, so the mean effect will probably dominate.) Males scoring 499, 532, 600, and 800 would be estimated as actually getting 501, 532, 596, and 784. So at the 800 level, the relative penalty for being female would only be 4 points, not the 20 it first appears to be.
    
    Note that I’m pretending that the score is from 2012, the SAT is normally distributed with mean and variances reported here, the standard measurement error is 30, and I’m multiplying Gaussian distributions as discussed here. The 2nd and 3rd assumptions are good near the middle but weak at the ends; the calculation done at 800 is almost certainly incorrect, because we can’t tell the difference between a 3 or 4 sigma mathematician, both of whom would most likely score 800; we could correct for that by integrating, but that’s too much work for a brief explanation. Note also that the truncation of the normal distribution by having a max and min score probably underestimates the underlying standard deviations, and so the effect would probably be more pronounced with a better test.
    
    Another way to think about this is that a 2.25 sigma male mathematician will score 800, but a 2.66 sigma female mathematician is necessary to score 800, and >2.25 sigmas are 12 out of a thousand, whereas >2.66 sigmas are 4 out of a thousand.
    
    At what point do you update your prior about what women can do?
    
    This isn’t necessary if the prior comes from data that includes the individual in question, and is practically unnecessary in cases where the individual doesn’t appreciably change the distribution. Enough females take the SAT that one more female scorer won’t move the mean or std enough to be noticeable at the precision that they report it.
    
    In the writing example, where we’re dealing with a long tail, then it’s not clear how to deal with the sampling issues. You’d probably make an estimate for the current individual under consideration just using historical data as your prior, and then incorporate them in the historical data for the next individual under consideration, but you might include them before doing the estimation. I’m sure there’s a statistician who’s thought about this much longer and more rigorously than I have.
    What links here?
    gwern's comment on Problems in Education by ThinkOfTheChildren (10 Apr 2013 18:47 UTC; 4 points)
    - NancyLebovitz 16 Feb 2013 12:44 UTC
      7 points
      Parent
      Thanks for the details.
      
      Can you see how this sort of thing, applied through a whole educational career, would tend to discourage learning and accomplishment?
      
      Even if it’s true (at least until transhumanism really gets going) that the best mathematicians will always be men, it’s not as though second rank mathematicians are useless.
      - Vaniver 16 Feb 2013 18:17 UTC
        6 points
        Parent
        
        Can you see how this sort of thing, applied through a whole educational career, would tend to discourage learning and accomplishment?
        
        Yes. In general, I recommend that people try to do the best they can with themselves, and not feel guilty about relative performance unless that guilt is motivating for them. If gatekeepers want to use this sort of effect in their reasoning, they should make it quantitative, rather than a verbal justification for a bias.
        
        It’s not clear how desirable accurate expectations of future success are. To use startups as an example, 10% of startups succeed, but founders seem to put their chance of success at over 90%, and this may be better than more realistic expectations and less startups. For clever women, though, there seems to be a significant amount of pressure to go into STEM fields followed by high rates of burnout and transfer away from STEM work. What rate of burnout would be strong evidence for overencouragement? I’m not sure.
        NancyLebovitz 17 Feb 2013 13:47 UTC
        6 points
        Parent
        
        Yes. In general, I recommend that people try to do the best they can with themselves, and not feel guilty about relative performance unless that guilt is motivating for them.
        
        Having to deal with biased gatekeepers isn’t the same thing as feeling guilty about relative ability, even if some of the same internal strategies would help with both.
        
        If gatekeepers want to use this sort of effect in their reasoning, they should make it quantitative, rather than a verbal justification for a bias.
        
        How likely is this?
        Vaniver 17 Feb 2013 16:09 UTC
        3 points
        Parent
        
        Having to deal with biased gatekeepers isn’t the same thing as feeling guilty about relative ability
        
        Agreed; that phrase was more appropriate in an earlier draft of the comment, and became less appropriate when I deleted other parts which mused about how much people should expect themselves to regress towards the population mean. They have a lot of private information about themselves, but it’s not clear to me that they have good information about the rest of the population, and so it seems easier to judge one’s absolute than one’s relative competence.
        
        On topic to dealing with biased gatekeepers, it seems self-defeating to use the presence of obstacles as a discouraging rather than encouraging factor, conditioned on the opportunity being worth pursuing. Since the probability of success is an input to the calculation of whether or not an opportunity is worth pursuing, it’s not clear when and how much accuracy in expectations is desirable.
        
        How likely is this?
        
        I don’t know enough about the population of gatekeepers to comment on the likelihood of finding it in the field, but I am confident in it as a prescription.
        NancyLebovitz 16 Feb 2013 18:54 UTC
        0 points
        Parent
        
        What rate of burnout would be strong evidence for overencouragement?
        
        Burnout might be related to factors other than not being able to do the work well enough. It could be a matter of hostile work environment.
        
        From what I’ve read, women are apt to do more housework and childcare than their spouses, so there might be a matter of total work hours—or that one might be balanced out by men taking jobs with longer commutes.
        Eugine_Nier 17 Feb 2013 3:12 UTC
        14 points
        Parent
        
        From what I’ve read, women are apt to do more housework and childcare than their spouses, so there might be a matter of total work hours—or that one might be balanced out by men taking jobs with longer commutes.
        
        I find it interesting that you site evidence that is exactly what traditionalist theories of gender would predict, and not even mention them as a possible explanation.
        [deleted] 17 Feb 2013 16:22 UTC
        0 points
        Parent
        I’m less and less surprised to see interesting comments like this at 0 karma.
        Jonathan_Graehl 18 Feb 2013 10:04 UTC
        0 points
        Parent
        I took your “apt” at first to mean “more able to”!
      - Eugine_Nier 17 Feb 2013 3:20 UTC
        1 point
        Parent
        
        Can you see how this sort of thing, applied through a whole educational career, would tend to discourage learning and accomplishment?
        
        As this sort of think becomes more common, it will be necessary to take into account the fact that others are also doing this when making these calculations.
        
        Even if it’s true (at least until transhumanism really gets going)
        
        And once transhumanism gets going it will be the case that the best mathematicians will be the people who received intelligence upgrade “Euler” as children. My point is that if you’re hoping for transhumanism because it will solve problems with inequality of ability, you should be careful what you wish for.
        NancyLebovitz 17 Feb 2013 4:44 UTC
        0 points
        Parent
        I just threw in the bit about transhumanism for completeness.
        
        Needing to get the implants in childhood is probably an early phase—I’m expecting that more and better plasticity for adults will also get developed.
        Eugine_Nier 17 Feb 2013 5:36 UTC
        4 points
        Parent
        
        I’m expecting that more and better plasticity for adults will also get developed.
        
        Well, unconstrained self-modification can have even more unpleasant results.
      - blashimov 9 Apr 2013 16:41 UTC
        0 points
        Parent
        It seems to me that, given people are already sexist, and given that telling someone their group has a lower average directly lowers their performance, such a re-weighting should never ever be used.
    - gwern 16 Feb 2013 18:19 UTC
      3 points
      Parent
      
      Note that I’m pretending that the score is from 2012, the SAT is normally distributed with mean and variances reported here, the test-retest variability has a std of 30, and I’m multiplying Gaussian distributions as discussed here. The 2nd and 3rd assumption is good near the middle but weak at the ends; the calculation done at 800 is almost certainly incorrect, because we can’t tell the difference between a 3 or 4 sigma mathematician, both of whom would most likely score 800; we could correct for that by integrating, but that’s too much work for a brief explanation. Note also that the truncation of the normal distribution by having a max and min score probably underestimates the underlying standard deviations, and so the effect would probably be more pronounced with a better test.
      
      I’m not sure you’re using the right numbers for the variability. The material I’m finding online indicates that ’30 points with 67% confidence’ is not the meaningful number, but simply the r correlation between 2 administrations of the SAT: the percent of regression is 100*(1-r).
      
      The 2011 SAT test-retest reliabilities are all around 0.9 (the math section is 0.91-0.93), so that’s 10%.
      
      Using your female math mean of 499, a female score of 800 would be regressed to 800 - ((800 − 499) 0.1) = 769.9. Using your male math mean of 532, then a male score of 800 would regress down to 800 - ((800 − 532) 0.1) = 773.2.
      What links here?
      gwern's comment on Open thread, February 15-28, 2013 by David_Gerard (20 Feb 2013 22:07 UTC; 10 points)
      - Vaniver 16 Feb 2013 18:27 UTC
        1 point
        Parent
        Hmm. You’re right that test-retest reliability typically refers to a correlation coefficient, and I was using the standard error of measurement. I’ll edit the grandparent to use the correct terms.
        
        I’m not sure I agree with your method because it seems odd to me that the standard deviation doesn’t impact the magnitude of the regression to the mean effect. It seems like you could calculate the test-retest reliability coefficient from the population mean, population std, and standard measurement error std, and there might be different reliability coefficients for male and female test-takers, and then that’d probably be the simpler way to calculate it.
        gwern 16 Feb 2013 18:36 UTC
        1 point
        Parent
        
        I’m not sure I agree with your method because it seems odd to me that the standard deviation doesn’t impact the magnitude of the regression to the mean effect.
        
        Well, it delivers reasonable numbers, it seems to me that one ought to employ reliability somehow, is supported by the two links I gave, and makes sense to me: standard deviation doesn’t come into it because we’ve already singled out a specific datapoint; we’re not asking how many test-scorers will hit 800 (where standard deviation would be very important) but given that a test scorer has hit 800, how will they fall back?
        Vaniver 16 Feb 2013 18:59 UTC
        1 point
        Parent
        Now that I’ve run through the math, I agree with your method. Supposing the measurement error is independent of score (which can’t be true because of the bounds, and in general probably isn’t true), we can calculate the reliability coefficient by (pop var)/(pop var + measurement var)=.93 for women and .94 for men. The resulting formulas are the exact same, and the difference between the numbers I calculated and the numbers you calculated comes from our differing estimates of the reliability coefficient.
        
        In general, the reliability coefficient doesn’t take into account extra distributional knowledge. If you knew that scores were power-law distributed in the population but the test error were normally distributed, for example, then you would want to calculate the posterior the long way: with the population data as your prior distribution and the the measurement distribution as your likelihood ratio distribution, and the posterior is the renormalized product of the two. I don’t think that using a linear correction based on the reliability coefficient would get that right, but I haven’t worked it out to show the difference.
        gwern 16 Feb 2013 19:39 UTC
        1 point
        Parent
        
        In general, the reliability coefficient doesn’t take into account extra distributional knowledge. If you knew that scores were power-law distributed in the population but the test error were normally distributed, for example, then you would want to calculate the posterior the long way: with the population data as your prior distribution and the the measurement distribution as your likelihood ratio distribution, and the posterior is the renormalized product of the two. I don’t think that using a linear correction based on the reliability coefficient would get that right, but I haven’t worked it out to show the difference.
        
        That makes sense, but I think the SAT is constructed like IQ tests to be normally rather than power-law distributed, so in this case we get away with a linear correlation like reliability.
  - gwern 16 Feb 2013 4:56 UTC
    15 points
    Parent
    
    So the better a woman does, the less you believe she can actually do it.
    
    Yes; “extraordinary claims require extraordinary evidence, but ordinary claims require only ordinary evidence.” If a random person tells me that they are a Rhodes Scholar and a certified genius, I will be more skeptical than if they told me they merely went to Harvard, and more skeptical of that than if they told me they went to community college. And at some level of ‘better’ I will stop believing them entirely.
    
    At what point do you update your prior about what women can do?
    
    To go back to the multilevel model framework: a single high data point/group will be pulled back down to the mean of the population data points/group (how much will depend on the quality of the test), while the combined mean will slightly increase.
    
    However, this increase may be extremely small, as makes sense. If you know from the official SAT statistics that 3 million women took the SAT last year and scored an average of 1200 (or whatever a medium score looks like these days, they keep changing the test), then that’s an extremely informative number which will be hard to change since you already know of how millions of women have done in the past: so whatever you learn from a single random woman scoring 800 this year will be diluted like 1 in 3 million...
    - gwern 5 Mar 2013 4:50 UTC
      0 points
      Parent
      Nifty: I’ve found an explanation of Stein’s paradox, and it turns out to be basically shrinkage!
      - wedrifid 5 Mar 2013 5:40 UTC
        0 points
        Parent
        
        Nifty: I’ve found an explanation of Stein’s paradox, and it turns out to be basically shrinkage!
        
        Ahh… “Expect regression to the mean ”.
  - drethelin 16 Feb 2013 4:49 UTC
    8 points
    Parent
    The funny thing is this kind of discrimination can lead to (or appear to lead to)the average elite woman being MORE qualified than the average man at a similar level.
    - Eugine_Nier 16 Feb 2013 7:42 UTC
      −1 points
      Parent
      Only if you over do it.
      - NancyLebovitz 16 Feb 2013 12:40 UTC
        1 point
        Parent
        What are the odds?
        
        Also, do you apply a downwards adjustment to your evaluation of a woman’s original mathematics?
        
        As randomness* would have it, I just ran into an example of women doing that to a woman for her fiction.
        
        *On the radio as I was catching up on the thread.
        Eugine_Nier 17 Feb 2013 20:38 UTC
        −1 points
        Parent
        
        As randomness* would have it, I just ran into an example of women doing that to a woman for her fiction.
        
        Just read the article. Given the information presented my prior is that Jamaica Kincaid got her job due to (possibly informal) affirmative action, i.e., the New Yorker felt like they needed a black female writer to be “diverse”.
        NancyLebovitz 18 Feb 2013 16:17 UTC
        6 points
        Parent
        You don’t know how many black female authors they’ve got, and you haven’t read any of her work.
        Eugine_Nier 19 Feb 2013 1:19 UTC
        −1 points
        Parent
        True. This is my prior for “black female author gets extremely fast tracked” and the article didn’t say anything that would make me update away from it.
        Eugine_Nier 17 Feb 2013 3:09 UTC
        −1 points
        Parent
        
        Also, do you apply a downwards adjustment to your evaluation of a woman’s original mathematics?
        
        Depends on what other evidence I have.
  - bogdanb 9 Mar 2013 22:54 UTC
    3 points
    Parent
    
    So the better a woman does, the less you believe she can actually do it.
    
    It occurs to me that from Vaniver’s explanation one could also derive the sentence “So the better a man does, the less you believe he can actually do it.” As far as I can tell, the processes of drawing either of the two conclusions are isomorphic. For that matter, the same reasoning would also lead to the derivation “So the worse a woman does, the more you believe she is actually better.” (With an analogous statement for men. This is explicitly pointed out in the explanation.)
    
    The difference between the men and the women is point where we switch from “better/less” to “worse/more”, and the magnitude of the effect as we get further away from that point. (That is, the mean and the standard deviation.)
    
    I can’t figure out a way of saying this without making me sound bad even to myself, but it seems… I don’t know, annoying at least, that you picked a logical conclusion that aplies exactly the same to both genders, but apply that to women, don’t mention at all what appears to be the only factual assertion of an actual difference between the abilities of women and men (and which I haven’t seen actually contested in neither this nor the earlier discussion on the subject), did not in fact criticise Vaniver’s explanation—which, by the way, as far as I can tell from his post, is just an explanation for beo’s benefit, I can’t deduce from its text that he’s actually endorsing using the procedure—and at the same time you manage to make both him and me, even before I participate, seem that we should be ashamed of ourselves, by sort of implying that he’ll also do something else not mentioned by him, and not logically implied by the explanation, and that would have a bad consequence if done very badly. (Well, it feels that way to me, I can’t tell if Vaniver took umbrage nor if I’m actually reading correctly the society around me with respect to which the shame relates.)
    
    I’m not sure if I have a point, exactly, I’m sort of just sharing my feelings in case it generates some insight. I don’t think you did this as an intentional dishonesty. It’s weird, it looks like there’s a blind spot exactly in the direction you’re looking at (after all, this is exactly the topic of the discussion).
    
    But then again I also feel like I have such a blind spot, like it’s impolite that I should have noticed this, or even that I’m a bad person for not agreeing with your conotation and I can’t tell why. (And I’m some sort of misoginistic pig because I can’t see it.)
    
    I seem to have that reaction quite often around this kind of discussion. I usually get sort of angry, go away, and dismiss the particular person that caused the reaction, but (I like to think) that’s only because I have low priors on people in general, which doesn’t apply here, and it seems worse somehow.
    
    As far as I can tell I actually like men much less than women (in the “being around them” sense), it feels as if I’m very inclined to equality, but somehow this kind of feminism seem very annoying. (I’m not exactly sure what I mean when I say “this kind of feminism”. The kind that argues for better women rights in some islamic countries isn’t annoying, except in the sense that it gets me angry at humanity, but that again that’s kind of expected in my society, so it doesn’t say much.)
- Nornagest 16 Feb 2013 1:35 UTC
  2 points
  Parent
  
  Thus, the chance that a female got an 800 on the Math SAT due to luck is higher than the chance that a male got an 800 on the Math SAT due to luck.
  
  Shouldn’t it be possible to estimate the magnitude of this effect by comparing score distributions on tests with differently sized question pools, or write-in versus multiple choice, or which are otherwise more or less susceptible to luck?
  - Vaniver 16 Feb 2013 3:50 UTC
    0 points
    Parent
    You’d need a model of how much luck depends on those factors. Test-retest variability gives a good measure of how much one person’s scores vary from test to test; apparently for the SAT the test-retest standard deviation is about 30 points. (We can’t quite apply this number, since it might not be independent of score, but it’s better than nothing.)
  - CronoDAS 16 Feb 2013 1:48 UTC
    0 points
    Parent
    That’s part of the whole “getting more information” thing.
    
    I think.
- gwern 16 Feb 2013 1:26 UTC
  2 points
  Parent
  The regression to the mean adjustment can be seen as a limited form of hierarchical/multilevel models with a fixed population mean, so any one score gets shrunk toward the population mean.
  
  (I was reading about them because apparently the pooling eliminates multiple comparison problems, and Gelman is a big fan of them.)