The effect holds unless the test is perfectly accurate.
WARNING: Rambly, half-thought-out answer here. It’s genuinely not something I’ve fully worked through myself, and I am totally open to feedback from you that I’m wrong.
The tl;dr version is that the effect is going to be small unless you have a very inaccurate test, and it’s suspicious to focus on a small effect when there’s probably other, larger effects we could be looking at.
Hmmm. Is that actually true? If we know the test has a 10% false positive rate for both red and blue weasels, doesn’t that suggest we should have 9 non-programmer blue weasels and 1 non-programmer red weasel?
Like, if I have a bag with 2 red marbles and 2 white marbles, the odds of drawing a red marble are 50⁄50. But if my first draw is a red marble, I can’t claim that it’s still 50⁄50, and I can’t update to say that drawing one red marble makes me MORE likely to draw a second one. The new odds are 1 in 3, no matter what math you run. The only correct update is the one that leaves you concluding 1 in 3.
It seems like, with a test like that, the test results… already factor in our prior distribution? I’m not sure if I’m being at all clear here :\
Absolutely, this isn’t always the case—if you just know that you have a 10% false positive rate, and it’s not calibrated for red false positives vs blue false positives, you DO have evidence that red false positives are probably more common. BUT, you’d still be a fool to exclude ALL red candidates on that basis, since you also know that you should legitimately have red candidates in your pool, and by accepting red candidates you increase the overall number of programmers you have access to.
It all depends on the accuracy of your test. If your test is sufficiently accurate that red weasels are only 1% more likely to be false positives, then this probably shouldn’t affect your actual decision making that much.
Then, if you decide to FOCUS on how red weasels have a +1% false positive rate, it implies that you consider this fact particularly important and relevant. It implies that this is a very central decision-making factor, and you’re liable to do things like “not hire red weasels unless they got an A+ on their test”, even though the math doesn’t support this. If you’re just doing cold, hard math, we’d expect this factor to be down near the bottom of the list, not plastered up on a neon marquee saying “we did the cold hard math, and all you red weasels can f**k off!”
If we assume two populations, red-weasel-haters and rationalists, we could even run Bayes’ Theorem and conclude that anyone who goes around feeling the need to point out that 1% difference is SIGNIFICANTLY more likely to be a red-weasel-hater, not a rationalist.
Then we can go into the utilitarian arguments about how feeding the red-weasel-haters political ammunition does actually increase their strength, and thus harms the red weasels, keeps them away from programming, and thus harms programming culture by reducing our pool of available programmers.
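To make the Bayes’-theorem point above concrete, here is a toy run of the numbers; every figure in it is a made-up assumption chosen only to show the mechanics, not an estimate of anything real.

```
# Toy Bayes' theorem run for the point above. Every number is a made-up
# assumption chosen to illustrate the mechanics, not an estimate of anything.
base_rate_hater       = 0.10   # assumed P(red-weasel-hater) among commenters
base_rate_rationalist = 0.90   # assumed P(rationalist)
p_point_out_given_hater       = 0.50   # assumed chance each type trumpets the +1% gap
p_point_out_given_rationalist = 0.02

evidence = (p_point_out_given_hater * base_rate_hater
            + p_point_out_given_rationalist * base_rate_rationalist)
posterior_hater = p_point_out_given_hater * base_rate_hater / evidence
print(f"P(hater | points out the 1% gap) = {posterior_hater:.2f}")   # ~0.74 under these assumptions
```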
The tl;dr version is that the effect is going to be small unless you have a very inaccurate test, and it’s suspicious to focus on a small effect when there’s probably other, larger effects we could be looking at.
Yes, the effect is small in absolute magnitude: if you look at the example SAT shrinking that Vaniver and I were working out, the difference between the male/female shrunk scores is something like 5 points (probably an underestimate, since it’s ignoring the difference in variance and only looking at means). But those 5 points could make a big difference depending on how the score is used or what other differences you look at.
For example, not shrinking could lead to a number of girls getting into Harvard who otherwise would not have, since Harvard has so many applicants and they all have very high SAT scores; there could well be a noticeable effect on the margin. When you’re looking at like 30 applications for each seat, 10 SAT points could be the difference between success and failure for a few applicants.
One could probably estimate how many by looking for logistic regressions of ‘SAT score vs admission chance’, seeing how much 10 points is worth, and multiplying against the number of applicants. 35k applicants in 2011 for 2.16k spots. One logistic regression has a ‘model 7’ taking into account many factors where going from 1300 to 1600 goes from an odds ratio of 1.907 to 10.381; so if I’m interpreting this right, an extra 10pts on your total SAT is worth an odds ratio of ((10.381 - 1.907) / (1600-1300)) * 10 + 1 = 1.282. So the members of a group given a 10pt gain are each 1.28x more likely to be admitted than they were before; before, they had a 2.16/35 = 6.17% chance, and now they have a (1.28 * 2.16) / 35 = 2.76 / 35 = 7.89% chance. To finish the analysis: if 17.5k boys apply and 17.5k girls apply and 6.17% of the boys are admitted while 7.89% of the girls are admitted, then there will be an extra (17500 * 0.0789) - (17500 * 0.0617) = 301 girls.
(A boost of more than 1% leading to 301 additional girls on the margin sounds too high to me. Probably I did something wrong in manipulating the odds ratios.)
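For what it’s worth, here is a minimal sketch of the step that the parenthetical above suspects went wrong: an odds ratio multiplies the odds, not the probability, so the conversion has to go probability to odds and back. The 1.282 figure is the one questioned and revised in the replies below, and is reused here only to show the mechanics, alongside the 1.088 figure the replies arrive at.

```
# Applying an odds ratio to a baseline admission probability the careful way:
# convert probability to odds, multiply by the odds ratio, convert back.
def apply_odds_ratio(p, oratio):
    odds = p / (1 - p)
    new_odds = odds * oratio
    return new_odds / (1 + new_odds)

baseline = 2.16 / 35              # ~6.17% overall admission rate, from the figures above
for oratio in (1.282, 1.088):     # the per-10-point estimate above, and the revised one from the replies
    print(f"OR {oratio}: {baseline:.4f} -> {apply_odds_ratio(baseline, oratio):.4f}")
# OR 1.282: 0.0617 -> 0.0778
# OR 1.088: 0.0617 -> 0.0668
```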
One could make the same point about means of bell curves differing a little bit: it may lead to next to no real difference towards the middle, but out on the tails it can lead to absurd differentials. I think I once calculated that a difference of one standard deviation in IQ between groups A and B puts the usual cutoff for ‘genius’ 3 deviations out for A but 4 deviations out for B, for a difference in representation of ~50x. One sd is a lot and certainly not comparable to 10 points on the SAT, but you see what I mean.
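A quick way to check the rough size of that tail effect, assuming normal distributions with equal variances and using scipy:

```
# Ratio of the normal tail areas at 3 vs 4 standard deviations.
from scipy.stats import norm

frac_A = norm.sf(3)   # fraction of group A past a cutoff 3 SD above its mean
frac_B = norm.sf(4)   # fraction of group B past the same cutoff, 4 SD above its mean
print(frac_A / frac_B)   # ~43, i.e. the same order of magnitude as the ~50x quoted above
```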
But if my first draw is a red marble
How do you know your first draw is a red marble?
BUT, you’d still be a fool to exclude ALL red candidates on that basis, since you also know that you should legitimately have red candidates in your pool, and by accepting red candidates you increase the overall number of programmers you have access to.
Depends on what you’re going to do with them, I suppose… If you can only hire 1 weasel, you’ll be better off going with one of the blue weasels, no? While if you’re just giving probabilities (I’m straining to think of how to continue the analogy: maybe the weasels are floating Hanson-style student loans on prediction markets and you want to see how to buy or sell their interest rates), sure, you just mark down your estimated probability by 1% or whatever.
If we assume two populations, red-weasel-haters and rationalists, we could even run Bayes’ Theorem and conclude that anyone who goes around feeling the need to point out that 1% difference is SIGNIFICANTLY more likely to be a red-weasel-hater, not a rationalist.
Alas! When red-weasel-hating is supported by statistics, only people interested in statistics will be hating on red-weasels. :)
an extra 10pts on your total SAT is worth an odds ratio of 1.282
We can check this interpretation by taking it to the 30th power, and seeing if we recover something sensible; unfortunately, that gives us an odds ratio of over 1700! If we had their beta coefficients, we could see how much 10 points corresponds to, but it doesn’t look like they report it.
Logistic regression is a technique that compresses the real line down to the range between 0 and 1; you can think of that model as the schools giving everyone a score, admitting people above a threshold with probability approximately 1, admitting people below a threshold with probability approximately 0, and then admitting people in between with a probability that increases based on their score (with a score of ‘0’ corresponding to a 50% chance of getting in).
We might be able to recover their beta by taking the log of the odds they report (see here). This gives us a reasonable but not too pretty result, with an estimate that 100 points of SAT is worth a score adjustment of .8. (The actual amount varies for each SAT band, which makes sense if their score for each student nonlinearly weights SAT scores. The jump from the 1400s to the 1500s is slightly bigger than the jump from the 1300s to the 1400s, suggesting that at the upper bands differences in SAT scores might matter more.)
A score increase of .08 cashes out as an odds ratio of 1.083; taking that to the 30th power gives 11.023, which is pretty close to what we’d expect.
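A short sketch reproducing that arithmetic from the odds ratios quoted upthread (1.907, 4.062, and 10.381, which I am reading as the 1300s, 1400s, and 1500s bands, the way the later comments do):

```
# Reproducing the log-odds arithmetic above from the reported odds ratios.
import math

odds_ratios = {"1300s": 1.907, "1400s": 4.062, "1500s": 10.381}
log_odds = {band: math.log(oratio) for band, oratio in odds_ratios.items()}
print(log_odds)   # per-100-point jumps of ~0.76 and ~0.94, i.e. roughly the 0.8 used above

beta_per_10 = 0.08                     # ~0.8 per 100 points => ~0.08 per 10 points
or_per_10 = math.exp(beta_per_10)      # ~1.083
print(or_per_10, or_per_10 ** 30)      # ~1.083 and ~11.02, matching the check above
```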
I think I once calculated that a difference of one standard deviation in IQ between groups A and B puts the usual cutoff for ‘genius’ 3 deviations out for A but 4 deviations out for B, for a difference in representation of ~50x.
Two standard deviations is generally enough to get you into ‘gifted and talented’ programs, as they call them these days. Four standard deviations gets you to finishing in the top 200 of the Putnam competition, according to Griffe’s calculations, which are also great at illustrating male/female ratios at various levels given Project Talent data on math ability.
I’ll also note again that the SAT is probably not the best test to use for this; it gives a male/female math ability variance ratio estimate of 1.1, whereas Project Talent estimated it as 1.2. Which estimate you choose makes a big difference in your estimation of the strength of this effect. (Note that, typically, more females take the SAT than males, because the cutoff for interest in the SAT is below the population mean, where greater male variability hurts rather than helps, among other factors, and this systematic bias in subject selection will show up in the results.)
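To illustrate how much the choice between a 1.1 and a 1.2 variance ratio matters far out in the tail, here is a small sketch; it holds the means equal and puts the cutoff 4 female SDs out, both of which are purely illustrative simplifications.

```
# How much the variance-ratio estimate matters far out in the tail, holding the
# means equal (an illustrative simplification) and putting the cutoff 4 female
# SDs above the common mean.
from scipy.stats import norm

cutoff = 4.0   # in units of the female SD
for var_ratio in (1.1, 1.2):
    male_sd = var_ratio ** 0.5
    ratio = norm.sf(cutoff / male_sd) / norm.sf(cutoff)   # male:female ratio past the cutoff
    print(f"variance ratio {var_ratio}: male/female ratio at the cutoff ~ {ratio:.1f}")
# ~2.2 for a 1.1 variance ratio vs ~4.1 for 1.2: the choice roughly doubles the tail effect.
```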
Thanks for the odds corrections. I knew I got something wrong...
Two standard deviations is generally enough to get you into ‘gifted and talented’ programs, as they call them these days.
G&T stuff, yeah, but in the materials I’ve read 2sd is not enough to move you from ‘bright’ or ‘gifted and talented’ to ‘genius’ categories, which seems to usually be defined as >2.5-3sd, and using 3sd made the calculation easier.
Eh. MENSA requires upper 2% (which is ~2 standard deviations). Whether you label that ‘genius’ or ‘bright’ or something else doesn’t seem terribly important. 3.5 standard deviations is the 2.3 out of 10,000 level, which is about a hundred times more restrictive.
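The normal tail areas behind those cutoffs, for anyone who wants to check the arithmetic:

```
# Normal tail areas for the cutoffs mentioned above.
from scipy.stats import norm
print(norm.sf(2.0))                 # ~0.023: the upper ~2%, Mensa-style cutoff
print(norm.sf(3.5))                 # ~0.00023, i.e. ~2.3 out of 10,000
print(norm.sf(2.0) / norm.sf(3.5))  # ~98: about a hundred times more restrictive
```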
I’d call MENSA merely bright… You need something in between ‘normal’ and ‘genius’ and bright seems fine. Genius carries all the wrong connotations for something as common as MENSA-level; 2.3 out of 10k seems more reasonable.
Harvard… When you’re looking at like 30 applications for each seat, 10 SAT points could be the difference between success and failure for a few applicants.
Only if Harvard cares a lot about SAT scores. According to this graph, the value of SATs is pretty flat between the 93rd and 96th percentiles. Moreover, at other Ivies, SAT scores are penalized in this range. source, page 7(8)
This graph is not a direct measure of the role of SATs, because they can’t force all else to be equal. The paper argues that some schools really do penalize SAT scores in some regimes. I do not buy the argument, but the graph convinces me that I don’t know how it works. Many people respond to the graph that it is the aggregation of two populations admitted under different scoring rules, both of which value SATs, but I do not think that explains the graph.
Only if Harvard cares a lot about SAT scores. According to this graph, the value of SATs is pretty flat between the 93rd and 96th percentiles. Moreover, at other Ivies, SAT scores are penalized in this range. source, page 7(8)
Your graph doesn’t show that the average applicant won’t benefit from 10 points. It shows that overall, SAT scores make a big difference (from ~0 to 0.2, and it doesn’t even bother to show anyone below the 88th percentile).
This graph is not a direct measure of the role of SATs, because they can’t force all else to be equal.
The paper I cited earlier for logistic regressions used models controlling for other things. Given the benefits to athletes, legacies, and minorities (benefits presumably necessary because they cannot compete as well on other factors, like SAT scores), it’s not necessarily surprising if aggregating these populations can lead to a raw graph like those you show. Note that the most meritocratic school, which places the least emphasis on ‘holistic’ admissions (enabling them to discriminate in various ways), is MIT, and their curve looks dramatically different from, say, Princeton’s.
Yes, if large SAT changes matter, then there must be some small changes that matter. But it is possible that there are other points on the scale where they don’t, or are harmful. I’m sorry if I failed to indicate that I meant only this limited point.
If a school admits two populations, then the histogram of SATs of its students might look like a camel. But why should the graph of chance of admission? I suppose Harvard’s graph makes sense if students apply when their assessment of their ability to get in crosses some threshold. Then applying screens off SATs, at least in some normal regime.* But at Yale and especially Princeton, rising SATs in the middle regime predicts greater mistaken belief in ability to get in. Legacies (but not athletes or AA) might explain the phenomenon by only applying to one elite school, but I don’t think legacies alone are big enough to cause the graph.
Here are the lessons I take away from the graphs that I would apply if I had been doing the regressions and wanted to explain the graphs. First, schools have different admissions policies, even schools as similar as Harvard and Yale. Averaging them together, as in the paper, may make things appear smoother than they really are. Second, given the nonlinear effect of SATs, it is good that the regression used buckets rather than assuming a linear effect. Third, since the bizarre downward slope is over the course of less than 100 points, the 100 point buckets of the regression may be too coarse to see it. Fourth, they could have shown graphs, too. It would have been so much more useful to graph the SAT scores of athletes, and the probability of admission as a function of SAT scores for athletes. The main value of regressions is using the words “model” and “p-value.” Fifth, the other use of the regression model is that it lets them consider interactions, which do seem to say that there is not much interaction between SATs and other factors, that the marginal value of an SAT point does not depend on race, legacy status, or athlete status (except for the tiny <1000 category). But the coarseness of the buckets and the aggregating of schools does not allow me to draw much of a conclusion from this.
* Actually, the whole point of this thread is that you can’t completely screen off. But I want to elaborate on “normal regime.” At the high end, screening breaks down because if, say, 1500 SAT is enough to cross the threshold, everyone with 1500+ SAT applies and there is no screening phenomenon. At the low end, I don’t see why screening would break down. Why would someone with SAT<1000 apply to an elite school without really good reason? Yet lots of people apply with such low scores and don’t get in.
But it is possible that there are other points on the scale where they don’t, or are harmful.
Sure, there could be non-monotonicity.
If a school admits two populations, then the histogram of SATs of its students might look like a camel. But why should the graph of chance of admission?...Fifth, the other use of the regression model is that it lets them consider interactions, which do seem to say that there is not much interaction between SATs and other factors, that the marginal value of an SAT point does not depend on race, legacy status, or athlete status (except for the tiny <1000 category).
Imagine that Harvard lets in equal numbers of ‘athletes’ and ‘nerds’, the 2 groups are different populations with different means, and they do something like pick the top 10% in each group by score. Clearly there’s going to be a bimodal histogram of SAT scores: you have a lump of athlete scores in the 1000s, say, and a lump of nerd scores in the 1500s. Sure. 2 equal populations, different means, of course you’re going to see a bimodal.
Now imagine Harvard gets 10x more nerd applicants than athletic applicants; since each group gets the same number of spots, a random nerd will have 1⁄10 the admission chance of an athlete. Poor nerds. But Harvard kept the admission procedure the same as before. So what happens when you look at admission probability if all you know is the SAT score? Well, if you look at the 1500s applicants, you’ll notice that an awful lot of them aren’t admitted; and if you look at the 1000s applicants, you’ll notice that an awful lot of them are getting in. Does Harvard hate SAT scores? No, of course not: we specified they were picking mostly the high scorers, and indeed, if we classify each applicant into nerd or athlete categories and then look at admission rates by score, we’d see that yes, increasing SAT scores is always good: the nerd with a 1200 had better apply to other colleges, and the athlete with 1400 might as well start learning how to yacht.
So even though in aggregate in our little model, high SAT scores look like a bad thing, for each group higher SAT scores are better.

Reminds me of Simpson’s paradox.
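Here is a minimal simulation of that nerds-and-athletes toy model; the score distributions, group sizes, and number of spots are all made-up parameters, chosen only to reproduce the qualitative effect described above.

```
# Minimal simulation of the nerds-vs-athletes toy model above. All parameters
# (score distributions, applicant counts, spots) are made up for illustration.
import numpy as np
rng = np.random.default_rng(0)

nerds    = rng.normal(1500, 60, 10_000)    # 10x more nerd applicants
athletes = rng.normal(1000, 100, 1_000)
spots_per_group = 100                      # equal number of spots for each group

nerd_cut    = np.sort(nerds)[-spots_per_group]      # admit the top scorers within each group
athlete_cut = np.sort(athletes)[-spots_per_group]

scores = np.concatenate([nerds, athletes])
admits = np.concatenate([nerds >= nerd_cut, athletes >= athlete_cut])

for lo in range(900, 1700, 100):           # aggregate P(admit | score band), ignoring group
    band = (scores >= lo) & (scores < lo + 100)
    if band.any():
        print(f"{lo}-{lo + 100}: {admits[band].mean():.2f}")
# Within each group a higher score never hurts, but the aggregate admit rate
# falls as scores climb out of the athlete range and into the nerd range.
```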
But the coarseness of the buckets and the aggregating of schools does not allow me to draw much of a conclusion from this.
Yes, I don’t think we could make a conclusive argument against the claim that SAT scores may not help at all levels, not without digging deep into all the papers running logistic regressions; but I regard that claim as pretty darn unlikely in the first place.
At the low end, I don’t see why screening would break down. Why would someone with SAT<1000 apply to an elite school without really good reason? Yet lots of people apply with such low scores and don’t get in.
They could be deluding themselves, doing it to appease a deluded parent (‘My Johnnie Yu must go to Harvard and become a doctor!’), gambling that a tiny chance of admission is worth the effort, doing it on a dare, expecting that legacies or other things are more helpful than they actually are...
Sure, maybe you can make a model that outputs Harvard or Princeton’s results, but how do you explain the difference between Harvard and Princeton? It is easier to get into Princeton as either a jock or a nerd, but at the 98th SAT percentile, it is harder to get into Princeton than Harvard. These are the smart jocks or dumb nerds. Maybe Harvard has first dibs on the smart jocks so that the student body is more bimodal at other schools. But why would admissions be more bimodal? Does Princeton not bother to admit the smart jocks? That’s the hypothesis in the paper: an SAT penalty. Or maybe Princeton rejects the dumb nerds. It would be one thing if Princeton, as a small school, admitted fewer nerds and just had higher standards for nerds. But they don’t at the high end. What’s going on? Here’s a hypothesis: Harvard (like Caltech) could admit nerds based on other achievements that only correlate with SATs, while Princeton has high pure-SAT standards.
I don’t think an SAT penalty is very plausible, but nothing I’ve heard sounds plausible. Mostly people make vague models like yours that I don’t think explain all the observations. The hypothesis that Princeton in contrast to Harvard does not count SAT for jocks beyond a graduation threshold at least does not sound insane.
not without digging deep into all the papers running logistic regressions
I take graphs over regressions, any day. Regressions fit a model. They yield very little information. Sometimes it’s exactly the information you want, as in the calculation you originally brought in the regression for. But with so little information there is no possibility of exploration or model checking.
By the way, the paper you cite is published at a journal with a data access provision.
Sure, maybe you can make a model that outputs Harvard or Princeton’s results, but how do you explain the difference between Harvard and Princeton?
Dunno. I’ve already pointed out the quasi-Simpson’s-paradox effect that could produce a lot of different shapes even while SAT score increases always help. Maybe Princeton favors musicians or something. If the only reason to look into the question is your incredulity and interest in the unlikely possibility that an increase in SAT score actually hurts some applicants, I don’t care nearly enough to do more than speculate.
By the way, the paper you cite is published at a journal with a data access provision.
I have citations in my DNB FAQ on how such provisions are honored mostly in the breach… I wonder what the odds are that you could get the data and that it would be complete and useful.
One logistic regression has a ‘model 7’ taking into account many factors where going from 1300 to 1600 goes from an odds ratio of 1.907 to 10.381; so if I’m interpreting this right, an extra 10pts on your total SAT is worth an odds ratio of ((10.381 − 1.907) / (1600-1300)) * 10 + 1 = 1.282.
Aren’t odds ratios multiplicative? It also seems to me that we should take the center of the SAT score bins to avoid an off-by-one bin width bias, so (10.381 / 1.907) ^ (10 / (1550 − 1350)) = 1.088. (Or compute additively with log-odds.)
As Vaniver mentioned, this estimate varies across the SAT score bins. If we look only at the top two SAT bins in Model 7: (10.381 / 4.062) ^ (10 / (1550 − 1450)) = 1.098.
Note that within the logistic model, they binned their SAT score data and regressed on them as dichotomous indicator variables, instead of using the raw scores and doing polynomial/nonparametric regression (I presume they did this to simplify their work because all other predictor variables are dichotomous).
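The two bin-center interpolations above, as they’d be typed into an interpreter:

```
# The two bin-center interpolations above, as plain arithmetic.
print((10.381 / 1.907) ** (10 / (1550 - 1350)))   # ~1.088, spanning the 1300s-1500s bins
print((10.381 / 4.062) ** (10 / (1550 - 1450)))   # ~1.098, using only the top two bins
```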
Aren’t odds ratios multiplicative? It also seems to me that we should take the center of the SAT score bins to avoid an off-by-one bin width bias, so (10.381 / 1.907) ^ (10 / (1550 − 1350)) = 1.088. (Or compute additively with log-odds.)
Yeah; Vaniver already did it via log odds.
If we look only at the top two SAT bins in Model 7: (10.381 / 4.062) ^ (10 / (1550 − 1450)) = 1.098.
Which is higher than the top bin of 1.088 so I guess that makes using the top bin an underestimate (fine by me).
Note that within the logistic model, they binned their SAT score data and regressed on them as dichotomous indicator variables, instead of using the raw scores and doing polynomial/nonparametric regression
Alas! I just went with the first paper on Harvard I found in Google which did a logistic regression involving SAT scores (well, second: the first one confounded scores with being legacies and minorities and so wasn’t useful). There may be a more useful paper out there.
I’d understood the question to be “given identical scores”, not “given a 10 point average difference in favor of the blue weasel”.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
My intuition says no: while I’d expect fewer females to be in that range to begin with, I can’t see any reason to assume their scores would cluster towards the lower end of the range compared to males.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
So, first let’s ask this question, supposing that the test is perfectly accurate. We’ll run through the numbers separately for the two subtests (so we don’t have to deal with correlation), taking means and variances from here.
Of those who scored 600-700 on the hypothetical normally distributed math SAT (hence “HNDMSAT”), the male mean was 643.3 (with 20% of the male population in this band), and the female mean was 640.6 (with 14.8% of the female population in this band).
Of those who scored 600-700 on the HNDVSAT, the male mean was 641.0 (with 14.9% of the male population in this band), and the female mean was 640.1 (with 13.7% of the female population in this band).
When we introduce the test error into the process, the computation gets a lot messier. The quick and dirty way to do things is to say “well, let’s just shrink the mean band scores towards the population mean with the reliability coefficient.” This turns the male edge of 2.7 on the HNDMSAT into 5.4, and the male edge of .9 on the HNDVSAT into 1.8. (I think it’s coincidental that this is roughly doubling the edge.)
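A sketch of that band calculation using a truncated normal. The means and SDs below are round-number assumptions in the ballpark of published College Board math figures (the exact ones came from the link above, which I have not reproduced), and the 0.9 reliability coefficient is likewise an assumption; with these inputs the outputs land near the band means quoted above.

```
# Band means via a truncated normal. The means/SDs are round-number assumptions
# in the ballpark of published College Board math figures (the exact ones came
# from the link above), and the 0.9 reliability is likewise assumed.
from scipy.stats import norm, truncnorm

def band_stats(mu, sd, lo=600, hi=700):
    a, b = (lo - mu) / sd, (hi - mu) / sd
    frac = norm.cdf(b) - norm.cdf(a)                # fraction of the group in the band
    mean = truncnorm.mean(a, b, loc=mu, scale=sd)   # mean score within the band
    return frac, mean

male_mu, male_sd     = 532, 117   # assumed
female_mu, female_sd = 499, 112   # assumed

frac_m, mean_m = band_stats(male_mu, male_sd)
frac_f, mean_f = band_stats(female_mu, female_sd)
print(frac_m, mean_m)   # ~0.20 of males in the band, band mean ~643
print(frac_f, mean_f)   # ~0.15 of females in the band, band mean ~640

r = 0.9   # assumed reliability coefficient
shrunk_m = male_mu + r * (mean_m - male_mu)
shrunk_f = female_mu + r * (mean_f - female_mu)
print(mean_m - mean_f, shrunk_m - shrunk_f)   # a raw edge of a couple points grows to ~5 after shrinking
```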
My intuition says no: while I’d expect fewer females to be in that range to begin with, I can’t see any reason to assume their scores would cluster towards the lower end of the range compared to males.
That’s because you’re not thinking in bell curves. The range is all on one side of the mean, the male mean is closer to the bottom of the band, and the male variation is higher.
I’d understood the question to be “given identical scores”, not “given a 10 point average difference in favor of the blue weasel”.
My point was that ‘suppose that the true shrinkage leads to an adjusted difference of 10 points between the two groups; how much of a gift does 10 extra points represent?’ Using the nominal score rather than the true score has the effect of inflating the score. Once you’ve established how much the inflation might be, it’s natural to wonder how much real-world consequence it might have, which leads into the Harvard musings.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
Depends on the mean and standard deviations of the 2 distributions, and then you could estimate how often the male sample average will be higher than the female sample average and vice versa.
The question should be ‘if we retest these 1200-1400 scorers, what will happen?’ The scores will probably drop as they regress to their mean due to an imperfect test. That’s the point.
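A small simulation of that retest point; all of the parameters are illustrative, but the qualitative result (selected high scorers drop on retest, and the lower-mean group drops further) does not depend on them.

```
# Simulating the retest: latent ability plus independent test noise, select on
# the first score, look at the second. All parameters are illustrative.
import numpy as np
rng = np.random.default_rng(1)

def retest(pop_mean, n=200_000, sd_true=190, sd_noise=60):
    ability = rng.normal(pop_mean, sd_true, n)
    test1 = ability + rng.normal(0, sd_noise, n)
    test2 = ability + rng.normal(0, sd_noise, n)
    band = (test1 >= 1200) & (test1 <= 1400)
    return test1[band].mean(), test2[band].mean()

for pop_mean in (1000, 1050):   # two hypothetical groups with different population means
    first, second = retest(pop_mean)
    print(f"pop mean {pop_mean}: band mean {first:.0f} -> retest {second:.0f}")
# Both groups' 1200-1400 scorers drop on retest, and the lower-mean group drops
# further: the asymmetry being argued about above.
```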
The question should be ‘if we retest these 1200-1400 scorers, what will happen?’ The scores will probably drop as they regress to their mean due to an imperfect test. That’s the point.
Ahhh, that makes the statistics click in my brain, thanks :)
Do you know if there is much data out there on real-world gender differences vis-a-vis regression to the mean on IQ / SAT / etc. tests? i.e. is this based on statistics, or is it borne out in empirical observations?
Do you know if there is much data out there on real-world gender differences vis-a-vis regression to the mean on IQ / SAT / etc. tests? i.e. is this based on statistics, or is it borne out in empirical observations?
I haven’t seen any, offhand. Maybe the testing company provides info about retests, but then you’re going to have different issues: anyone who takes the second test may be doing so because they had a bad day (giving you regression to the mean from the other direction) and may’ve boned up on test prep since, and there’s the additional issue of the test-retest effect—now that they know what the test is like, they will be less anxious and will know what to do, and test-takers in general may score better. (Since I’m looking at that right now, my DNB meta-analysis offers a case in point: in many of the experiments, the controls have slightly higher post-test IQ scores. Just the test-retest effect.)