I’m ok with the general emotional tone (lack of tone?) here. I think I read the style of discussion as “we’re all here to be smart at each other, and we respect each other for being able to play”.
However, the gender issues have been beyond tiresome. My default is to assume that men and women are pretty similar. LW has been the first place which has given me the impression that men and women are opposed groups. I still think they’re pretty similar. The will to power is a shared trait even if it leads to conflict between opposed interests.
LW was the first place I’ve been where women caring about their own interests is viewed as a weird inimical trait which it’s only reasonable to subvert, and I’m talking about PUA.
I wish I could find the link, but I remember telling someone he’d left women out of his utilitarian calculations. He took it well, but I wish it hadn’t been my job to figure it out and find a polite way to say it.
Remember that motivational video Eliezer linked to? One of the lines toward the end was “If she puts you in the friend zone, put her in the rape zone.” I can’t imagine Eliezer saying that himself, and I expect he was only noticing and making use of the “go for it” and “ignore your own pain” slogans—but I’m still shocked and angry that it’s possible to not notice something like that. It’s all a matter of who you identify with. Truth is truth, but I didn’t want to find out that the culture had become that degraded.
And going around and around with HughRistik about PUA… I think of him as polite and intelligent, and it took me a long time to realize that I kept saying that what I knew about PUA was what I’d read at LW, and he kept saying that it wasn’t all like Roissy, who I kept saying I hadn’t read. I grant that this is well within the normal range of human pigheadedness, and I’m sure I’ve done such myself because it can be hard to register that people hate what you love, but it was pretty grating to be on the receiving end of it.
There was that discussion of ignoring good test results from a member of a group if you already believe that they’re bad at whatever was being tested. (They were referred to as blues, but it seemed to be a reference to women and math.) It was a case of only identifying with the gatekeeper. No thought about the unfairness or the possible loss of information. I think it finally occurred to someone to give a second test rather than just assuming it was a good day or good luck.
Unfortunately, I don’t have an efficient way of finding these discussions I remember—I’ll be grateful if anyone finds links, and then we can see how accurate my memories were.
All this being said, I think LW has also become Less Awful so far as gender issues are concerned. I’m not sure how much anyone has been convinced that women have actual points of view (partly my fault because I haven’t been tracking individuals) since there are still the complaints about what one is not allowed to say.
Remember that motivational video Eliezer linked to? One of the lines toward the end was “If she puts you in the friend zone, put her in the rape zone.” I can’t imagine Eliezer saying that himself, and I expect he was only noticing and making use of the “go for it” and “ignore your own pain” slogans—but I’m still shocked and angry that it’s possible to not notice something like that.
My apologies for that! You’re correct that I didn’t notice that on a different level than, say, the parts about killing your friends if they don’t believe in you or whatever else was in the Courage Wolf montage. I expect I made a ‘bleah’ face at that and some other screens which demonstrated concepts exceptionally less savory than ‘Courage’, but failed to mark it as something requiring a trigger warning. I think this was before I’d even heard of the concept of a “trigger warning”, which I first got to hear about after writing Ch. 7 of HPMOR.
Generally speaking, I’ve noticed that mentioning rape tends to mind-kill people on the Internet much more than mentioning murder. I hypothesize this is due to the fact that many more people are actually raped than murdered.
And that people who have been raped are much (infinitely?) more likely to go on to participate in discussions on rape than people who have been murdered are likely to participate in discussions on murder. Also, that rape is more likely to bring in gender politics.
And that people who have been raped are much (infinitely?) more likely to go on to participate in discussions on rape than people who have been murdered are likely to participate in discussions on murder.
What about people who have had friends or relatives murdered?
Presumably there’s as many such relatives as for the rape victims. (Unless lonely orphans are singled out by murderers? In order to inherit the family fortune, if I’ve learned anything about the real world from false made-up stories...)
Presumably there’s as many such relatives as for the rape victims.
This could be due to media filters, but I hear about people traumatized by the murder of their friends and family much more often than people traumatized by the rape of others.
...or people who survived attempted murder, for that matter. (Still probably many fewer of them in the average internet discussion than people who survived rape or attempted rape.)
I think there’s been a cultural shift—mentions of rape are taken a lot more seriously than they were maybe 20 years ago. (I’m sure of the shift, and less sure of the time scale.)
I believe part of it has been a feminist effort to get rape of women by men taken seriously which has started to get rape of men by men taken seriously. Rape by women is barely on the horizon so far.
PTSD being recognized as a real thing has made a major contribution—it meant that people could no longer say that rape is something which should just be gotten over. Another piece is an effort to make being raped not be a major status-lowering event, which made people more likely to talk about it.
As for comparison to murder, I’ve seen relatives of murdered people complain that murder jokes are still socially acceptable.
As far as I can tell, horrific events can be used as jokes when they aren’t vividly imagined, and whether something you haven’t experienced is vividly imagined is strongly affected by whether the people around you encourage you to imagine it or not.
[A]t the Parents of Murdered Children Conference, they have [a presentation
on] murder mystery dinners. And the way that they always do it is they say,
let’s just pretend that you were going to have a rape mystery dinner and you
were going to show up and the rule of the game was going to be that someone’s
been raped, and we’re all going to find the rapist. That wouldn’t go over.
Nobody would do it. Everybody would feel that that was deeply distasteful.
As far as I can tell, horrific events can be used as jokes when they aren’t vividly imagined, and whether something you haven’t experienced is vividly imagined is strongly affected by whether the people around you encourage you to imagine it or not.
I’m not sure about that. It seems like in places and times where horrific events are much more common, people take an almost gallows humor attitude towards the whole thing (at least the violence part). Things like PTSD seem to happen when people in cultures where horrific events are rare temporarily get exposed to them.
Oh, right. I interpreted it as saying that horrific events are only traumatic when you’re from a culture where they’re rare, not that repeated traumatic events somehow lower one’s levels of PTSD. That would be nonsense, obviously.
Right. One idea I had is that what causes PTSD is not so much the traumatic experience as being surrounded by people who can’t relate to it.
A more Hansonian version is that exhibiting PTSD is a strategy to gain attention and sympathy and that this strategy won’t work if everyone around has also suffered similar experiences.
Another possibility is that in cultures where traumatic events are common, people who can’t deal with them without suffering PTSD are likely to get killed off by the next one.
There are probably many reasons involved, but I’d point out that in our media we frequently glamorize protagonists who kill people, but generally not ones who rape people.
There may be some cultural variation in this; I recall reading an African folk tale wherein, early on, the protagonist rapes his own mother. Afterwards he proceeds to navigate various perils with feats of cunning and derring-do, and I spent the rest of the story asking “how am I supposed to root for this guy? He raped his own mother! For no apparent reason, even!”
Tell me about that… Last night I was watching Big Miracle and I was like “how am I supposed to root for the whales? It’d probably cost a lot to save them, and with that much money you could save people!” Until the youngest whale was shown to be ill, then I did. I guess that illustrates the Near vs Far distinction even though that wasn’t the point!
“how am I supposed to root for this guy? He raped his own mother! For no apparent reason, even!”
BTW (continuing along the rape vs murder thing), have you read (say) Crime and Punishment, and if so, were you able to root for the protagonist? (I was.)
This difference in commonality extends not only to victims but to perpetrators. A higher proportion of people who find rape funny will be rapists than those who find murder funny will be murderers; murder is much harder to get away with.
I hypothesize this is due to the fact that many more people are actually raped than murdered.
I think this has to do with the way we handle things related to sex. For example, if we were having this discussion 100 years ago, we might be talking about why portrayals of adultery are unacceptable in contexts where portrayals of murder would be.
I agree with your conclusion, but that particular example doesn’t counterexemplify my point because I guess many more people were actually cuckolded than murdered!
Apology accepted. I hadn’t thought about it that way, but I can see how you could have filed it under “generic hyperbolic obnoxious”.
At the time, I was just too tired of discussing gender issues to be more direct about that part of the video.
Looking at the discussion a year and a half later, I was somewhat amazed at the range of reactions to the video. Apropos of a recent Facebook discussion about the found cat and lotteries, there might be a clue about why people use imprecise hyperbolic language so much—it’s more likely to lead to action. I’ve also noticed that it doesn’t necessarily feel accurate to describe strong emotions in accurate, outside-view language.
There ought to be something intelligent and abstract to say about filtering mechanism conflicts, but I can’t think of what it might be right now. E.g., a mention once came up of os-tans on HN, someone said “What’s an os-tan?”, I posted a link to a page of OS-tans, and then replies complained that the page was NSFW and needed a warning. I was like “What? All those os-tans are totally safe for work, I checked”. Turns out there was a big ol’ pornographic ad at the top of the page which my eyes had probably literally skipped over, as in just never saccaded there.
That Courage Wolf video probably has a pretty different impact depending on whether or not you automatically skip over and mostly don’t even notice all the bad parts.
And in another ten years a naked person walking down the street will be invisible.
LW was the first place I’ve been where women caring about their own interests is viewed as a weird inimical trait which it’s only reasonable to subvert, and I’m talking about PUA.
It seems like in the best case, PUA would be kind of like makeup. Lots of male attraction cues are visual, so they can be gamed when women wear makeup, do their hair, or wear an attractive outfit. Lots of female attraction cues are behavioral, so they can be gamed by acting or becoming more confident and interesting.
If you want to understand the appeal of the PUAs, you have to remember that it does work. Mixed in with the cod psychology and jargon are some boring but sensible tips. I would say the big four are:
Approach lots of women
Act confident
Have entertaining things to say
Dress and groom well
There are quite a few guys who haven’t really practiced those four things, which do take a bit of effort and experience. So when they start to follow the PUA movement, they absorb the nonsense, start doing the sensible, practical things, and find that they’re getting a whole lot more sex. So they conclude that the nonsense is absolutely true.
Do you have ethical problems with any of 1-4?
Ed. - It’s possible that when HughRistik said “not all PUA advice is like Roissy’s”, he meant “the PUA stuff we’re discussing on Less Wrong is Roissy-type stuff, and not all PUA stuff is like that”.
I’m actually at the point where I think it is impossible to give men useful advice to improve their sex lives and relationships, because of the social dynamics that arise in nearly all societies. Genuinely good advice, aimed at optimizing the life outcomes of the men who receive it, has never been both discussed in public spaces and considered reputable.
Same can naturally be said of advice for women. I think most modern dating advice both for men and women is anti-knowledge in that the more of it you follow the more miserable you will end up being. I would say follow your instincts but that doesn’t work either in our society since they are broken.
Advice about how to look better seems trivially useful and reputable… Overall, I find your claim that the intersection of palatable dating advice and useful dating advice is empty extremely implausible. What else would Clarisse Thorn’s “ethical PUA advice” be?
At the very least there should be some reasonably effective advice that’s only minimally unpalatable or whatever, like become a really good guitarist and impress girls with your guitar skillz.
Regarding PUA and evolutionary psychology: I don’t see how a self-selected population that’s under the influence of alcohol, and has been living with all kinds of weird modern norms and technology, has all that much in common with the EEA.
Regarding PUA and evolutionary psychology: I don’t see how a self-selected population that’s under the influence of alcohol, and has been living with all kinds of weird modern norms and technology, has all that much in common with the EEA.
Good point that I hadn’t thought of. And also, most mating in the EEA would be with people that you’d had and expect to have extended interactions with—this is probably very different from trying to pick up strangers.
I would say follow your instincts but that doesn’t work either in our society since they are broken.
I’d go with “keep your eyes on the road, your hands upon the wheel”, i.e.¹ use the evidence that you see to update your model of the world,² and your model of the world to decide which possible behaviours would be most likely to achieve your goals. This applies to any goal whatsoever (not just dating), and ought to be obvious to LW readers, but people may tend to forget this in certain contexts due to ugh fields.
This is probably not what Jim Morrison meant by that, but still.
Note that the world also includes you. Noticing what this fact implies is left as an exercise for the reader.
use the evidence that you see to update your model of the world,² and your model of the world to decide which possible behaviours would be most likely to achieve your goals
I endorse this advice. Note however some consider this in itself unethical when it comes to interpersonal relations. I have no clue why.
Note however some consider this in itself unethical when it comes to interpersonal relations. I have no clue why.
I think I may have just figured out why. Think about the evolutionary purpose of niceness. Thinking about the nice vs. candid argument here, I suspect the purpose of niceness is to provide a credible precommitment to cooperate with someone in the future by sabotaging one’s own reasoning in such a way that will make one overestimate the value of cooperating with the other person.
Hmm, yeah. Causal decision theory doesn’t work right in several-player games and you shouldn’t defect in the Prisoner’s Dilemma, but that was one of the things I alluded to in Footnote 2; “would” in my comment was intended to be interpreted as explained in Good and Real.
If all PUA said was those 4 things, it wouldn’t be interesting or controversial, so I think it’s pretty ridiculous to respond to a conversation about PUA mentioning the parts few people would disagree with. Trickery, lies, insults, treating people as things, these are the sorts of problems people have with PUA.
If all PUA said was those 4 things, it wouldn’t be interesting or controversial
This sounds reasonable until you actually think about the four points mentioned in Near mode. Consider:
What does approaching lots of women actually look like if done in a logistically sound way? How does this relate to social norms? How does this relate to how feminists would like social norms to be?
Observe what actually confident humans do to signal their confidence. Just do.
Observe what is actually considered entertaining in the club environment that most PUA is designed to work in.
You know most of the things considered disreputable that PUAs advocate are precisely the result of first observing how points one to three actually work in our society and then optimizing to mimic this.
Only dressing and grooming well is probably not inherently controversial and even then pick up artists are mocked for their attempts to reverse engineer fashion that signals what they want to signal.
My default is to assume that men and women are pretty similar.
How do you reconcile this view with the way questions of tone have become entangled with gender issues in this very thread?
There was that discussion of ignoring good test results from a member of a group if you already believe that they’re bad at whatever was being tested. (They were referred to as blues, but it seemed to be a reference to women and math.) It was a case of only identifying with the gatekeeper.
It was also an extremely straightforward application of Bayes’s theorem.
No thought about the unfairness
The problem is that the concept of “fairness” you are using there is incompatible with VNM-utilitarianism. (If somebody disagrees with this, please describe what the term in one’s utility function corresponding to fairness would look like.)
I’m not sure how much anyone has been convinced that women have actual points of view
Where has anyone claimed they don’t? At least beyond the general rejection of qualia?
My default is to assume that men and women are pretty similar.
How do you reconcile this view with the way questions of tone have become entangled with gender issues in this very thread?
I was surprised at how strongly some people (probably mostly women) are uncomfortable with the tone here, so I have a lot to update.
I don’t like emoticons much—I don’t hate people who use them, but I use emoticons very rarely, and I’m not comfortable with them. I still find it hard to believe that if people do something a lot, there’s a reasonable chance (if they aren’t being paid) that they like it a lot, even though I can’t imagine liking whatever it is.
I don’t know what proportion of people are apt to interpret lack of overt friendliness as dislike, nor what the gender split is.
In the spirit of exploration, I took a look at Ravelry, a major knitting and crocheting blog. I haven’t found major discussions there yet. I’m interested in examples of blogs with different emotional tones/courtesy rules/gender balances.
Now that I think about it, blogs that are mostly women may be more likely to have overt statements of strong friendship and support. I believe that sort of effusiveness is partly cultural—wasn’t it more common for both men and women at least from the colonial era (US) to the Victorian era?
There was that discussion of ignoring good test results from a member of a group if you already believe that they’re bad at whatever was being tested. (They were referred to as blues, but it seemed to be a reference to women and math.) It was a case of only identifying with the gatekeeper.
It was also an extremely straightforward application of Bayes’s theorem.
That depends on how much you demand of your priors, and low quality priors is something that makes me nervous about Bayes.
For this particular case, there’s no examination of how much variance on the high side people get on tests. In particular, it seems very unlikely that people will get scores much above their baseline on tests about any sophisticated subject, though various factors (illness and other distractions) could drive their scores below their baseline.
What’s VNM utilitarianism? Is there any utilitarian cost to some capable people giving up because they believe rightly that their accomplishments will be discounted?
I’m not sure how much anyone has been convinced that women have actual points of view
Where has anyone claimed they don’t? At least beyond the general rejection of qualia?
My language may have been hyperbolic and/or vague. I was thinking of “creepiness = low status” which sounds to me like “it’s so unfair that women don’t want to spend time with men they’re uncomfortable around”. In this case, I was thinking “lack of point of view”, but “preferences are irrelevant” might be more accurate.
I think I’ve interpreted “creepiness = low status” as, “it’s unfair that low-status men get labeled as creepy for behavior that high-status men would get away with.”
Of course, one could respond that making people at least feel comfortable around you is an easy way to improve your status. :)
My language may have been hyperbolic and/or vague. I was thinking of “creepiness = low status” which sounds to me like “it’s so unfair that women don’t want to spend time with men they’re uncomfortable around”.
Is there any utilitarian cost to some capable people giving up because they believe rightly that their accomplishments will be discounted?
Well, this depends on the exact circumstances, but this may happen to the people who got unlucky on the test anyway, and using a better predictor decreases the number of people who get mischaracterized.
The von Neumann-Morgenstern theorem has nothing to do with utilitarianism, and it’s not about what you “should” do. Those words don’t appear in the statement of the theorem. The theorem does state that a VNM-rational agent has a preference ordering over lotteries of outcomes. In fact it can have any preferences over outcomes at all and still satisfy the hypotheses of the theorem. In particular, it can prefer fair outcomes to unfair outcomes for any definition of “fair”.
If you want to argue that one shouldn’t pursue fairness, you don’t want to use the VNM theorem.
The von Neumann-Morgenstern theorem has nothing to do with utilitarianism, and it’s not about what you “should” do.
Agreed; unfortunately, a lot of people around here seem to interpret it this way.
In particular, it can prefer fair outcomes to unfair outcomes for any definition of “fair”.
I would argue that fairness is a property of a process rather than an outcome, e.g., a kangaroo court doesn’t become “fair” just because it happens to reach the same verdict a fair trial would have.
Downvoted Eugine for the same reason, and upvoted MugaSofer back to positive. I value honest feedback, and see no reason to downvote ’em for providing it.
Then why is it that this difference, out of the many dimensions of differences that form up humankind, and the multitude of interest-group formation patterns that could have been generated, is the one that gets so much attention? It would be bizarre if an unbiased deliberation process systematically decides that one unremarkable axis (gender) is the one difference that should be discussed at great length and with very vigorous champions, while ignoring all of the other axes of diversity of human minds.
Now it is possible for one unremarkable axis to become overwhelmingly dominant in coalition formation, but that would involve some fairly unpleasant implications about the truth-seekiness and utilitarian consequences of this sort of thinking.
I dunno about this. It seems that the difference between those concerned with an intelligence explosion and those concerned with other scenarios has gotten way more attention here than gender.
I wasn’t surprised on the occasions when questions of differences in tone between the two camps flared up when discussing that topic. I would have been shocked almost beyond belief if, when discussing that topic, questions of tone differences between men and women had arisen.
The idea is, on almost every topic, men and women are very similar, because the differences aren’t relevant. When you begin looking at the differences, then you get amplifying effects. In particular, each participant being what they are and completely unable to change that means:
that the topic isn’t going to be converting people from one camp to the other or otherwise influencing their choice, as in the example above; it’s going to have to be about something surrounding that. This added layer of meta makes things much less stable. Imagine having a discussion about how we ought to talk about the differences between intelligence explosion and other scenarios, while it was universally acknowledged that no one was going to change their position on the actual subject. It’d be all over the place.
that empathy is harder to achieve. And in particular looking at the difference from one end gives exactly opposite perspectives on the issue. When you ‘normalize’ the differences, it’s maximally different.
By definition, those on either side have different experiences with regard to the difference, and thus are vastly more likely to hold different opinions.
There was that discussion of ignoring good test results from a member of a group if you already believe that they’re bad at whatever was being tested. (They were referred to as blues, but it seemed to be a reference to women and math.) It was a case of only identifying with the gatekeeper.
It was also an extremely straightforward application of Bayes’s theorem.
We have a population of 200 weasels, 100 blue and 100 red. 90% of blue weasels are programmers, and 10% of red weasels are programmers.
If we design a perfect test-of-being-a-programmer, we will have a pool of 100 programmers (90 blue, 10 red).
If our pool of programmers does NOT follow that distribution, it suggests that we’re probably doing something wrong in our screening, like de-facto excluding all of the red weasels due to bigotry. This HURTS us, because we now have fewer programmers in our pool, and/or we have non-programmers in our pool.
If you go out and test all the weasels, and 50% of them pass, and it’s 90% blue and 10% red, I don’t see any rational reason to assume that the blue weasels are going to be superior to the red weasels, or that the red weasels are more likely to be there because of test variance.
Now, if you get a pool that’s 80 red weasels and 20 blue weasels, you’re right to be suspicious that maybe this is not a very accurate test. But given the real-world job market, we should expect such outliers to occur. If everyone else is getting 90 blue and 10 red weasels from this test, you should assume you’re such an outlier, since you have plenty of evidence towards the test being accurate.
And if we’re getting that 90-10 ratio that we expect, there’s no reason to assume that the red weasels are any less competent. If 10% of all weasels are super-programmers, we should expect 10% of our blue programming weasels and 10% of our red programming weasels to be super-programmers (so, on average, 9 blue super-programmers and 1 red super-programmer).
Seriously, where is this anti-red-weasel bias coming from? Nothing in the math seems to suggest it, unless you’re using a seriously crappy test >.>
If you go out and test all the weasels, and 50% of them pass, and it’s 90% blue and 10% red, I don’t see any rational reason to assume that the blue weasels are going to be superior to the red weasels, or that the red weasels are more likely to be there because of test variance.
I don’t follow. Just because your test happened to result in a split that superficially resembles the underlying frequencies, why do you then assume that your imperfect test turned in exactly the right result in all 200 cases? The same logic of an imperfect test leading to shrinking estimates to the mean seems to still apply.
Nothing in the math seems to suggest it, unless you’re using a seriously crappy test
Did you follow my and Vaniver’s thread on this topic? The effect holds unless the test is perfectly accurate.
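For concreteness, here is a minimal sketch of the effect described above. The test accuracy numbers are ones I made up, not anything from the original discussion: suppose the test catches 90% of programmers and passes 10% of non-programmers, identically for both colors.

    # Bayes' theorem applied to the weasel example: identical test accuracy
    # for both colors, but different base rates of being a programmer.
    def posterior_programmer(base_rate, sensitivity=0.9, false_positive=0.1):
        """P(programmer | passed the test)."""
        p_pass = base_rate * sensitivity + (1 - base_rate) * false_positive
        return base_rate * sensitivity / p_pass

    print(posterior_programmer(0.9))   # blue weasels: ~0.99
    print(posterior_programmer(0.1))   # red weasels:  ~0.50

With those (invented) numbers, a passing blue weasel is about 99% likely to really be a programmer while a passing red weasel is only about 50% likely; the gap shrinks as the test gets more accurate and vanishes only when it is perfect.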
The effect holds unless the test is perfectly accurate.
WARNING: Rambly, half-thought-out answer here. It’s genuinely not something I’ve fully worked through myself, and I am totally open to feedback from you that I’m wrong.
The tl;dr version is that the effect is going to be small unless you have a very inaccurate test, and it’s suspicious to focus on a small effect when there’s probably other, larger effects we could be looking at.
Hmmm. Is that actually true? If we know the test has a 10% false positive rate for both red and blue weasels, doesn’t that suggest we should have 9 non-programmer blue weasels and 1 non-programmer red weasel?
Like, if I have a bag with 2 red marbles, and 2 white marbles, the odds of drawing a red marble are 50⁄50. But if my first draw is a red marble, I can’t claim that it’s still 50⁄50, and I can’t update to say that drawing one red marble makes me MORE likely to draw a second one. The new odds are 33⁄66, no matter what math you run. The only correct update is the one that leaves you concluding 33⁄66.
It seems like with a test like this, the test results… already factor in our prior distribution? I’m not sure if I’m being at all clear here :\
Absolutely, this isn’t always the case—if you just know that you have a 10% false positive, and it’s not calibrated for red false positives vs blue false positives, you DO have evidence that red false positives are probably more common. BUT, you’d still be a fool to exclude ALL red candidates on that basis, since you also know that you should legitimately have red candidates in your pool, and by accepting red candidates you increase the overall number of programmers you have access to.
It all depends on the accuracy of your test. If your test is sufficiently accurate that red weasels are only 1% more likely to be false positives, then this probably shouldn’t affect your actual decision making that much.
Then, if you decide to FOCUS on how red weasels have a +1% false positive rate, it implies that you consider this fact particularly important and relevant. It implies that this is a very central decision making factor, and you’re liable to do things like “not hire red weasels unless they got an A+ on their test”, even though the math doesn’t support this. If you’re just doing cold, hard math, we’d expect this factor to be down near the bottom of the list, not plastered up on a neon marquee saying “we did the cold hard math, and all you red weasels can f**k off!”
If we assume two populations, red-weasel-haters and rationalists, we could even run Bayes’ Theorem and conclude that anyone who goes around feeling the need to point out that 1% difference is SIGNIFICANTLY more likely to be a red-weasel-hater, not a rationalist.
Then we can go in to the utilitarian arguments about how feeding the red-weasel-haters political ammunition does actually increase their strength, and thus harms the red weasels, keeps them away from programming, and thus harms programming culture by reducing our pool of available programmers.
The tl;dr version is that the effect is going to be small unless you have a very inaccurate test, and it’s suspicious to focus on a small effect when there’s probably other, larger effects we could be looking at.
Yes, the effect is small in absolute magnitude—if you look at the example SAT shrinking that Vaniver and I were working out, the difference between the male/female shrunk scores is like 5 points although that’s probably an underestimate since it’s ignoring the difference in variance and only looking at means—but these 5 points could have a big difference depending on how the score is used or what other differences you look at.
For example, not shrinking could lead to a number of girls getting into Harvard who would not have otherwise, since Harvard has so many applicants and they all have very high SAT scores; there could well be a noticeable effect on the margin. When you’re looking at like 30 applications for each seat, 10 SAT points could be the difference between success and failure for a few applicants.
One could probably estimate how many by looking for logistic regressions of ‘SAT score vs admission chance’, seeing how much 10 points is worth, and multiplying against the number of applicants. 35k applicants in 2011 for 2.16k spots. One logistic regression has a ‘model 7’ taking into account many factors where going from 1300 to 1600 goes from an odds ratio of 1.907 to 10.381; so if I’m interpreting this right, an extra 10pts on your total SAT is worth an odds ratio of ((10.381 - 1.907) / (1600-1300)) * 10 + 1 = 1.282. So the members of a group given a 10pt gain are each 1.28x more likely to be admitted than they were before; before, they had a 2.16/35 = 6.17% chance, and now they have a (1.28 * 2.16) / 35 = 2.76 / 35 = 7.89% chance. To finish the analysis: if 17.5k boys apply and 17.5k girls apply and 6.17% of the boys are admitted while 7.89% of the girls are admitted, then there will be an extra (17500 * 0.0789) - (17500 * 0.0617) = 301 girls.
(A boost of more than 1% leading to 301 additional girls on the margin sounds too high to me. Probably I did something wrong in manipulating the odds ratios.)
One could make the same point about means of bell curves differing a little bit: it may lead to next to no real difference towards the middle, but out on the tails it can lead to absurd differentials. I think I once calculated that a difference of one standard deviation in IQ between groups A and B leads to a difference out at 3 deviations for A vs 4 deviations for B, which is usually the cutoff for ‘genius’, of ~50x. One sd is a lot and certainly not comparable to 10 points on the SAT, but you see what I mean.
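For anyone who wants to redo that tail calculation, here is a quick sketch using normal tail areas (the one-standard-deviation gap is the hypothetical from the comment above, not a claim about any particular real groups):

    from math import erfc, sqrt

    def tail(z):
        """P(Z > z) for a standard normal."""
        return 0.5 * erfc(z / sqrt(2))

    # A cutoff at +3 sd for group A sits at +4 sd for a group whose mean is 1 sd lower.
    print(tail(3) / tail(4))   # ~42.6, i.e. roughly the ~50x quoted above
    print(tail(2))             # ~0.023: the top-2% level discussed below
    print(tail(3.5))           # ~0.00023: about 2.3 in 10,000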
But if my first draw is a red marble
How do you know your first draw is a red marble?
BUT, you’d still be a fool to exclude ALL red candidates on that basis, since you also know that you should legitimately have red candidates in your pool, and by accepting red candidates you increase the overall number of programmers you have access to.
Depends on what you’re going to do with them, I suppose… If you can only hire 1 weasel, you’ll be better off going with one of the blue weasels, no? While if you’re just giving probabilities (I’m straining to think of how to continue the analogy: maybe the weasels are floating Hanson-style student loans on prediction markets and you want to see how to buy or sell their interest rates), sure, you just mark down your estimated probability by 1% or whatever.
If we assume two populations, red-weasel-haters and rationalists, we could even run Bayes’ Theorem and conclude that anyone who goes around feeling the need to point out that 1% difference is SIGNIFICANTLY more likely to be a red-weasel-hater, not a rationalist.
Alas! When red-weasel-hating is supported by statistics, only people interested in statistics will be hating on red-weasels. :)
an extra 10pts on your total SAT is worth an odds ratio of 1.282
We can check this interpretation by taking it to the 30th power, and seeing if we recover something sensible; unfortunately, that gives us an odds ratio of over 1700! If we had their beta coefficients, we could see how much 10 points corresponds to, but it doesn’t look like they report it.
Logistic regression is a technique that compresses the real line down to the range between 0 and 1; you can think of that model as the schools giving everyone a score, admitting people above a threshold with probability approximately 1, admitting people below a threshold with probability approximately 0, and then admitting people in between with a probability that increases based on their score (with a score of ‘0’ corresponding to a 50% chance of getting in).
We might be able to recover their beta by taking the log of the odds they report (see here). This gives us a reasonable but not too pretty result, with an estimate that 100 points of SAT is worth a score adjustment of .8. (The actual amount varies for each SAT band, which makes sense if their score for each student nonlinearly weights SAT scores. The jump from the 1400s to the 1500s is slightly bigger than the jump from the 1300s to the 1400s, suggesting that at the upper bands differences in SAT scores might matter more.)
A score increase of .08 cashes out as an odds ratio of 1.083, which when we take that to the power 30 we get 11.023, which is pretty close to what we’d expect.
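A sketch of that recovery, using the Model 7 odds ratios quoted in this thread (1.907, 4.062, and 10.381; I’m assuming they belong to the 1300s, 1400s, and 1500s bins, which is how they are treated elsewhere in the thread):

    from math import log, exp

    odds_ratios = {1300: 1.907, 1400: 4.062, 1500: 10.381}
    betas = {band: log(r) for band, r in odds_ratios.items()}

    # Implied coefficient per 100 SAT points between adjacent bins:
    print(betas[1400] - betas[1300])   # ~0.76
    print(betas[1500] - betas[1400])   # ~0.94 (bigger jump at the top; ~0.8 on average)

    # Sanity checks mentioned above:
    print(1.282 ** 30)                 # additive interpolation explodes: over 1700
    print(exp(0.08) ** 30)             # multiplicative version: ~11.0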
I think I once calculated that a difference of one standard deviation in IQ between groups A and B leads to a difference out at 3 deviations for A vs 4 deviations for B, what is usually the cutoff for ‘genius’, of ~50x.
Two standard deviations is generally enough to get you into ‘gifted and talented’ programs, as they call them these days. Four standard deviations gets you to finishing in the top 200 of the Putnam competition, according to Griffe’s calculations, which are also great at illustrating male/female ratios at various levels given Project Talent data on math ability.
I’ll also note again that the SAT is probably not the best test to use for this; it gives a male/female math ability variance ratio estimate of 1.1, whereas Project Talent estimated it as 1.2. Which estimate you choose makes a big difference in your estimation of the strength of this effect. (Note that, typically, more females take the SAT than males, because the cutoff for interest in the SAT is below the population mean, where male variability hurts as well as other factors, and this systemic bias in subject selection will show up in the results.)
Thanks for the odds corrections. I knew I got something wrong...
Two standard deviations is generally enough to get you into ‘gifted and talented’ programs, as they call them these days.
G&T stuff, yeah, but in the materials I’ve read 2sd is not enough to move you from ‘bright’ or ‘gifted and talented’ to ‘genius’ categories, which seems to usually be defined as >2.5-3sd, and using 3sd made the calculation easier.
Eh. MENSA requires upper 2% (which is ~2 standard deviations). Whether you label that ‘genius’ or ‘bright’ or something else doesn’t seem terribly important. 3.5 standard deviations is the 2.3 out of 10,000 level, which is about a hundred times more restrictive.
I’d call MENSA merely bright… You need something in between ‘normal’ and ‘genius’ and bright seems fine. Genius carries all the wrong connotations for something as common as MENSA-level; 2.3 out of 10k seems more reasonable.
Harvard… When you’re looking at like 30 applications for each seat, 10 SAT points could be the difference between success and failure for a few applicants.
Only if Harvard cares a lot about SAT scores. According to this graph, the value of SATs is pretty flat between the 93rd and 96th percentiles. Moreover, at other Ivies, SAT scores are penalized in this range. source, page 7(8)
This graph is not a direct measure of the role of SATs, because they can’t force all else to be equal. The paper argues that some schools really do penalize SAT scores in some regimes. I do not buy the argument, but the graph convinces me that I don’t know how it works. Many people respond to the graph that it is the aggregation of two populations admitted under different scoring rules, both of which value SATs, but I do not think that explains the graph.
Only if Harvard cares a lot about SAT scores. According to this graph, the value of SATs is pretty flat between the 93rd and 96th percentiles. Moreover, at other Ivies, SAT scores are penalized in this range. source, page 7(8)
Your graph doesn’t show that the average applicant won’t benefit from 10 points. It shows that overall, SAT scores make a big difference (from ~0 to 0.2, with not even bothering to show anyone below the 88th percentile).
This graph is not a direct measure of the role of SATs, because they can’t force all else to be equal.
The paper I cited earlier for logistic regressions used models controlling for other things. Given the benefits to athletes, legacies, and minorities, benefits necessary presumably because they cannot compete as well on other factors (like SAT scores), it’s not necessarily surprising if aggregating these populations can lead to a raw graph like those you show. Note that the most meritocratic school which places the least emphasis on ‘holistic’ admissions (enabling them to discriminate in various ways) is MIT, and their curve looks dramatically different from, say, Princeton.
Yes, if large SAT changes matter, then there must be some small changes that matter. But it is possible that there are other points on the scale where they don’t, or where they are harmful. I’m sorry if I failed to indicate that I meant only this limited point.
If a school admits two populations, then the histogram of SATs of its students might look like a camel. But why should the graph of chance of admission? I suppose Harvard’s graph makes sense if students apply when their assessment of their ability to get in crosses some threshold. Then applying screens off SATs, at least in some normal regime.* But at Yale and especially Princeton, rising SATs in the middle regime predicts greater mistaken belief in ability to get in. Legacies (but not athletes or AA) might explain the phenomenon by only applying to one elite school, but I don’t think legacies alone are big enough to cause the graph.
Here are the lessons I take away from the graphs that I would apply if I had been doing the regressions and wanted to explain the graphs. First, schools have different admissions policies, even schools as similar as Harvard and Yale. Averaging them together, as in the paper, may make things appear smoother than they really are. Second, given the nonlinear effect of SATs, it is good that the regression used buckets rather than assuming a linear effect. Third, since the bizarre downward slope is over the course of less than 100 points, the 100 point buckets of the regression may be too coarse to see it. Fourth, they could have shown graphs, too. It would have been so much more useful to graph athletes’ probability of admission as a function of their SAT scores. The main value of regressions is using the words “model” and “p-value.” Fifth, the other use of the regression model is that it lets them consider interactions, which do seem to say that there is not much interaction between SATs and other factors, that the marginal value of an SAT point does not depend on race, legacy status, or athlete status (except for the tiny <1000 category). But the coarseness of the buckets and the aggregating of schools does not allow me to draw much of a conclusion from this.
* Actually, the whole point of this thread is that you can’t completely screen off. But I want to elaborate on “normal regime.” At the high end, screening breaks down because if, say, 1500 SAT is enough to cross the threshold, everyone with 1500+ SAT applies and there is no screening phenomenon. At the low end, I don’t see why screening would break down. Why would someone with SAT<1000 apply to an elite school without really good reason? Yet lots of people apply with such low scores and don’t get in.
But it is possible that there are other points on the scale where they don’t, or where they are harmful.
Sure, there could be non-monotonicity.
If a school admits two populations, then the histogram of SATs of its students might look like a camel. But why should the graph of chance of admission?...Fifth, the other use of the regression model is that it lets them consider interactions, which do seem to say that there is not much interaction between SATs and other factors, that the marginal value of an SAT point does not depend on race, legacy status, or athlete status (except for the tiny <1000 category).
Imagine that Harvard lets in equal numbers of ‘athletes’ and ‘nerds’, the 2 groups are different populations with different means, and they do something like pick the top 10% in each group by score. Clearly there’s going to be a bimodal histogram of SAT scores: you have a lump of athlete scores in the 1000s, say, and a lump of nerd scores in the 1500s. Sure. 2 equal populations, different means, of course you’re going to see a bimodal.
Now imagine Harvard gets 10x more nerd applicants than athletic applicants; since each group gets the same number of spots, a random nerd will have 1⁄10 the admission chance of an athlete. Poor nerds. But Harvard kept the admission procedure the same as before. So what happens when you look at admission probability if all you know is the SAT score? Well, if you look at the 1500s applicants, you’ll notice that an awful lot of them aren’t admitted; and if you look at the 1000s applicants, you’ll notice that an awful lot of them are getting in. Does Harvard hate SAT scores? No, of course not: we specified they were picking mostly the high scorers, and indeed, if we classify each applicant into nerd or athlete categories and then look at admission rates by score, we’d see that yes, increasing SAT scores is always good: the nerd with a 1200 better apply to other colleges, and the athlete with 1400 might as well start learning how to yacht.
So even though in aggregate in our little model, high SAT scores look like a bad thing, for each group higher SAT scores are better.
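A toy simulation of that story (every number here is invented purely for illustration; this is not a model of Harvard’s actual process):

    import numpy as np

    rng = np.random.default_rng(0)

    # Two applicant pools with different score distributions, each competing
    # for the same number of seats.
    nerds = rng.normal(1450, 80, 20000)     # 10x more nerds than athletes
    athletes = rng.normal(1050, 80, 2000)
    seats_per_group = 1000

    nerd_cut = np.quantile(nerds, 1 - seats_per_group / len(nerds))
    athlete_cut = np.quantile(athletes, 1 - seats_per_group / len(athletes))

    scores = np.concatenate([nerds, athletes])
    admitted = np.concatenate([nerds >= nerd_cut, athletes >= athlete_cut])

    # Aggregate admission rate by 100-point band: it rises, falls, then rises
    # again, even though within each group a higher score always helps.
    for lo in range(900, 1700, 100):
        band = (scores >= lo) & (scores < lo + 100)
        if band.any():
            print(lo, round(admitted[band].mean(), 3))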
But the coarseness of the buckets and the aggregating of schools does not allow me to draw much of a conclusion from this.
Yes, I don’t think we could make a conclusive argument against the claim that SAT scores may not help at all levels, not without digging deep into all the papers running logistic regressions; but I regard that claim as pretty darn unlikely in the first place.
At the low end, I don’t see why screening would break down. Why would someone with SAT<1000 apply to an elite school without really good reason? Yet lots of people apply with such low scores and don’t get in.
They could be self-delusive, doing it to appease a delusive parent (‘My Johnnie Yu must go to Harvard and become a doctor!’), gambling that a tiny chance of admission is worth the effort, doing it on a dare, expecting that legacies or other things are more helpful than they actually are...
Sure, maybe you can make a model that outputs Harvard or Princeton’s results, but how do you explain the difference between Harvard and Princeton? It is easier to get into Princeton as either a jock or a nerd, but at 98th SAT percentile, it is harder to get into Princeton than Harvard. These are the smart jocks or dumb nerds. Maybe Harvard has first dibs on the smart jocks so that the student body is more bimodal at other schools. But why would admissions be more bimodal? Does Princeton not bother to admit the smart jocks? That’s the hypothesis in the paper: an SAT penalty. Or maybe Princeton rejects the dumb nerds. It would be one thing if Princeton, as a small school, admitted fewer nerds and just had higher standards for nerds. But they don’t at the high end. What’s going on? Here’s a hypothesis: Harvard (like Caltech) could admit nerds based on other achievements that only correlate with SATs, while Princeton has high pure-SAT standards.
I don’t think an SAT penalty is very plausible, but nothing I’ve heard sounds plausible. Mostly people make vague models like yours that I don’t think explain all the observations. The hypothesis that Princeton in contrast to Harvard does not count SAT for jocks beyond a graduation threshold at least does not sound insane.
not without digging deep into all the papers running logistic regressions
I take graphs over regressions, any day. Regressions fit a model. They yield very little information. Sometimes it’s exactly the information you want, as in the calculation you originally brought in the regression for. But with so little information there is no possibility of exploration or model checking.
By the way, the paper you cite is published at a journal with a data access provision.
Sure, maybe you can make a model that outputs Harvard or Princeton’s results, but how do you explain the difference between Harvard and Princeton?
Dunno. I’ve already pointed out the quasi-Simpson’s-paradox effect that could produce a lot of different shapes even while SAT score increases always help. Maybe Princeton favors musicians or something. If the only reason to look into the question is your incredulity and interest in the unlikely possibility that an increase in SAT score actually hurts some applicants, I don’t care nearly enough to do more than speculate.
By the way, the paper you cite is published at a journal with a data access provision.
I have citations in my DNB FAQ on how such provisions are honored mostly in the breach… I wonder what the odds are that you could get the data and that it would be complete and useful.
One logistic regression has a ‘model 7’ taking into account many factors where going from 1300 to 1600 goes from an odds ratio of 1.907 to 10.381; so if I’m interpreting this right, an extra 10pts on your total SAT is worth an odds ratio of ((10.381 − 1.907) / (1600-1300)) * 10 + 1 = 1.282.
Aren’t odds ratios multiplicative? It also seems to me that we should take the center of the SAT score bins to avoid an off-by-one bin width bias, so (10.381 / 1.907) ^ (10 / (1550 − 1350)) = 1.088. (Or compute additively with log-odds.)
As Vaniver mentioned, this estimate varies across the SAT score bins. If we look only at the top two SAT bins in Model 7: (10.381 / 4.062) ^ (10 / (1550 − 1450)) = 1.098.
Note that within the logistic model, they binned their SAT score data and regressed on them as dichotomous indicator variables, instead of using the raw scores and doing polynomial/nonparametric regression (I presume they did this to simplify their work because all other predictor variables are dichotomous).
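For anyone checking the arithmetic, the multiplicative version in one place (same bin odds ratios as above, interpolating per 10 points between bin centers):

    full_range = (10.381 / 1.907) ** (10 / (1550 - 1350))
    top_bins = (10.381 / 4.062) ** (10 / (1550 - 1450))
    print(round(full_range, 3))   # ~1.088
    print(round(top_bins, 3))     # ~1.098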
Aren’t odds ratios multiplicative? It also seems to me that we should take the center of the SAT score bins to avoid an off-by-one bin width bias, so (10.381 / 1.907) ^ (10 / (1550 − 1350)) = 1.088. (Or compute additively with log-odds.)
Yeah; Vaniver already did it via log odds.
If we look only at the top two SAT bins in Model 7: (10.381 / 4.062) ^ (10 / (1550 − 1450)) = 1.098.
Which is higher than the 1.088 from the full range, so I guess the full-range figure is an underestimate (fine by me).
Note that within the logistic model, they binned their SAT score data and regressed on them as dichotomous indicator variables, instead of using the raw scores and doing polynomial/nonparametric regression
Alas! I just went with the first paper on Harvard I found in Google which did a logistic regression involving SAT scores (well, second: the first one confounded scores with being legacies and minorities and so wasn’t useful). There may be a more useful paper out there.
I’d understood the question to be “given identical scores”, not “given a 10 point average difference in favor of the blue weasel”.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
My intuition says no: while I’d expect fewer females to be in that range to begin with, I can’t see any reason to assume their scores would cluster towards the lower end of the range compared to males.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
So, first let’s ask this question, supposing that the test is perfectly accurate. We’ll run through the numbers separately for the two subtests (so we don’t have to deal with correlation), taking means and variances from here.
Of those who scored 600-700 on the hypothetical normally distributed math SAT (hence “HNDMSAT”), the male mean was 643.3 (with 20% of the male population in this band), and the female mean was 640.6 (with 14.8% of the female population in this band).
Of those who scored 600-700 on the HNDVSAT, the male mean was 641.0 (with 14.9% of the male population in this band), and the female mean was 640.1 (with 13.7% of the female population in this band).
When we introduce the test error into the process, the computation gets a lot messier. The quick and dirty way to do things is to say “well, let’s just shrink the mean band scores towards the population mean with the reliability coefficient.” This turns the male edge on the HNDMSAT of 2.7 into 5.4, and the male edge of .9 into 1.8. (I think it’s coincidental that this is roughly doubling the edge.)
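For anyone who wants to reproduce that style of calculation, here is a rough sketch. The means, standard deviations, reliability, and band below are placeholders I made up rather than the figures actually used above, so the outputs won’t match the 2.7/5.4 numbers:

    from scipy.stats import truncnorm

    def band_mean(mu, sd, lo, hi):
        """Mean score of test-takers who landed in the [lo, hi] band."""
        a, b = (lo - mu) / sd, (hi - mu) / sd
        return truncnorm(a, b, loc=mu, scale=sd).mean()

    def shrink(observed, mu, reliability):
        """Quick-and-dirty shrinkage of an observed score toward the group mean."""
        return mu + reliability * (observed - mu)

    # Placeholder parameters, not real test statistics.
    male_band = band_mean(530, 120, 600, 700)
    female_band = band_mean(500, 110, 600, 700)
    reliability = 0.9
    print(male_band - female_band)                 # raw edge within the band
    print(shrink(male_band, 530, reliability)
          - shrink(female_band, 500, reliability)) # edge after shrinking toward group means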
My intuition says no: while I’d expect fewer females to be in that range to begin with, I can’t see any reason to assume their scores would cluster towards the lower end of the range compared to males.
That’s because you’re not thinking in bell curves. The range is all on one side of the mean, the male mean is closer to the bottom of the band, and the male variation is higher.
I’d understood the question to be “given identical scores”, not “given a 10 point average difference in favor of the blue weasel”.
My point was that ‘suppose that the true shrinkage leads to an adjusted difference of 10 points between the two groups; how much of a gift does 10 extra points represent?’ By using the nominal score rather than the true score, this has the effect of inflating the score. Once you’ve established how much the inflation might be, it’s natural to wonder about how much real-world consequence it might have leading into the Harvard musings.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
Depends on the mean and standard deviations of the 2 distributions, and then you could estimate how often the male sample average will be higher than the female sample average and vice versa.
The question should be ‘if we retest these 1200-1400 scorers, what will happen?’ The scores will probably drop as they regress to their mean due to an imperfect test. That’s the point.
The question should be ‘if we retest these 1200-1400 scorers, what will happen?’ The scores will probably drop as they regress to their mean due to an imperfect test. That’s the point.
Ahhh, that makes the statistics click in my brain, thanks :)
Do you know if there is much data out there on real-world gender differences vis-a-vis regression to the mean on IQ / SAT / etc. tests? i.e. is this based on statistics, or is it borne out in empirical observations?
Do you know if there is much data out there on real-world gender differences vis-a-vis regression to the mean on IQ / SAT / etc. tests? i.e. is this based on statistics, or is it borne out in empirical observations?
I haven’t seen any, offhand. Maybe the testing company provides info about retests, but then you’re going to have different issues: anyone who takes the second test may be doing so because they had a bad day (giving you regression to a mean from the other direction) and may’ve boned up on test prep since, and there’s the additional issue of test-retest effect—now that they know what the test is like, they will be less anxious and will know what to do, and test-takers in general may score better. (Since I’m looking at that right now, my DNB meta-analysis offers a case in point: in many of the experiments, the controls have slightly higher post-test IQ scores. Just the test-retest effect.)
The problem is that the concept of “fairness” you are using there is incompatible with VNM-utilitarianism. (If somebody disagrees with this, please describe what the term in one’s utility function corresponding to fairness would look like.)
First off, I have to say, just asking this sets off a serious, serious troll alert.
So, we have 5 players, and 50 utilions to divide between them. Players all value utilions equally, and utilions have linear value (i.e. 5 utilions is five times better than 1). Fairness says we give each player 10 utilions. Let’s make our unfair distribution 8, 8, 10, 12, 12.
How to express this mathematically? You could have a factor in your utility equation that is based on deviation from the mean (least-square immediately strikes me as elegant), or one which values the absolute difference between best and worst, or which averages against the lowest value.
For the first technique, the distribution 8, 8, 10, 12, 12 has four deviations of 2 from the mean of 10; each contributes 2^2 = 4, so the penalty is −16 utility compared to ideal.
For the second technique, you lose 4 utility (12 − 8 = 4).
For the third technique, the utility for each player is 8, 8, (10+8)/2 = 9, (12+8)/2 = 10, (12+8)/2 = 10, for a total penalty of −5 against ideal.
And that’s all assuming that fairness is a terminal value, not something that generates utility. That’s all assuming we’re playing with Platonic Utilions with linear value, rather than money (which seems to fall in value the more you get).
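To make the three candidate terms concrete, here’s a quick sketch (Python purely as illustration) computing each penalty for the toy distribution above:

```python
# Three toy "fairness" penalty terms for a utility distribution,
# matching the worked numbers above (8, 8, 10, 12, 12 vs. an even 10 each).

def penalty_least_squares(utils):
    # Technique A: negative sum of squared deviations from the mean.
    mean = sum(utils) / len(utils)
    return -sum((u - mean) ** 2 for u in utils)

def penalty_range(utils):
    # Technique B: negative difference between best- and worst-off player.
    return -(max(utils) - min(utils))

def penalty_average_with_lowest(utils):
    # Technique C: value each player at the average of their utility and the
    # lowest utility, and compare the total against the unadjusted total.
    lowest = min(utils)
    adjusted = [(u + lowest) / 2 for u in utils]
    return sum(adjusted) - sum(utils)

dist = [8, 8, 10, 12, 12]
print(penalty_least_squares(dist))        # -16
print(penalty_range(dist))                # -4
print(penalty_average_with_lowest(dist))  # -5
```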
I mean this sincerely: if you’re not a troll, I am genuinely and deeply confused how you could possibly think this is the slightest bit incompatible with VNM utilitarianism.
How to express this mathematically? You could have a factor in your utility equation that is based on deviation from the mean (least-square immediately strikes me as elegant), or one which values the absolute difference between best and worst, or which averages against the lowest value.
Ok, let’s apply these functions to a different scenario:
There are two people A and B, A has utility 5 and B has utility 10. We have no way of increasing their utilities but we can make thinks worse for them. Your term suggests we should lower B’s utility as a deadweight loss to make things more fair. This seems wrong.
Technique C already handles this: (10+5)/2 = 7.5, and (5+5)/2 = 5. So clearly going from 10->5 is bad, but having both of them be at 7.5 would be better, and having both of them at 10 would be even better still.
For technique B, yes, you will get results that say power imbalances are unfair and should be destroyed. The simplest example I could give is a world where Hitler has a million soldiers and everyone else has 100,000 combined. That power imbalance is dangerous, because Hitler can leverage that advantage to gain an even larger advantage, and so, over time, that inequality gets worse, and it can even reduce net utility (after the war, Hitler has 950,000 soldiers and everyone else has 50,000; 100K people died, and the world is more unfair!)
One of the big stumbling blocks for me with social justice was understanding that power imbalances can be bad in and of themselves. It’s not just soldiers, either. This happens rather vividly with money and many other resources (“spoons” seem to work this way, if you’re familiar with “spoon theory”)
Technique C already handles this: (10+5)/2 = 7.5, and (5+5)/2 = 5. So clearly going from 10->5 is bad, but having both of them be at 7.5 would be better, and having both of them at 10 would be even better still.
Of course technique C doesn’t address the weasel example.
For technique B, yes, you will get results that say power imbalances are unfair and should be destroyed.
When did we switch from talking about utility to talking about power? I agree power imbalances are dangerous; however, this fact doesn’t seem to bear on the weasel example.
Of course technique C doesn’t address the weasel example.
Have you considered using full thoughts… ooooh. What the hell is with all the trolls these days? :(
When did we switch from talking about utility to talking about power?
For the audience at home: That’s because out in “reality”, we can’t measure utilions, so we use things like power and money as proxies. In an ideal utopia with perfectly calibrated Utili-meters, this would not be as relevant.
Of course technique C doesn’t address the weasel example.
Have you considered using full thoughts… ooooh.
I’m not sure how to read this. I’m leaning towards, “I don’t have a counter argument so I’m going to resort to insults.”
To get back to the point, the problem with technique C is that it doesn’t address the case of adjusting test scores based on demographic priors, since the lowest utility (the people not accepted) is the same either way.
What the hell is with all the trolls these days?
You’re the one who just dropped the discussion to DH level 1 or 2.
You have a repeated pattern of not offering real responses: “Is this a parody?” “Is this?” being the biggest red flag I’ve encountered in this thread.
You are correct that I didn’t have a refutation, because “I don’t see how this ties in to the weasels” doesn’t give me enough information to try and resolve your confusion. In short, lately you seem to be putting near-zero effort into your replies: you’re not attempting to explain your position, just offering pithy one-sentence objections that don’t seem to contribute anything.
Given you have 2K karma and a few +50 rated comments, I’m willing to assume you’ve just had a bad week and actually explain this, but I still see no point in actually continuing the conversation, since your replies are all “taxing” me the same way a troll does: you put in minimal effort, and force the other person to hold it all afloat.
You’re the one who just dropped the discussion to DH level 1 or 2.
It’s the very definition of skilled trolling, to force other people to spend paragraphs defending themselves while you resort to easily misinterpreted one-sentence replies that do nothing to advance actual discourse.
The idea that I must maintain quality discourse, or even that it’s more productive, is a trap that ends up with a bunch of well-fed trolls.
You have a repeated pattern of not offering real responses: “Is this a parody?” “Is this?” being the biggest red flag I’ve encountered in this thread.
It’s as real a response as the question it’s a response to, and I give a substantive response to Nisan’s more substantive sentence.
You are correct that I didn’t have a refutation, because “I don’t see how this ties in to the weasels” doesn’t give me enough information to try and resolve your confusion.
You could give some indication of what additional information would help. Here are some possibilities:
1) You didn’t get what the weasels were referring to. Arguably I should have linked to this comment in the great-grandparent, but since the comment in question is yours, I assumed you’d get the reference.
2) You think the technique does in fact address the weasel example, in which case you could have said so, as well as possibly how you think it applies.
The problem is that the concept of “fairness” you are using there is incompatible with VHM-utilitarianism. (If somebody disagrees with this, please describe what the term in one’s utility function corresponding to fairness would look like.)
People care about fairness, and get negative utility from feeling like they are being treated unfairly.
I’d have to think about it, but if I didn’t think it would involve being severely taken advantage of to the point where it impacts what I want to do, I’d probably take it.
Commenting to state a disagreement with a LW narrative (you’re okay with the emotional tone / lack thereof) on a LW narratives thread will chip away at anonymity. If enough LW women were to do that, then people may figure out who wrote which narratives by process of elimination. I acknowledge that it would be way infeasible for all of us to memorize all the narratives and never say something that disagrees, and that’s not what I’m suggesting. I’m saying that adding a comment on the LW narratives thread itself that’s in clear disagreement with one of the narratives is poor anonymity strategy.
LW has been the first place which has given me the impression that men and women are opposed groups.
[...]
LW was the first place I’ve been where women caring about their own interests is viewed as a weird inimical trait which it’s only reasonable to subvert, and I’m talking about PUA.
Could you give some examples? I’m having trouble thinking of any.
LW was the first place I’ve been where women caring about their own interests is viewed as a weird inimical trait which it’s only reasonable to subvert, and I’m talking about PUA.
Could you give some examples? I’m having trouble thinking of any.
The general idea that women not being attracted to men who are attracted to them is just some arbitrary wrongness in the universe that any sensible man should try to get the women to ignore.
The general idea that women not being attracted to men who are attracted to them is just some arbitrary wrongness in the universe that any sensible man should try to get the women to ignore.
Fixing the man (as opposed to confusing the woman) seems like a good intervention, if it’s possible to a sufficient extent. The difficulty is that behavior and appearance are important aspects of a person, so fixing someone might involve fixing their behavior and appearance, which will be superficially similar to changing their behavior and appearance with the goal of confusion/deception. This apparently inescapable superficial similarity opens benevolent self-improvement in this area to the charge of deception, and it looks like it’s often hard for both sides to avoid mixing up the categories.
The general idea that women not being attracted to men who are attracted to them is just some arbitrary wrongness in the universe
Well, if they were attracted to the men attracted to them this would increase total utility. One of the less pleasant implications of utilitarianism.
On the other hand, it’s interesting that people are willing to swallow pushing people in front of trolleys, but not swallow the above. Probably related to this.
The general idea that women not being attracted to men who are attracted to them is just some arbitrary wrongness in the universe
Well, if they were attracted to the men attracted to them this would increase total utility. One of the less pleasant implications of utilitarianism.
This is only an implication of utilitarianism to the extent that forcibly wireheading everyone is an implication of utilitarianism. However, given some of your other remarks about unpleasant truths conflicting with social conformity, I doubt if you intended your comment as an argument against utilitarianism, but rather as an argument for PUA. Am I reading the tea-leaves correctly here?
This is only an implication of utilitarianism to the extent that forcibly wireheading everyone is an implication of utilitarianism.
Well, one can deal with wireheading by declaring that wireheads don’t count towards utility and/or have negative utility. That approach doesn’t work in this case since we don’t want to assign negative utility to the state of two people being attracted to each other.
I doubt if you intended your comment as an argument against utilitarianism, but rather as an argument for PUA. Am I reading the tea-leaves correctly here?
Why can’t I do both? After all, the correct Bayesian response to discovering that two ideas seem to contradict is decrease one’s confidence in both.
Well, one can deal with wireheading by declaring that wireheads don’t count towards utility and/or have negative utility.
One can deal with any counterexample by declaring that it “doesn’t count”. That does not make it not count. Wireheads, by definition, experience huge utility. That is what the word means, in discussions of utilitarianism.
That approach doesn’t work in this case since we don’t want to assign negative utility to the state of two people being attracted to each other.
We might very well want to assign negative utility to the process whereby that happened, for the same reasons as for forcible wireheading.
I doubt if you intended your comment as an argument against utilitarianism, but rather as an argument for PUA. Am I reading the tea-leaves correctly here?
Why can’t I do both?
That is just a way of not saying what you do. Do you, in fact, do both, and how much of each?
After all, the correct Bayesian response to discovering that two ideas seem to contradict is decrease one’s confidence in both.
The correct rational response is to resolve the contradiction, not to ignore it and utter platitudes about the truth lying between extremes. Dressing the latter up in rationalist jargon does not change that.
We might very well want to assign negative utility to the process whereby that happened, for the same reasons as for forcible wireheading.
That’s my point, you need to assign utility to processes rather than just outcomes.
That is just a way of not saying what you do. Do you, in fact, do both, and how much of each?
I am in fact doing both, in this case mostly against utilitarianism.
The correct rational response is to resolve the contradiction, not to ignore it and utter platitudes about the truth lying between extremes.
There is a difference between assuming the truth lies between two extremes, and assigning significant probability (say ~50%) to each of the two extremes. I’m trying to do the latter.
This thing allows you to see all contributions by a given user on the same page, so you can Ctrl-F through them. (OTOH, it is quite slow, at least on my system.)
Thank you. I found the thread about the video, but I’m not sure I replied to the discussion of discounting excellent results from people who aren’t expected to produce them. On the other hand, there doesn’t seem to be a problem with not finding it since there’s a consensus that it’s the sort of thing which would be plausible to find at LW.
I don’t think I’ve seen that on LW, but I also haven’t looked for it.
The version of the argument I’m familiar with boils down to ‘regression to the mean.’ Because tests provide imperfect estimates of the true ability, our final posterior is a combination of the prior (i.e. population ability distribution) and the new evidence.
Suppose someone scores 600 on a test whose mean is 500, and the test scores and underlying ability are normally distributed. Our prior belief that someone’s true ability is 590 is higher than our prior belief that their true ability is 600, which is higher than our prior belief that their true ability is 610, because the normal distribution is decreasing as you move away from the mean. If the test was off by 10, then it’s more likely to overestimate than underestimate. That is, our posterior is that it’s more likely that their real ability is 590 than 610. (Assuming it’s as easy to be positively lucky as negatively lucky, which is questionable.)
The same happens in the reverse direction: abnormally low scores are more likely to underestimate than overestimate the true ability (again, assuming it’s equally easy for luck to push up and down). Depending on the precision of the test, the end effect is probably small, but the size of the effect increases the more extreme the results are.
On math scores in particular, both the male mean and the male standard deviation are higher than the female mean and female standard deviation. The difference in standard deviations is discussed much less than the difference in means, but it turns out to be very important when calculating this effect. Thus, the chance that a female got an 800 on the Math SAT due to luck is higher than the chance that a male got an 800 on the Math SAT due to luck. Of course, the true ability necessary to get an 800 by luck is rather high, but could still be below some meaningful cutoff, and like Nancy points out, getting more evidence should make the posterior better reflect the true ability.
So the better a woman does, the less you believe she can actually do it.
Not quite. (Saving assumptions for the end of the comment.) If a female got a 499 on the Math SAT, then my estimate of her real score is centered on 499. If she scores a 532, then my estimate is centered on 530; a 600, 593; an 800, 780. A 20 point penalty is bigger than a 7 point penalty, but 780 is bigger than 593, so if by “it” you mean “math” that’s not the right way to look at it, but if by “it” you mean “that particular score” then yes.
Note that this should also be done to male scores, with the appropriate means and standard deviations. (The std difference was smaller than I remembered it being, so the mean effect will probably dominate.) Males scoring 499, 532, 600, and 800 would be estimated as actually getting 501, 532, 596, and 784. So at the 800 level, the relative penalty for being female would only be 4 points, not the 20 it first appears to be.
Note that I’m pretending that the score is from 2012, the SAT is normally distributed with mean and variances reported here, the standard measurement error is 30, and I’m multiplying Gaussian distributions as discussed here. The 2nd and 3rd assumptions are good near the middle but weak at the ends; the calculation done at 800 is almost certainly incorrect, because we can’t tell the difference between a 3 or 4 sigma mathematician, both of whom would most likely score 800; we could correct for that by integrating, but that’s too much work for a brief explanation. Note also that the truncation of the normal distribution by having a max and min score probably underestimates the underlying standard deviations, and so the effect would probably be more pronounced with a better test.
Another way to think about this is that a 2.25 sigma male mathematician will score 800, but a 2.66 sigma female mathematician is necessary to score 800, and >2.25 sigmas are 12 out of a thousand, whereas >2.66 sigmas are 4 out of a thousand.
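For anyone who wants to reproduce the arithmetic, here’s a minimal sketch of the shrinkage calculation. The population SDs (~112 for women, ~119 for men) are assumptions back-solved to match the numbers above rather than official figures, and the standard error of measurement is taken as 30:

```python
# Minimal sketch of the shrinkage calculation above. The population SDs are
# assumptions chosen to be consistent with the worked numbers, not official
# statistics; the standard error of measurement is taken as 30.

def shrink(score, pop_mean, pop_sd, sem=30.0):
    # Posterior mean when a normal prior (the population distribution) is
    # combined with a normal likelihood (the observed score with error).
    reliability = pop_sd**2 / (pop_sd**2 + sem**2)
    return pop_mean + reliability * (score - pop_mean)

for score in (499, 532, 600, 800):
    f = shrink(score, pop_mean=499, pop_sd=112)
    m = shrink(score, pop_mean=532, pop_sd=119)
    print(f"score {score}: female estimate {f:.0f}, male estimate {m:.0f}")
# Roughly reproduces 499/530/593/780 (female) and 501/532/596/784 (male).
```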
At what point do you update your prior about what women can do?
This isn’t necessary if the prior comes from data that includes the individual in question, and is practically unnecessary in cases where the individual doesn’t appreciably change the distribution. Enough females take the SAT that one more female scorer won’t move the mean or std enough to be noticeable at the precision that they report it.
In the writing example, where we’re dealing with a long tail, then it’s not clear how to deal with the sampling issues. You’d probably make an estimate for the current individual under consideration just using historical data as your prior, and then incorporate them in the historical data for the next individual under consideration, but you might include them before doing the estimation. I’m sure there’s a statistician who’s thought about this much longer and more rigorously than I have.
Can you see how this sort of thing, applied through a whole educational career, would tend to discourage learning and accomplishment?
Even if it’s true (at least until transhumanism really gets going) that the best mathematicians will always be men, it’s not as though second rank mathematicians are useless.
Can you see how this sort of thing, applied through a whole educational career, would tend to discourage learning and accomplishment?
Yes. In general, I recommend that people try to do the best they can with themselves, and not feel guilty about relative performance unless that guilt is motivating for them. If gatekeepers want to use this sort of effect in their reasoning, they should make it quantitative, rather than a verbal justification for a bias.
It’s not clear how desirable accurate expectations of future success are. To use startups as an example, 10% of startups succeed, but founders seem to put their chance of success at over 90%, and this may be better than more realistic expectations and fewer startups. For clever women, though, there seems to be a significant amount of pressure to go into STEM fields, followed by high rates of burnout and transfer away from STEM work. What rate of burnout would be strong evidence for overencouragement? I’m not sure.
Yes. In general, I recommend that people try to do the best they can with themselves, and not feel guilty about relative performance unless that guilt is motivating for them.
Having to deal with biased gatekeepers isn’t the same thing as feeling guilty about relative ability, even if some of the same internal strategies would help with both.
If gatekeepers want to use this sort of effect in their reasoning, they should make it quantitative, rather than a verbal justification for a bias.
Having to deal with biased gatekeepers isn’t the same thing as feeling guilty about relative ability
Agreed; that phrase was more appropriate in an earlier draft of the comment, and became less appropriate when I deleted other parts which mused about how much people should expect themselves to regress towards the population mean. They have a lot of private information about themselves, but it’s not clear to me that they have good information about the rest of the population, and so it seems easier to judge one’s absolute than one’s relative competence.
On topic to dealing with biased gatekeepers, it seems self-defeating to use the presence of obstacles as a discouraging rather than encouraging factor, conditioned on the opportunity being worth pursuing. Since the probability of success is an input to the calculation of whether or not an opportunity is worth pursuing, it’s not clear when and how much accuracy in expectations is desirable.
How likely is this?
I don’t know enough about the population of gatekeepers to comment on the likelihood of finding it in the field, but I am confident in it as a prescription.
What rate of burnout would be strong evidence for overencouragement?
Burnout might be related to factors other than not being able to do the work well enough. It could be a matter of hostile work environment.
From what I’ve read, women are apt to do more housework and childcare than their spouses, so there might be a matter of total work hours—or that one might be balanced out by men taking jobs with longer commutes.
From what I’ve read, women are apt to do more housework and childcare than their spouses, so there might be a matter of total work hours—or that one might be balanced out by men taking jobs with longer commutes.
I find it interesting that you cite evidence that is exactly what traditionalist theories of gender would predict, and don’t even mention them as a possible explanation.
Can you see how this sort of thing, applied through a whole educational career, would tend to discourage learning and accomplishment?
As this sort of thing becomes more common, it will be necessary to take into account the fact that others are also doing this when making these calculations.
Even if it’s true (at least until transhumanism really gets going)
And once transhumanism gets going it will be the case that the best mathematicians will be the people who received intelligence upgrade “Euler” as children. My point is that if you’re hoping for transhumanism because it will solve problems with inequality of ability, you should be careful what you wish for.
It seems to me that, given people are already sexist, and given that telling someone their group has a lower average directly lowers their performance, such a re-weighting should never ever be used.
Note that I’m pretending that the score is from 2012, the SAT is normally distributed with mean and variances reported here, the test-retest variability has a std of 30, and I’m multiplying Gaussian distributions as discussed here. The 2nd and 3rd assumption is good near the middle but weak at the ends; the calculation done at 800 is almost certainly incorrect, because we can’t tell the difference between a 3 or 4 sigma mathematician, both of whom would most likely score 800; we could correct for that by integrating, but that’s too much work for a brief explanation. Note also that the truncation of the normal distribution by having a max and min score probably underestimates the underlying standard deviations, and so the effect would probably be more pronounced with a better test.
I’m not sure you’re using the right numbers for the variability. The material I’m finding online indicates that the meaningful number is not ‘30 points with 67% confidence’ but simply the r correlation between 2 administrations of the SAT: the percent of regression toward the mean is 100*(1-r).
Using your female math mean of 499, a female score of 800 would be regressed to 800 − ((800 − 499) × 0.1) = 769.9. Using your male math mean of 532, a male score of 800 would regress down to 800 − ((800 − 532) × 0.1) = 773.2.
Hmm. You’re right that test-retest reliability typically refers to a correlation coefficient, and I was using the standard error of measurement. I’ll edit the grandparent to use the correct terms.
I’m not sure I agree with your method because it seems odd to me that the standard deviation doesn’t impact the magnitude of the regression to the mean effect. It seems like you could calculate the test-retest reliability coefficient from the population mean, population std, and standard measurement error std, and there might be different reliability coefficients for male and female test-takers, and then that’d probably be the simpler way to calculate it.
I’m not sure I agree with your method because it seems odd to me that the standard deviation doesn’t impact the magnitude of the regression to the mean effect.
Well, it delivers reasonable numbers, it seems to me that one ought to employ reliability somehow, it’s supported by the two links I gave, and it makes sense to me: standard deviation doesn’t come into it because we’ve already singled out a specific datapoint; we’re not asking how many test-scorers will hit 800 (where standard deviation would be very important) but, given that a test scorer has hit 800, how far will they fall back?
Now that I’ve run through the math, I agree with your method. Supposing the measurement error is independent of score (which can’t be true because of the bounds, and in general probably isn’t true), we can calculate the reliability coefficient by (pop var)/(pop var + measurement var)=.93 for women and .94 for men. The resulting formulas are the exact same, and the difference between the numbers I calculated and the numbers you calculated comes from our differing estimates of the reliability coefficient.
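A quick numerical check of that reconciliation (same assumed SDs as above): the linear correction and the Gaussian shrinkage are the same formula, and the disagreement between 769.9/773.2 and 780/784 comes only from the differing reliability estimates:

```python
# Check (with the assumed SDs from the sketch above) that the linear
# correction and the Gaussian-multiplication shrinkage are the same formula;
# only the reliability estimate differs between the two sets of numbers.

def regress(score, mean, r):
    return mean + r * (score - mean)

# gwern's assumed reliability of 0.9:
print(regress(800, 499, 0.9))   # 769.9 (female)
print(regress(800, 532, 0.9))   # 773.2 (male)

# reliability computed as pop_var / (pop_var + measurement_var):
r_f = 112**2 / (112**2 + 30**2)   # ~0.93
r_m = 119**2 / (119**2 + 30**2)   # ~0.94
print(regress(800, 499, r_f))   # ~779.9 (female)
print(regress(800, 532, r_m))   # ~784.0 (male)
```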
In general, the reliability coefficient doesn’t take into account extra distributional knowledge. If you knew that scores were power-law distributed in the population but the test error were normally distributed, for example, then you would want to calculate the posterior the long way: with the population data as your prior distribution and the measurement distribution as your likelihood ratio distribution, and the posterior is the renormalized product of the two. I don’t think that using a linear correction based on the reliability coefficient would get that right, but I haven’t worked it out to show the difference.
In general, the reliability coefficient doesn’t take into account extra distributional knowledge. If you knew that scores were power-law distributed in the population but the test error were normally distributed, for example, then you would want to calculate the posterior the long way: with the population data as your prior distribution and the measurement distribution as your likelihood ratio distribution, and the posterior is the renormalized product of the two. I don’t think that using a linear correction based on the reliability coefficient would get that right, but I haven’t worked it out to show the difference.
That makes sense, but I think the SAT is constructed like IQ tests to be normally rather than power-law distributed, so in this case we get away with a linear correction based on reliability.
So the better a woman does, the less you believe she can actually do it.
Yes; “extraordinary claims require extraordinary evidence, but ordinary claims require only ordinary evidence.” If a random person tells me that they are a Rhodes Scholar and a certified genius, I will be more skeptical than if they told me they merely went to Harvard, and more skeptical of that than if they told me they went to community college. And at some level of ‘better’ I will stop believing them entirely.
At what point do you update your prior about what women can do?
To go back to the multilevel model framework: a single high data point/group will be pulled back down to the mean of the population data points/group (how much will depend on the quality of the test), while the combined mean will slightly increase.
However, this increase may be extremely small, as makes sense. If you know from the official SAT statistics that 3 million women took the SAT last year and scored an average of 1200 (or whatever a medium score looks like these days, they keep changing the test), then that’s an extremely informative number which will be hard to change since you already know of how millions of women have done in the past: so whatever you learn from a single random woman scoring 800 this year will be diluted like 1 in 3 million...
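A back-of-the-envelope version of that dilution, using the math-section mean of 499 from earlier (rather than a combined score) just to keep the 800 example comparable:

```python
# Back-of-the-envelope: how much does one new 800 move a mean built from
# 3 million prior scores? (Mean of 499 borrowed from the math-section
# example above, purely for illustration.)
n, old_mean, new_score = 3_000_000, 499, 800
new_mean = (n * old_mean + new_score) / (n + 1)
print(new_mean - old_mean)   # ~0.0001 points
```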
The funny thing is this kind of discrimination can lead to (or appear to lead to) the average elite woman being MORE qualified than the average man at a similar level.
As randomness* would have it, I just ran into an example of women doing that to a woman for her fiction.
Just read the article. Given the information presented my prior is that Jamaica Kincaid got her job due to (possibly informal) affirmative action, i.e., the New Yorker felt like they needed a black female writer to be “diverse”.
True. This is my prior for “black female author gets extremely fast tracked” and the article didn’t say anything that would make me update away from it.
So the better a woman does, the less you believe she can actually do it.
It occurs to me that from Vaniver’s explanation one could also derive the sentence “So the better a man does, the less you believe he can actually do it.” As far as I can tell, the processes of drawing either of the two conclusions are isomorphic. For that matter, the same reasoning would also lead to the derivation “So the worse a woman does, the more you believe she is actually better.” (With an analogous statement for men. This is explicitly pointed out in the explanation.)
The difference between the men and the women is the point where we switch from “better/less” to “worse/more”, and the magnitude of the effect as we get further away from that point. (That is, the mean and the standard deviation.)
I can’t figure out a way of saying this without making me sound bad even to myself, but it seems… I don’t know, annoying at least, that you picked a logical conclusion that applies exactly the same to both genders but apply it only to women; that you don’t mention at all what appears to be the only factual assertion of an actual difference between the abilities of women and men (which I haven’t seen actually contested in either this or the earlier discussion on the subject); that you did not in fact criticise Vaniver’s explanation—which, by the way, as far as I can tell from his post, is just an explanation for beo’s benefit, and I can’t deduce from its text that he’s actually endorsing using the procedure—and at the same time you manage to make both him and me, even before I participate, seem that we should be ashamed of ourselves, by sort of implying that he’ll also do something else not mentioned by him, not logically implied by the explanation, and that would have a bad consequence if done very badly. (Well, it feels that way to me; I can’t tell if Vaniver took umbrage, nor if I’m actually reading correctly the society around me with respect to which the shame relates.)
I’m not sure if I have a point, exactly, I’m sort of just sharing my feelings in case it generates some insight. I don’t think you did this as an intentional dishonesty. It’s weird, it looks like there’s a blind spot exactly in the direction you’re looking at (after all, this is exactly the topic of the discussion).
But then again I also feel like I have such a blind spot, like it’s impolite that I should have noticed this, or even that I’m a bad person for not agreeing with your connotation and I can’t tell why. (And I’m some sort of misogynistic pig because I can’t see it.)
I seem to have that reaction quite often around this kind of discussion. I usually get sort of angry, go away, and dismiss the particular person that caused the reaction, but (I like to think) that’s only because I have low priors on people in general, which doesn’t apply here, and it seems worse somehow.
As far as I can tell I actually like men much less than women (in the “being around them” sense), and it feels as if I’m very inclined to equality, but somehow this kind of feminism seems very annoying. (I’m not exactly sure what I mean when I say “this kind of feminism”. The kind that argues for better women’s rights in some Islamic countries isn’t annoying, except in the sense that it gets me angry at humanity, but then again that’s kind of expected in my society, so it doesn’t say much.)
Thus, the chance that a female got an 800 on the Math SAT due to luck is higher than the chance that a male got an 800 on the Math SAT due to luck.
Shouldn’t it be possible to estimate the magnitude of this effect by comparing score distributions on tests with differently sized question pools, or write-in versus multiple choice, or which are otherwise more or less susceptible to luck?
You’d need a model of how much luck depends on those factors. Test-retest variability gives a good measure of how much one person’s scores vary from test to test; apparently for the SAT the test-retest standard deviation is about 30 points. (We can’t quite apply this number, since it might not be independent of score, but it’s better than nothing.)
The regression to the mean adjustment can be seen as a limited form of hierarchical/multilevel models with a fixed population mean, so any one score gets shrunk toward the population mean.
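For anyone unfamiliar with the multilevel framing, here’s a minimal sketch of the partial-pooling idea with made-up group data and made-up variance components (nothing below comes from real test data):

```python
# Minimal sketch of partial pooling: each group's raw mean is shrunk toward
# the overall mean, with more shrinkage for smaller groups. All numbers here
# are invented for illustration.
group_means = {"A": 620, "B": 540, "C": 480}
group_sizes = {"A": 5, "B": 50, "C": 500}
within_var = 100**2    # assumed variance of individual scores within a group
between_var = 40**2    # assumed variance of true group means

grand_mean = sum(group_means.values()) / len(group_means)
for g, m in group_means.items():
    n = group_sizes[g]
    # Shrinkage weight: how much the group's own data outweighs the prior.
    weight = between_var / (between_var + within_var / n)
    pooled = grand_mean + weight * (m - grand_mean)
    print(f"group {g}: raw mean {m}, pooled estimate {pooled:.0f}")
```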
I attempted a few searches with things like “test results, luck, lucky, group, prior, blues, gatekeeper, good day, second test”, etc.
Found nothing that fits what you were describing, unfortunately. Perhaps a few less-common terms from the discussion if you remember any, or even better any sentence / specific formulation used there, might help when combined together.
I’m ok with the general emotional tone (lack of tone?) here. I think I read the style of discussion as “we’re all here to be smart at each other, and we respect each other for being able to play”.
However, the gender issues have been beyond tiresome. My default is to assume that men and women are pretty similar. LW has been the first place which has given me the impression that men and women are opposed groups. I still think they’re pretty similar. The will to power is a shared trait even if it leads to conflict between opposed interests.
LW was the first place I’ve been where women caring about their own interests is viewed as a weird inimical trait which it’s only reasonable to subvert, and I’m talking about PUA.
I wish I could find the link, but I remember telling someone he’d left women out of his utilitarian calculations. He took it well, but I wish it hadn’t been my job to figure it out and find a polite way to say it.
Remember that motivational video Eliezer linked to? One of the lines toward the end was “If she puts you in the friend zone, put her in the rape zone.” I can’t imagine Eliezer saying that himself, and I expect he was only noticing and making use of the go for it and ignore your own pain slogans—but I’m still shocked and angry that it’s possible to not notice something like that. It’s all a matter of who you identify with. Truth is truth, but I didn’t want to find out that the culture had become that degraded.
And going around and around with HughRustik about PUA.… I think of him as polite and intelligent, and it took me a long time to realize that I kept saying that what I knew about PUA was what I’d read at LW, and he kept saying that it wasn’t all like Roissy, who I kept saying I hadn’t read. I grant that this is well within the normal range of human pigheadedness, and I’m sure I’ve done such myself because it can be hard to register that people hate what you love, but it was pretty grating to be on the receiving end of it.
There was that discussion of ignoring good test results from a member of a group if you already believe that they’re bad at whatever was being tested. (They were referred to as blues, but it seemed to be a reference to women and math.) It was a case of only identifying with the gatekeeper. No thought about the unfairness or the possible loss of information. I think it finally occurred to someone to give a second test rather than just assuming it was a good day or good luck.
Unfortunately, I don’t have an efficient way of finding these discussions I remember—I’ll grateful if anyone finds links, and then we can see how accurate my memories were.
All this being said, I think LW has also become Less Awful so far as gender issues are concerned. I’m not sure how much anyone has been convinced that women have actual points of view (partly my fault because I haven’t been tracking individuals) since there are still the complaints about what one is not allowed to say.
My apologies for that! You’re correct that I didn’t notice that on a different level than, say, the parts about killing your friends if they don’t believe in you or whatever else was in the Courage Wolf montage. I expect I made a ‘bleah’ face at that and some other screens which demonstrated concepts exceptionally less savory than ‘Courage’, but failed to mark it as something requiring a trigger warning. I think this was before I’d even heard of the concept of a “trigger warning”, which I first got to hear about after writing Ch. 7 of HPMOR.
Generally speaking, I’ve noticed that mentioning rape tends to mind-kill people on the Internet much more than mentioning murder. I hypothesize this is due to the fact that many more people are actually raped than murdered.
And that people who have been raped are much (infinitely?) more likely to go on to participate in discussions on rape than people who have been murdered are likely to participate in discussions on murder. Also, that rape is more likely to bring in gender politics.
What about people who have had friends or relatives murdered?
The murder of children, I think, tends to be intrinsically serious in the way that fictional murder in general isn’t. This might be part of it.
Presumably there’s as many such relatives as for the rape victims. (Unless lonely orphans are singled out by murderers? In order to inherit the family fortune, if I’ve learned anything about the real world from false made-up stories...)
This could be due to media filters, but I hear about people traumatized by the murder of their friends and family much more often than people traumatized by the rape of others.
...or people who survived attempted murder, for that matter. (Still probably many fewer of them in the average internet discussion than people who survived rape or attempted rape.)
I think there’s been a cultural shift—mentions of rape are taken a lot more seriously than they were maybe 20 years ago. (I’m sure of the shift, and less sure of the time scale.)
I believe part of it has been a feminist effort to get rape of women by men taken seriously which has started to get rape of men by men taken seriously. Rape by women is barely on the horizon so far.
PTSD being recognized as a real thing has made a major contribution—it meant that people could no longer say that rape is something which should just be gotten over. Another piece is an effort to make being raped not be a major status-lowering event, which made people more likely to talk about it.
As for comparison to murder, I’ve seen relatives of murdered people complain that murder jokes are still socially acceptable.
As far as I can tell, horrific events can be used as jokes when they aren’t vividly imagined, and whether something you haven’t experienced is vividly imagined is strongly affected by whether the people around you encourage you to imagine it or not.
That’s the subject of the first couple minutes of This American Life episode 342.
(Transcript here.)
That’s definitely a place I’ve heard it.
I’m not sure about that. It seems like in places and times where horrific events are much more common, people take an almost gallows humor attitude towards the whole thing (at least the violence part). Things like PTSD seem to happen when people in cultures where horrific events are rare temporarily get exposed to them.
This … seems to fit the evidence, actually. Not sure why it was downvoted; is there some evidence nobody’s told me about?
From what I’ve read, repeated trauma is a good way of predicting PTSD, so lack of familiarity with trauma wouldn’t be a good explanation.
Oh, right. I interpreted it as saying that horrific events are only traumatic when you’re from a culture where they’re rare, not that repeated traumatic events somehow lower one’s levels of PTSD. That would be nonsense, obviously.
Right. One idea I had is that what causes PTSD is not so much the traumatic experience as being surrounded by people who can’t relate to it.
A more Hansonian version is that exhibiting PTSD is a strategy to gain attention and sympathy and that this strategy won’t work if everyone around has also suffered similar experiences.
Another possibility is that in cultures where traumatic events are common, people who can’t deal with them without suffering PTSD are likely to get killed off by the next one.
There are probably many reasons involved, but I’d point out that in our media we frequently glamorize protagonists who kill people, but generally not ones who rape people.
There may be some cultural variation in this; I recall reading an African folk tale wherein, early on, the protagonist rapes his own mother. Afterwards he proceeds to navigate various perils with feats of cunning and derring-do, and I spent the rest of the story asking “how am I supposed to root for this guy? He raped his own mother! For no apparent reason, even!”
Tell me about that… Last night I was watching Big Miracle and I was like “how am I supposed to root for the whales? It’d probably cost a lot to save them, and with that much money you could save people!” Until the youngest whale was shown to be ill, then I did. I guess that illustrates the Near vs Far distinction even though that wasn’t the point!
BTW (continuing along the rape vs murder thing), have you read (say) Crime and Punishment, and if so, were you able to root for the protagonist? (I was.)
No, I’ve never read it.
This difference in commonality extends not only to victims but to perpetrators. A higher proportion of people who find rape funny will be rapists than those who find murder funny will be murderers; murder is much harder to get away with.
I think this has to do with the way we handle things related to sex; for example, if we were having this discussion 100 years ago, we might be talking about why portrayals of adultery are unacceptable in contexts where portrayals of murder would be.
I agree with your conclusion, but that particular example doesn’t counterexemplify my point because I guess many more people were actually cuckolded than murdered!
Apology accepted. I hadn’t thought about it that way, but I can see how you could have filed it under “generic hyperbolic obnoxious”.
At the time, I was just too tired of discussing gender issues to be more direct about that part of the video.
Looking at the discussion a year and a half later, I was somewhat amazed at the range of reactions to the video. Apropos of a recent Facebook discussion about the found cat and lotteries, there might be a clue about why people use imprecise hyperbolic language so much—it’s more likely to lead to action. I’ve also noticed that it doesn’t necessarily feel accurate to describe strong emotions in accurate, outside-view language.
There ought to be something intelligent and abstract to say about filtering mechanism conflicts, but I can’t think of what it might be right now. E.g., a mention once came up of os-tans on HN, someone said “What’s an os-tan?”, I posted a link to a page of OS-tans, and then replies complained that the page was NSFW and needed a warning. I was like “What? All those os-tans are totally safe for work, I checked”. Turns out there was a big ol’ pornographic ad at the top of the page which my eyes had probably literally skipped over, as in just never saccaded there.
That Courage Wolf video probably has a pretty different impact depending on whether or not you automatically skip over and mostly don’t even notice all the bad parts.
And in another ten years a naked person walking down the street will be invisible.
Sometimes I fail to include NSFW tags because I use an adblocker, so NSFW ads don’t appear for me.
Huh?! I wonder if this is another instance of Eliezer not realizing how atypical the bay area is.
Science fiction reference—I think it’s to Kurland’s The Unicorn Girl.
I don’t see how it is.
It seems like in the best case, PUA would be kind of like makeup. Lots of male attraction cues are visual, so they can be gamed when women wear makeup, do their hair, or wear an attractive outfit. Lots of female attraction cues are behavioral, so they can be gamed by acting or becoming more confident and interesting.
As one Metafilter user put it:
Do you have ethical problems with any of 1-4?
Ed. - It’s possible that when HughRistik said “not all PUA advice is like Roissy’s”, he meant “the PUA stuff we’re discussing on Less Wrong is Roissy-type stuff, and not all PUA stuff is like that”.
I’m actually at the point where I think it is impossible to give men useful advice to improve their sex lives and relationships because of the social dynamics that arise in nearly all societies. Actually good advice aimed at optimizing the life outcomes of the men who are given it has never been discussed in public spaces and considered reputable.
Same can naturally be said of advice for women. I think most modern dating advice both for men and women is anti-knowledge in that the more of it you follow the more miserable you will end up being. I would say follow your instincts but that doesn’t work either in our society since they are broken.
Advice about how to look better seems trivially useful and reputable… Overall, I find it extremely implausible that the intersection of palatable dating advice and useful dating advice is empty. What else would Clarisse Thorn’s “ethical PUA advice” be?
At the very least there should be some reasonably effective advice that’s only minimally unpalatable or whatever, like become a really good guitarist and impress girls with your guitar skillz.
Regarding PUA and evolutionary psychology: I don’t see how a self-selected population that’s under the influence of alcohol, and has been living with all kinds of weird modern norms and technology, has all that much in common with the EEA.
Good point that I hadn’t thought of. And also, most mating in the EEA would be with people that you’d had and expect to have extended interactions with—this is probably very different from trying to pick up strangers.
I’d go with “keep your eyes on the road, your hands upon the wheel”, i.e.¹ use the evidence that you see to update your model of the world,² and your model of the world to decide which possible behaviours would be most likely to achieve your goals. This applies to any goal whatsoever (not just dating), and ought to be obvious to LW readers, but people may tend to forget this in certain contexts due to ugh fields.
This is probably not what Jim Morrison meant by that, but still.
Note that the world also includes you. Noticing what this fact implies is left as an exercise for the reader.
I endorse this advice. Note however some consider this in itself unethical when it comes to interpersonal relations. I have no clue why.
I think I may have just figured out why. Think about the evolutionary purpose of niceness. Thinking about the nice vs. candid argument here, I suspect the purpose of niceness is to provide a credible precommitment to cooperate with someone in the future by sabotaging one’s own reasoning in such a way that will make one overestimate the value of cooperating with the other person.
Hmm, yeah. Causal decision theory doesn’t work right in several-player games and you shouldn’t defect in the Prisoner’s Dilemma, but that was one of the things I alluded to in Footnote 2; “would” in my comment was intended to be interpreted as explained in Good and Real.
Er… How the hell do those people think they learnt their own native language???
If all PUA said was those 4 things, it wouldn’t be interesting or controversial, so I think it’s pretty ridiculous to respond to a conversation about PUA by mentioning only the parts few people would disagree with. Trickery, lies, insults, treating people as things: these are the sorts of problems people have with PUA.
This sounds reasonable until you actually think about the four points mentioned in Near mode. Consider:
What does approaching lots of women actually look like if done in a logistically sound way? How does this relate to social norms? How does this relate to how feminists would like social norms to be?
Observe what actually confident humans do to signal their confidence. Just do.
Observe what is actually considered entertaining in a club environment that most PUA is designed to work in.
You know most of the things considered disreputable that PUAs advocate are precisely the result of first observing how points one to three actually work in our society and then optimizing to mimic this.
Only dressing and grooming well is probably not inherently controversial, and even then pick-up artists are mocked for their attempts to reverse-engineer fashion that signals what they want to signal.
I recommend Clarisse Thorn’s Confessions of a Pickup Artist Chaser—PUA is a divergent group of subcultures.
Seems like a reasonable complaint.
How do you reconcile this view with the way questions of tone have become entangled with gender issues in this very thread?
It was also an extremely straightforward application of Bayes’s theorem.
The problem is that the concept of “fairness” you are using there is incompatible with VNM-utilitarianism. (If somebody disagrees with this, please describe what the term in one’s utility function corresponding to fairness would look like.)
Where has anyone claimed they don’t? At least beyond the general rejection of qualia?
I was surprised at how strongly some people (probably mostly women) are uncomfortable with the tone here, so I have a lot to update.
I don’t like emoticons much—I don’t hate people who use them, but I use emoticons very rarely, and I’m not comfortable with them. I still find it hard to believe that if people do something a lot, there’s a reasonable chance (if they aren’t being paid) that they like it a lot, even though I can’t imagine liking whatever it is.
I don’t know what proportion of people are apt to interpret lack of overt friendliness as dislike, nor what the gender split is.
In the spirit of exploration, I took a look at Ravelry, a major knitting and crocheting blog. I haven’t found major discussions there yet. I’m interested in examples of blogs with different emotional tones/courtesy rules/gender balances.
Now that I think about it, blogs that are mostly women may be more likely to have overt statements of strong friendship and support. I believe that sort of effusiveness is partly cultural—wasn’t it more common for both men and women at least from the colonial era (US) to the Victorian era?
That depends on how much you demand of your priors, and low quality priors is something that makes me nervous about Bayes.
For this particular case, there’s no examination of how much variance on the high side people get on tests. In particular, it seems very unlikely that people will get scores much above their baseline on tests about any sophisticated subject, though various factors (illness and other distractions) could drive their scores below their baseline.
What’s VHF Utilitarianism? Is there any utilitarian cost to some capable people giving up because they believe rightly that their accomplishments will be discounted?
My language may have been hyperbolic and/or vague. I was thinking of “creepiness = low status” which sounds to me like “it’s so unfair that women don’t want to spend time with men they’re uncomfortable around”. In this case, I was thinking “lack of point of view”, but “preferences are irrelevant” might be more accurate.
I think I’ve interpreted “creepiness = low status” as, “it’s unfair that low-status men get labeled as creepy for behavior that high-status men would get away with.”
Of course, one could respond that making people at least feel comfortable around you is an easy way to improve your status. :)
That’s a large part of what PUA attempts to do.
Well is it unfair?
I wouldn’t say so. What do you think?
I’m trying to figure out what you mean by “fairness”. I don’t see why this isn’t unfair but adjusting the test scores based on priors is.
A typo, I meant VNM Utilitarianism.
Well, this depends on the exact circumstances, but this may happen to the people who got unlucky on the test anyway, and using a better predictor decreases the number of people who get mischaracterized.
Is this comment a satire?
In any case, the remark about the von Neumann-Morgenstern theorem is just wrong.
Is yours?
So, what does the term in a utility function corresponding to fairness look like?
Like, if someone wanted to mock this website, that’s exactly what they’d write.
You’re probably thinking that a utility function can’t prefer “fair” lotteries. But it can prefer fair outcomes, which is what’s relevant here.
I’m not a utilitarian and the arguments like the one I made about utility are part of the reason, if that’s what you’re asking.
What’s a “fair” outcome? Should we abandon life extension research because it would be “unfair” to those who died before it achieves results?
The von Neumann-Morgenstern theorem has nothing to do with utilitarianism, and it’s not about what you “should” do. Those words don’t appear in the statement of the theorem. The theorem does state that a VNM-rational agent has a preference ordering over lotteries of outcomes. In fact it can have any preferences over outcomes at all and still satisfy the hypotheses of the theorem. In particular, it can prefer fair outcomes to unfair outcomes for any definition of “fair”.
If you want to argue that one shouldn’t pursue fairness, you don’t want to use the VNM theorem.
Agreed, unfortunately a lot of people around here seem to interpret it this way.
I would argue that fairness is a property of a process rather than an outcome, e.g., a kangaroo court doesn’t become “fair” just because it happens to reach the same verdict a fair trial would have.
A simple “no” would have sufficed. Downvoted.
Downvoted Eugine for the same reason, and upvoted MugaSofer back to positive. I value honest feedback, and see no reason to downvote ’em for providing it.
When the difference IS the topic, that tends to amplify the relevance of the differences.
Then why is it that this difference, out of the many dimensions of differences that form up humankind, and the multitude of interest-group formation patterns that could have been generated, is the one that gets so much attention? It would be bizarre if an unbiased deliberation process systematically decides that one unremarkable axis (gender) is the one difference that should be discussed at great length and with very vigorous champions, while ignoring all of the other axes of diversity of human minds.
Now it is possible for one unremarkable axis to become overwhelmingly dominant in coalition formation, but that would involve some fairly unpleasant implications about the truth-seekiness and utilitarian consequences of this sort of thinking.
I dunno about this. It seems that the difference between those concerned with an intelligence explosion and those concerned with other scenarios has gotten way more attention here than gender.
I wasn’t surprised on the occasions when questions of differences in tone between the two camps flared up when discussing that topic. I would have been shocked almost beyond belief if, when discussing that topic, questions of tone differences between men and women had arisen.
The idea is, almost every topic, men and women are very similar, because the differences aren’t relevant. When you begin looking at the differences, then you get amplifying effects. In particular, each participant being what they are and completely unable to change that means:
that the topic isn’t going to be to convert people from one camp to the other or otherwise influence their choice, as in the example above; instead it’s going to have to be about the difference itself, one level removed. This added layer of meta makes things much less stable. Imagine having a discussion about how we ought to talk about the differences between intelligence explosion and other scenarios, while it was universally acknowledged that no one was going to change their position on the actual subject. It’d be all over the place.
that empathy is harder to achieve. And in particular looking at the difference from one end gives exactly opposite perspectives on the issue. When you ‘normalize’ the differences, it’s maximally different.
This.
By definition, those on either side have different experiences with regard to the difference, and thus are vastly more likely to hold different opinions.
We have a population of 200 weasels, 100 blue and 100 red. 90% of blue weasels are programmers, and 10% of red weasels are programmers.
If we design a perfect test-of-being-a-programmer, we will have a pool of 100 programmers (90 blue, 10 red).
If our pool of programmers does NOT follow that distribution, it suggests that we’re probably doing something wrong in our screening, like de-facto excluding all of the red weasels due to bigotry. This HURTS us, because we now have fewer programmers in our pool, and/or we have non-programmers in our pool.
If you go out and test all the weasels, and 50% of them pass, and it’s 90% blue and 10% red, I don’t see any rational reason to assume that the blue weasels are going to be superior to the red weasels, or that the red weasels are more likely to be there because of test variance.
Now, if you get a pool that’s 80 red weasels and 20 blue weasels, you’re right to be suspicious that maybe this is not a very accurate test. But given the real-world job market, we should expect such outliers to occur. If everyone else is getting 90 blue and 10 red weasels from this test, you should assume you’re such an outlier, since you have plenty of evidence towards the test being accurate.
And if we’re getting that 90-10 ratio that we expect, there’s no reason to assume that the red weasels are any less competent. If 10% of all weasels are super-programmers, we should expect 10% of our blue programming weasels and 10% of our red programming weasels to be super-programmers (so, on average, 9 blue super-programmers and 1 red super-programmer).
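A minimal sketch of that arithmetic (same population numbers as above, perfect test), just to make the expected pool explicit:

```python
# Weasel numbers from the example above (perfect test-of-being-a-programmer).
blue_weasels, red_weasels = 100, 100
p_programmer_blue, p_programmer_red = 0.90, 0.10

pool_blue = round(blue_weasels * p_programmer_blue)   # 90 blue programmers
pool_red = round(red_weasels * p_programmer_red)      # 10 red programmers
print(pool_blue, pool_red, pool_blue + pool_red)      # 90 10 100

# If 10% of all programming weasels are super-programmers, the expected
# split among them follows the same 90/10 ratio:
print(pool_blue * 0.10, pool_red * 0.10)              # 9.0 1.0
```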
Seriously, where is this anti-red-weasel bias coming from? Nothing in the math seems to suggest it, unless you’re using a seriously crappy test >.>
I don’t follow. Just because your test happened to result in a split that superficially resembles the underlying frequencies, why do you then assume that your imperfect test turned in exactly the right result in all 200 cases? The same logic of an imperfect test leading to shrinking estimates to the mean seems to still apply.
Did you follow my and Vaniver’s thread on this topic? The effect holds unless the test is perfectly accurate.
WARNING: Rambly, half-thought-out answer here. It’s genuinely not something I’ve fully worked through myself, and I am totally open to feedback from you that I’m wrong.
The tl;dr version is that the effect is going to be small unless you have a very inaccurate test, and it’s suspicious to focus on a small effect when there are probably other, larger effects we could be looking at.
Hmmm. Is that actually true? If we know the test has a 10% false positive rate for both red and blue weasels, doesn’t that suggest we should have 9 non-programmer blue weasels and 1 non-programmer red weasel?
Like, if I have a bag with 2 red marbles, and 2 white marbles, the odds of drawing a red marble are 50⁄50. But if my first draw is a red marble, I can’t claim that it’s still 50⁄50, and I can’t update to say that drawing one red marble makes me MORE likely to draw a second one. The new odds are 33⁄67, no matter what math you run. The only correct update is the one that leaves you concluding 33⁄67.
It seems like, with such a test, the test results… already factor in our prior distribution? I’m not sure if I’m being at all clear here :\
Absolutely, this isn’t always the case—if you just know that you have a 10% false positive, and it’s not calibrated for red false positives vs blue false positives, you DO have evidence that red false positives are probably more common. BUT, you’d still be a fool to exclude ALL red candidates on that basis, since you also know that you should legitimately have red candidates in your pool, and by accepting red candidates you increase the overall number of programmers you have access to.
It all depends on the accuracy of your test. If your test is sufficiently accurate that red weasels are only 1% more likely to be false positives, then this probably shouldn’t affect your actual decision making that much.
Then, if you decide to FOCUS on how red weasels have a +1% false positive rate, it implies that you consider this fact particularly important and relevant. It implies that this is a very central decision making factor, and you’re liable to do things like “not hire red weasels unless they got an A+ on their test”, even though the math doesn’t support this. If you’re just doing cold, hard math, we’d expect this factor to be down near the bottom of the list, not plastered up on a neon marquee saying “we did the cold hard math, and all you red weasels can f**k off!”
If we assume two populations, red-weasel-haters and rationalists, we could even run Bayes’ Theorem and conclude that anyone who goes around feeling the need to point out that 1% difference is SIGNIFICANTLY more likely to be a red-weasel-hater, not a rationalist.
Then we can go in to the utilitarian arguments about how feeding the red-weasel-haters political ammunition does actually increase their strength, and thus harms the red weasels, keeps them away from programming, and thus harms programming culture by reducing our pool of available programmers.
Yes, the effect is small in absolute magnitude—if you look at the example SAT shrinking that Vaniver and I were working out, the difference between the male/female shrunk scores is like 5 points, although that’s probably an underestimate since it’s ignoring the difference in variance and only looking at means—but these 5 points could make a big difference depending on how the score is used or what other differences you look at.
For example, not shrinking could lead to a number of girls getting into Harvard who would not have otherwise, since Harvard has so many applicants and they all have very high SAT scores; there could well be a noticeable effect on the margin. When you’re looking at like 30 applications for each seat, 10 SAT points could be the difference between success and failure for a few applicants.
One could probably estimate how many by looking for logistic regressions of ‘SAT score vs admission chance’, seeing how much 10 points is worth, and multiplying against the number of applicants. 35k applicants in 2011 for 2.16k spots. One logistic regression has a ‘model 7’ taking into account many factors, where going from 1300 to 1600 goes from an odds ratio of 1.907 to 10.381; so if I’m interpreting this right, an extra 10 points on your total SAT is worth an odds ratio of ((10.381 − 1.907) / (1600 − 1300)) * 10 + 1 = 1.282. So the members of a group given a 10-point gain are each 1.28x more likely to be admitted than they were before; before, they had a 2.16/35 = 6.17% chance, and now they have a (1.28 * 2.16) / 35 = 2.76 / 35 = 7.89% chance. To finish the analysis: if 17.5k boys apply and 17.5k girls apply, and 6.17% of the boys are admitted while 7.89% of the girls are admitted, then there will be an extra (17500 * 0.0789) − (17500 * 0.0617) = 301 girls. (A boost of more than 1% leading to 301 additional girls on the margin sounds too high to me. Probably I did something wrong in manipulating the odds ratios.)
One could make the same point about means of bell curves differing a little bit: it may lead to next to no real difference toward the middle, but out on the tails it can lead to absurd differentials. I think I once calculated that a difference of one standard deviation in IQ between groups A and B leads to a difference of ~50x in the fraction past the usual cutoff for ‘genius’ (3 deviations out for A vs 4 deviations out for B). One sd is a lot and certainly not comparable to 10 points on the SAT, but you see what I mean.
How do you know your first draw is a red marble?
Depends on what you’re going to do with them, I suppose… If you can only hire 1 weasel, you’ll be better off going with one of the blue weasels, no? While if you’re just giving probabilities (I’m straining to think of how to continue the analogy: maybe the weasels are floating Hanson-style student loans on prediction markets and you want to see how to buy or sell their interest rates), sure, you just mark down your estimated probability by 1% or whatever.
Alas! When red-weasel-hating is supported by statistics, only people interested in statistics will be hating on red-weasels. :)
We can check this interpretation by taking it to the 30th power, and seeing if we recover something sensible; unfortunately, that gives us an odds ratio of over 1700! If we had their beta coefficients, we could see how much 10 points corresponds to, but it doesn’t look like they report it.
Logistic regression is a technique that compresses the real line down to the range between 0 and 1; you can think of that model as the schools giving everyone a score, admitting people above a threshold with probably approximately 1, admitting people below a threshold with probability approximately 0, and then admitting people in between with a probability that increases based on their score (with a score of ‘0’ corresponding to a 50% chance of getting in).
We might be able to recover their beta by taking the log of the odds they report (see here). This gives us a reasonable but not too pretty result, with an estimate that 100 points of SAT is worth a score adjustment of .8. (The actual amount varies for each SAT band, which makes sense if their score for each student nonlinearly weights SAT scores. The jump from the 1400s to the 1500s is slightly bigger than the jump from the 1300s to the 1400s, suggesting that at the upper bands differences in SAT scores might matter more.)
A score increase of .08 cashes out as an odds ratio of 1.083, which when we take that to the power 30 we get 11.023, which is pretty close to what we’d expect.
Two standard deviations is generally enough to get you into ‘gifted and talented’ programs, as they call them these days. Four standard deviations gets you to finishing in the top 200 of the Putnam competition, according to Griffe’s calculations, which are also great at illustrating male/female ratios at various levels given Project Talent data on math ability.
I’ll also note again that the SAT is probably not the best test to use for this; it gives a male/female math ability variance ratio estimate of 1.1, whereas Project Talent estimated it as 1.2. Which estimate you choose makes a big difference in your estimation of the strength of this effect. (Note that, typically, more females take the SAT than males, because the cutoff for interest in the SAT is below the population mean, where male variability hurts as well as other factors, and this systemic bias in subject selection will show up in the results.)
Thanks for the odds corrections. I knew I got something wrong...
G&T stuff, yeah, but in the materials I’ve read 2sd is not enough to move you from ‘bright’ or ‘gifted and talented’ to ‘genius’ categories, which seems to usually be defined as >2.5-3sd, and using 3sd made the calculation easier.
Eh. MENSA requires upper 2% (which is ~2 standard deviations). Whether you label that ‘genius’ or ‘bright’ or something else doesn’t seem terribly important. 3.5 standard deviations is the 2.3 out of 10,000 level, which is about a hundred times more restrictive.
I’d call MENSA merely bright… You need something in between ‘normal’ and ‘genius’ and bright seems fine. Genius carries all the wrong connotations for something as common as MENSA-level; 2.3 out of 10k seems more reasonable.
Only if Harvard cares a lot about SAT scores. According to this graph, the value of SATs is pretty flat between the 93rd and 96th percentiles. Moreover, at other Ivies, SAT scores are penalized in this range. source, page 7(8)
This graph is not a direct measure of the role of SATs, because they can’t force all else to be equal. The paper argues that some schools really do penalize SAT scores in some regimes. I do not buy the argument, but the graph convinces me that I don’t know how it works. Many people respond to the graph that it is the aggregation of two populations admitted under different scoring rules, both of which value SATs, but I do not think that explains the graph.
Your graph doesn’t show that the average applicant won’t benefit from 10 points. It shows that overall, SAT scores make a big difference (from ~0 to 0.2, with not even bothering to show anyone below the 88th percentile).
The paper I cited earlier for logistic regressions used models controlling for other things. Given the benefits to athletes, legacies, and minorities, benefits necessary presumably because they cannot compete as well on other factors (like SAT scores), it’s not necessarily surprising if aggregating these populations can lead to a raw graph like those you show. Note that the most meritocratic school which places the least emphasis on ‘holistic’ admissions (enabling them to discriminate in various ways) is MIT, and their curve looks dramatically different from, say, Princeton.
Yes, if large SAT changes matter, then there must be some small changes that matter. But it is possible that there are other points on the scale where they don’t, or are even harmful. I’m sorry if I failed to indicate that I meant only this limited point.
If a school admits two populations, then the histogram of SATs of its students might look like a camel. But why should the graph of chance of admission? I suppose Harvard’s graph makes sense if students apply when their assessment of their ability to get in crosses some threshold. Then applying screens off SATs, at least in some normal regime.* But at Yale and especially Princeton, rising SATs in the middle regime predicts greater mistaken belief in ability to get in. Legacies (but not athletes or AA) might explain the phenomenon by only applying to one elite school, but I don’t think legacies alone are big enough to cause the graph.
Here are the lessons I take away from the graphs that I would apply if I had been doing the regressions and wanted to explain the graphs.
1) Schools have different admissions policies, even schools as similar as Harvard and Yale. Averaging them together, as in the paper, may make things appear smoother than they really are.
2) Given the nonlinear effect of SATs, it is good that the regression used buckets rather than assuming a linear effect.
3) Since the bizarre downward slope is over the course of less than 100 points, the 100-point buckets of the regression may be too coarse to see it.
4) They could have shown graphs, too. It would have been so much more useful to graph, say, probability of admission as a function of SAT score for athletes specifically. The main value of regressions is using the words “model” and “p-value.”
5) The other use of the regression model is that it lets them consider interactions, which do seem to say that there is not much interaction between SATs and other factors: the marginal value of an SAT point does not depend on race, legacy status, or athlete status (except for the tiny <1000 category). But the coarseness of the buckets and the aggregating of schools does not allow me to draw much of a conclusion from this.
* Actually, the whole point of this thread is that you can’t completely screen off. But I want to elaborate on “normal regime.” At the high end, screening breaks down because if, say, 1500 SAT is enough to cross the threshold, everyone with 1500+ SAT applies and there is no screening phenomenon. At the low end, I don’t see why screening would break down. Why would someone with SAT<1000 apply to an elite school without really good reason? Yet lots of people apply with such low scores and don’t get in.
Sure, there could be non-monotonicity.
Imagine that Harvard lets in equal numbers of ‘athletes’ and ‘nerds’, the 2 groups are different populations with different means, and they do something like pick the top 10% in each group by score. Clearly there’s going to be a bimodal histogram of SAT scores: you have a lump of athlete scores in the 1000s, say, and a lump of nerd scores in the 1500s. Sure. 2 equal populations, different means, of course you’re going to see a bimodal.
Now imagine Harvard gets 10x more nerd applicants than athletic applicants; since each group gets the same number of spots, a random nerd will have 1⁄10 the admission chance of an athlete. Poor nerds. But Harvard kept the admission procedure the same as before. So what happens when you look at admission probability if all you know is the SAT score? Well, if you look at the 1500s applicants, you’ll notice that an awful lot of them aren’t admitted; and if you look at the 1000s applicants, you’ll notice that an awful lot of them are getting in. Does Harvard hate SAT scores? No, of course not: we specified they were picking mostly the high scorers, and indeed, if we classify each applicant into nerd or athlete categories and then look at admission rates by score, we’d see that yes, increasing SAT scores is always good: the nerd with a 1200 had better apply to other colleges, and the athlete with 1400 might as well start learning how to yacht.
So even though in aggregate in our little model, high SAT scores look like a bad thing, for each group higher SAT scores are better.
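A minimal simulation of that toy model (the group sizes, score distributions, and the admit-the-top-scorers-per-group rule are all made up for illustration): aggregate admission rates can fall with SAT score over part of the range even though, within each group, a higher score always helps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers: many more nerd applicants than athlete applicants,
# the same number of spots for each group, and unbounded normally
# distributed scores so we don't have to worry about the 1600 ceiling.
nerds = rng.normal(1350, 100, 20000)
athletes = rng.normal(1100, 100, 2000)
spots = 200

def admitted(scores, n_spots):
    """Admit the top n_spots scorers within a group (a pure score threshold)."""
    cutoff = np.sort(scores)[-n_spots]
    return scores >= cutoff

adm_nerds = admitted(nerds, spots)        # ~1% of nerds admitted
adm_athletes = admitted(athletes, spots)  # ~10% of athletes admitted

scores = np.concatenate([nerds, athletes])
adm = np.concatenate([adm_nerds, adm_athletes])

# Aggregate admission rate by 100-point score band.
for lo in range(1100, 1700, 100):
    band = (scores >= lo) & (scores < lo + 100)
    if band.any():
        print(lo, f"{adm[band].mean():.1%}")
# Typical output: admission rates *fall* from the 1200s through the 1400s
# (those bands are mostly nerds, who face a much higher within-group cutoff)
# and then rise again, even though within each group a higher score always helps.
```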
Reminds me of Simpson’s paradox.
Yes, I don’t think we could make a conclusive argument against the claim that SAT scores may not help at all levels, not without digging deep into all the papers running logistic regressions; but I regard that claim as pretty darn unlikely in the first place.
They could be self-delusive, doing it to appease a delusive parent (‘My Johnnie Yu must go to Harvard and become a doctor!’), gambling that a tiny chance of admission is worth the effort, doing it on a dare, expecting that legacies or other things are more helpful than they actually are...
Sure, maybe you can make a model that outputs Harvard or Princeton’s results, but how do you explain the difference between Harvard and Princeton? It is easier to get into Princeton as either a jock or a nerd, but at 98th SAT percentile, it is harder to get into Princeton than Harvard. These are the smart jocks or dumb nerds. Maybe Harvard has first dibs on the smart jocks so that the student body is more bimodal at other schools. But why would admissions be more bimodal? Does Princeton not bother to admit the smart jocks? That’s the hypothesis in the paper: an SAT penalty. Or maybe Princeton rejects the dumb nerds. It would be one thing if Princeton, as a small school, admitted fewer nerds and just had higher standards for nerds. But they don’t at the high end. What’s going on? Here’s a hypothesis: Harvard (like Caltech) could admit nerds based on other achievements that only correlate with SATs, while Princeton has high pure-SAT standards.
I don’t think an SAT penalty is very plausible, but nothing I’ve heard sounds plausible. Mostly people make vague models like yours that I don’t think explain all the observations. The hypothesis that Princeton in contrast to Harvard does not count SAT for jocks beyond a graduation threshold at least does not sound insane.
I take graphs over regressions, any day.
Regressions fit a model. They yield very little information. Sometimes it’s exactly the information you want, as in the calculation you originally brought in the regression for. But with so little information there is no possibility of exploration or model checking.
By the way, the paper you cite is published at a journal with a data access provision.
Dunno. I’ve already pointed out the quasi-Simpsons Paradox effect that could produce a lot of different shapes even while SAT score increases always help. Maybe Princeton favors musicians or something. If the only reason to look into the question is your incredulity and interest in the unlikely possibility that increase in SAT score actually hurts some applicants, I don’t care nearly enough to do more than speculate.
I have citations in my DNB FAQ on how such provisions are honored mostly in the breach… I wonder what the odds that you could get the data and that it would be complete and useful.
Aren’t odds ratios multiplicative? It also seems to me that we should take the center of the SAT score bins to avoid an off-by-one bin width bias, so (10.381 / 1.907) ^ (10 / (1550 − 1350)) = 1.088. (Or compute additively with log-odds.)
As Vaniver mentioned, this estimate varies across the SAT score bins. If we look only at the top two SAT bins in Model 7: (10.381 / 4.062) ^ (10 / (1550 − 1450)) = 1.098.
Note that within the logistic model, they binned their SAT score data and regressed on them as dichotomous indicator variables, instead of using the raw scores and doing polynomial/nonparametric regression (I presume they did this to simplify their work because all other predictor variables are dichotomous).
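In code, the same multiplicative interpolation (treating the quoted odds ratios as the 1300s/1400s/1500s bins with centers 1350/1450/1550 is the assumption described above, not something the paper states this way):

```python
import math

# Odds ratios quoted upthread from the paper's model 7.
or_1300s, or_1400s, or_1500s = 1.907, 4.062, 10.381

# Per-10-point odds ratio over the whole range (1300s bin to 1500s bin, 200 points):
whole_range = (or_1500s / or_1300s) ** (10 / (1550 - 1350))   # ~1.088

# Per-10-point odds ratio using only the top two bins (100 points apart):
top_bins = (or_1500s / or_1400s) ** (10 / (1550 - 1450))      # ~1.098

# The same whole-range number computed additively in log-odds space:
log_odds_version = math.exp((math.log(or_1500s) - math.log(or_1300s)) / 200 * 10)

print(round(whole_range, 3), round(top_bins, 3), round(log_odds_version, 3))
# 1.088 1.098 1.088
```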
Yeah; Vaniver already did it via log odds.
Which is higher than the top bin of 1.088 so I guess that makes using the top bin an underestimate (fine by me).
Alas! I just went with the first paper on Harvard I found in Google which did a logistic regression involving SAT scores (well, second: the first one confounded scores with being legacies and minorities and so wasn’t useful). There may be a more useful paper out there.
I’d understood the question to be “given identical scores”, not “given a 10 point average difference in favor of the blue weasel”.
i.e. we take a random sample of 100 men and 100 women with SAT scores between 1200-1400 (high but not perfect scores). Are the male scores going to average better than the females?
My intuition says no: while I’d expect fewer females to be in that range to begin with, I can’t see any reason to assume their scores would cluster towards the lower end of the range compared to males.
So, first let’s ask this question, supposing that the test is perfectly accurate. We’ll run through the numbers separately for the two subtests (so we don’t have to deal with correlation), taking means and variances from here.
Of those who scored 600-700 on the hypothetical normally distributed math SAT (hence “HNDMSAT”), the male mean was 643.3 (with 20% of the male population in this band), and the female mean was 640.6 (with 14.8% of the female population in this band).
Of those who scored 600-700 on the HNDVSAT, the male mean was 641.0 (with 14.9% of the male population in this band), and the female mean was 640.1 (with 13.7% of the female population in this band).
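A sketch of those band figures via a truncated normal. The means are the ones quoted above; the standard deviations are back-solved guesses that reproduce the quoted numbers to within rounding, not official College Board figures, so treat the whole thing as illustrative:

```python
from scipy.stats import norm, truncnorm

# Means from the parent comments; SDs are back-solved approximations.
params = {
    "HNDMSAT male": (532, 119), "HNDMSAT female": (499, 113),
    "HNDVSAT male": (498, 116), "HNDVSAT female": (493, 111),
}

lo, hi = 600, 700
for label, (mu, sd) in params.items():
    a, b = (lo - mu) / sd, (hi - mu) / sd
    in_band = norm.cdf(b) - norm.cdf(a)                   # share of population in the band
    band_mean = truncnorm(a, b, loc=mu, scale=sd).mean()  # mean score within the band
    print(f"{label}: {in_band:.1%} in band, mean {band_mean:.1f}")
# Approximately reproduces the figures above, e.g. ~20% of males in the
# 600-700 HNDMSAT band with a band mean of ~643, vs ~14.8% of females
# with a band mean of ~640.5.
```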
When we introduce the test error into the process, the computation gets a lot messier. The quick and dirty way to do things is to say “well, let’s just shrink the mean band scores towards the population mean with the reliability coefficient.” This turns the male edge on the HNDMSAT of 2.7 into 5.4, and the male edge of .9 into 1.8. (I think it’s coincidental that this is roughly doubling the edge.)
That’s because you’re not thinking in bell curves. The range is all on one side of the mean, the male mean is closer to the bottom of the band, and the male variation is higher.
My point was that ‘suppose that the true shrinkage leads to an adjusted difference of 10 points between the two groups; how much of a gift does 10 extra points represent?’ By using the nominal score rather than the true score, this has the effect of inflating the score. Once you’ve established how much the inflation might be, it’s natural to wonder about how much real-world consequence it might have leading into the Harvard musings.
Depends on the mean and standard deviations of the 2 distributions, and then you could estimate how often the male sample average will be higher than the female sample average and vice versa.
The question should be ‘if we retest these 1200-1400 scorers, what will happen?’ The scores will probably drop as they regress to their mean due to an imperfect test. That’s the point.
Ahhh, that makes the statistics click in my brain, thanks :)
Do you know if there is much data out there on real-world gender differences vis-a-vis regression to the mean on IQ / SAT / etc. tests? i.e. is this based on statistics, or is it born out in empirical observations?
I haven’t seen any, offhand. Maybe the testing company provides info about retests, but then you’re going to have different issues: anyone who takes the second test may be doing so because they had a bad day (giving you regression to a mean from the other direction) and may’ve boned up on test prep since, and there’s the additional issue of test-retest effect—now that they know what the test is like, they will be less anxious and will know what to do, and test-takers in general may score better. (Since I’m looking at that right now, my DNB meta-analysis offers a case in point: in many of the experiments, the controls have slightly higher post-test IQ scores. Just the test-retest effect.)
First off, I have to say, just asking this sets off a serious, serious troll alert.
So, we have 5 players, and 50 utilions to divide between them. Players all value utilions equally, and utilions have linear value (i.e. 5 utilions is five times better than 1). Fairness says we give each player 10 utilions. Let’s make our unfair distribution 8, 8, 10, 12, 12.
How to express this mathematically? You could have a factor in your utility equation that is based on deviation from the mean (least-square immediately strikes me as elegant), or one which values the absolute difference between best and worst, or which averages against the lowest value.
For the first technique, the distribution 8,8,10,12,12 has four players each deviating from the mean by 2, so (2^2 = 4) x 4 = 16, i.e. −16 utility compared to ideal.
For the second technique, you lose −4 utility (12 − 8 = 4).
For the third technique, the utility for each player is 8, 8, ((10+8)/2 = 9), ((12+8)/2 = 10), ((12+8)/2 = 10), for a total of 45, i.e. a penalty of −5 against ideal.
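A tiny sketch of those three penalty functions applied to the example distribution (nothing here beyond the arithmetic above):

```python
dist = [8, 8, 10, 12, 12]
ideal = sum(dist) / len(dist)   # 10 utilions each

# Technique A: least-squares deviation from the mean.
penalty_a = sum((x - ideal) ** 2 for x in dist)          # 16

# Technique B: spread between the best- and worst-off player.
penalty_b = max(dist) - min(dist)                        # 4

# Technique C: each player's utility averaged against the worst-off player's.
adjusted = [(x + min(dist)) / 2 for x in dist]           # [8, 8, 9, 10, 10]
penalty_c = sum(dist) - sum(adjusted)                    # 5 (vs the ideal, which has the same total)

print(penalty_a, penalty_b, penalty_c)
```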
And that’s all assuming that fairness is a terminal value, not something that generates utility. That’s all assuming we’re playing with Platonic Utilions with linear value, rather than money (which seems to fall in value the more you get).
I mean this sincerely: if you’re not a troll, I am genuinely and deeply confused how you could possibly think this is the slightest bit incompatible with VNM utilitarianism.
Ok, let’s apply these functions to a different scenario:
There are two people A and B; A has utility 5 and B has utility 10. We have no way of increasing their utilities, but we can make things worse for them. Your fairness term suggests we should lower B’s utility as a deadweight loss to make things more fair. This seems wrong.
Technique C already handles this: (10+5)/2 = 7.5, and (5+5)/2 = 5. So clearly going from 10 -> 5 is bad, but having both of them be at 7.5 would be better, and having both of them at 10 would be even better still.
For technique B, yes, you will get results that say power imbalances are unfair and should be destroyed. The simplest example I could give is a world where Hitler has a million soldiers and everyone else has 100,000 combined. That power imbalance is dangerous, because Hitler can leverage that advantage to gain an even larger advantage, and so, over time, the inequality gets worse, and it can even reduce net utility (after the war, Hitler has 950,000 soldiers and everyone else has 50,000: 100K people died, and the world is more unfair!)
One of the big stumbling blocks for me with social justice was understanding that power imbalances can be bad in and of themselves. It’s not just soldiers, either. This happens rather vividly with money and many other resources (“spoons” seem to work this way, if you’re familiar with “spoon theory”)
Of course technique C doesn’t address the weasel example.
When did we switch from talking about utility to talking about power? I agree power imbalances are dangerous; however, this fact doesn’t seem to bear on the weasel example.
Have you considered using full thoughts… ooooh. What the hell is with all the trolls these days? :(
For the audience at home: That’s because out in “reality”, we can’t measure utilions, so we use things like power and money as proxies. In an ideal utopia with perfectly calibrated Utili-meters, this would not be as relevant.
I’m not sure how to read this. I’m leaning towards, “I don’t have a counter argument so I’m going to resort to insults.”
To get back to the point, the problem with technique C is that it doesn’t address the case of adjusting test scores based on demographic priors, since the lowest utility (the people not accepted) is the same either way.
You’re the one who just dropped the discussion to DH level 1 or 2.
You have a repeated pattern of not offering real responses: “Is this a parody?” “Is this?” being the biggest red flag I’ve encountered in this thread.
You are correct that I didn’t have a refutation, because “I don’t see how this ties in to the weasels” doesn’t give me enough information to try and resolve your confusion. In short, lately you seem to be putting near-zero effort into your replies: you’re not attempting to explain your position, just offering pithy one-sentence objections that don’t seem to contribute anything.
Given you have 2K karma and a few +50 rated comments, I’m willing to assume you’ve just had a bad week and actually explain this, but I still see no point in actually continuing the conversation, since your replies are all “taxing” me the same way a troll does: you put in minimal effort, and force the other person to hold it all afloat.
It’s the very definition of skilled trolling, to force other people to spend paragraphs defending themselves while you resort to easily misinterpreted one-sentence replies that do nothing to advance actual discourse.
The idea that I must maintain quality discourse, or even that it’s more productive, is a trap that ends up with a bunch of well-fed trolls.
It’s as real a response as the question it’s a response to and I give a substantive response to Nisan’s more substantive sentence.
You could give some indication of what additional information would help. Here are some possibilities:
1) You didn’t get what the weasels were referring to. Arguably I should have linked to this comment in the great-grandparent, but since the comment in question is yours, I assumed you’d get the reference.
2) You think the technique does in fact address the weasel example; in that case you could have said so, as well as how you think it applies.
3) Something I haven’t thought of.
People care about fairness, and get negative utility from feeling like they are being treated unfairly.
So let’s apply Eliezer’s “murder pill” thought experiment to this:
If I offered people a pill that made them not care about being treated unfairly, would they take it?
If the answer is no, that means they care about fairness beyond the bad feeling it generates.
I’d have to think about it, but if I didn’t think it would involve being severely taken advantage of, to the point where it impacts what I want to do, I’d probably take it.
here
Thanks, but I’m pretty sure that isn’t it. The one I remember had an allegory and originated at LW.
How about this?
Thanks. That’s at least a plausible candidate—not an exact match for what I remember, but awfully close. How did you find it?
like this
Cool. I’d been wondering about how to search for links.
Commenting to state a disagreement with a LW narrative (you’re okay with the emotional tone / lack thereof) on a LW narratives thread will chip away at anonymity. If enough LW women were to do that, then people may figure out who wrote which narratives by process of elimination. I acknowledge that it would be way infeasible for all of us to memorize all the narratives and never say something that disagrees, and that’s not what I’m suggesting. I’m saying that adding a comment on the LW narratives thread itself that’s in clear disagreement with one of the narratives is poor anonymity strategy.
Could you give some examples? I’m having trouble thinking of any.
The general idea that women not being attracted to men who are attracted to them is just some arbitrary wrongness in the universe that any sensible man should try to get the women to ignore.
Fixing the man (as opposed to confusing the woman) seems like a good intervention, if it’s possible to a sufficient extent. The difficulty is that behavior and appearance are important aspects of a person, so fixing someone might involve fixing their behavior and appearance, which will be superficially similar to changing their behavior and appearance with the goal of confusion/deception. This apparently inescapable superficial similarity opens benevolent self-improvement in this area to the charge of deception, and it looks like it’s often hard for both sides to avoid mixing up the categories.
o.O
Seriously? I mean, everyone wants to be more attractive, but … that’s a very, well, psychopath-y way of looking at it.
I think I’ve somehow managed not to run into this, do you have any links?
Well, if they were attracted to the men attracted to them this would increase total utility. One of the less pleasant implications of utilitarianism.
On the other hand, it’s interesting that people are willing to swallow pushing people in front of trolleys, but not swallow the above. Probably related to this.
This is only an implication of utilitarianism to the extent that forcibly wireheading everyone is an implication of utilitarianism. However, given some of your other remarks about unpleasant truths conflicting with social conformity, I doubt if you intended your comment as an argument against utilitarianism, but rather as an argument for PUA. Am I reading the tea-leaves correctly here?
Well, one can deal with wireheading by declaring that wireheads don’t count towards utility and/or have negative utility. That approach doesn’t work in this case since we don’t want to assign negative utility to the state of two people being attracted to each other.
Why can’t I do both? After all, the correct Bayesian response to discovering that two ideas seem to contradict is to decrease one’s confidence in both.
One can deal with any counterexample by declaring that it “doesn’t count”. That does not make it not count. Wireheads, by definition, experience huge utility. That is what the word means, in discussions of utilitarianism.
We might very well want to assign negative utility to the process whereby that happened, for the same reasons as for forcible wireheading.
That is just a way of not saying what you do. Do you, in fact, do both, and how much of each?
The correct rational response is to resolve the contradiction, not to ignore it and utter platitudes about the truth lying between extremes. Dressing the latter up in rationalist jargon does not change that.
That’s my point, you need to assign utility to processes rather than just outcomes.
I am in fact doing both, in this case mostly against utilitarianism.
There is a difference between assuming the truth lies between two extremes, and assigning significant probability (say ~50%) to each of the two extremes. I’m trying to do the latter.
This thing allows you to see all contributions by a given user on the same page, so you can Ctrl-F through them. (OTOH, it is quite slow, at least on my system.)
Thank you. I found the thread about the video, but I’m not sure I replied to the discussion of discounting excellent results from people who aren’t expected to produce them. On the other hand, there doesn’t seem to be a problem with not finding it since there’s a consensus that it’s the sort of thing which would be plausible to find at LW.
Can you provide a link to this?
I don’t think I’ve seen that on LW, but I also haven’t looked for it.
The version of the argument I’m familiar with boils down to ‘regression to the mean.’ Because tests provide imperfect estimates of the true ability, our final posterior is a combination of the prior (i.e. population ability distribution) and the new evidence.
Suppose someone scores 600 on a test whose mean is 500, and the test scores and underlying ability are normally distributed. Our prior belief that someone’s true ability is 590 is higher than our prior belief that their true ability is 600, which is higher than our prior belief that their true ability is 610, because the normal distribution is decreasing as you move away from the mean. If the test was off by 10, then it’s more likely to overestimate than underestimate. That is, our posterior is that it’s more likely that their real ability is 590 than 610. (Assuming it’s as easy to be positively lucky as negatively lucky, which is questionable.)
The same happens in the reverse direction: abnormally low scores are more likely to underestimate than overestimate the true ability (again, assuming it’s equally easy for luck to push up and down). Depending on the precision of the test, the end effect is probably small, but the size of the effect increases the more extreme the results are.
On math scores in particular, both the male mean and the male standard deviation are higher than the female mean and female standard deviation. The difference in standard deviations is discussed much less than the difference in means, but it turns out to be very important when calculating this effect. Thus, the chance that a female got an 800 on the Math SAT due to luck is higher than the chance that a male got an 800 on the Math SAT due to luck. Of course, the true ability necessary to get an 800 by luck is rather high, but could still be below some meaningful cutoff, and like Nancy points out, getting more evidence should make the posterior better reflect the true ability.
So the better a woman does, the less you believe she can actually do it. At what point do you update your prior about what women can do?
This is reminding me of How to Suppress Women’s Writing.
Not quite. (Saving assumptions for the end of the comment.) If a female got a 499 on the Math SAT, then my estimate of her real score is centered on 499. If she scores a 532, then my estimate is centered on 530; a 600, 593; an 800, 780. A 20 point penalty is bigger than a 7 point penalty, but 780 is bigger than 593, so if by “it” you mean “math” that’s not the right way to look at it, but if by “it” you mean “that particular score” then yes.
Note that this should also be done to male scores, with the appropriate means and standard deviations. (The std difference was smaller than I remembered it being, so the mean effect will probably dominate.) Males scoring 499, 532, 600, and 800 would be estimated as actually getting 501, 532, 596, and 784. So at the 800 level, the relative penalty for being female would only be 4 points, not the 20 it first appears to be.
Note that I’m pretending that the score is from 2012, the SAT is normally distributed with mean and variances reported here, the standard measurement error is 30, and I’m multiplying Gaussian distributions as discussed here. The 2nd and 3rd assumptions are good near the middle but weak at the ends; the calculation done at 800 is almost certainly incorrect, because we can’t tell the difference between a 3 or 4 sigma mathematician, both of whom would most likely score 800; we could correct for that by integrating, but that’s too much work for a brief explanation. Note also that the truncation of the normal distribution by having a max and min score probably underestimates the underlying standard deviations, and so the effect would probably be more pronounced with a better test.
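A minimal sketch of that Gaussian-multiplication shrinkage. The standard measurement error of 30 and the 2012 means are as stated above; the standard deviations (119 male, 113 female) are back-solved so that the output matches the numbers quoted, so treat them as approximate rather than official:

```python
def shrink(score, pop_mean, pop_sd, sem=30.0):
    """Posterior mean from multiplying a Gaussian prior (the population
    distribution) by a Gaussian likelihood (observed score, error sd = sem)."""
    w = pop_sd ** 2 / (pop_sd ** 2 + sem ** 2)   # weight given to the observation
    return pop_mean + w * (score - pop_mean)

for score in (499, 532, 600, 800):
    print(score,
          round(shrink(score, 532, 119)),   # male:   501, 532, 596, 784
          round(shrink(score, 499, 113)))   # female: 499, 530, 593, 780
```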
Another way to think about this is that a 2.25 sigma male mathematician will score 800, but a 2.66 sigma female mathematician is necessary to score 800, and >2.25 sigmas are 12 out of a thousand, whereas >2.66 sigmas are 4 out of a thousand.
This isn’t necessary if the prior comes from data that includes the individual in question, and is practically unnecessary in cases where the individual doesn’t appreciably change the distribution. Enough females take the SAT that one more female scorer won’t move the mean or std enough to be noticeable at the precision that they report it.
In the writing example, where we’re dealing with a long tail, then it’s not clear how to deal with the sampling issues. You’d probably make an estimate for the current individual under consideration just using historical data as your prior, and then incorporate them in the historical data for the next individual under consideration, but you might include them before doing the estimation. I’m sure there’s a statistician who’s thought about this much longer and more rigorously than I have.
Thanks for the details.
Can you see how this sort of thing, applied through a whole educational career, would tend to discourage learning and accomplishment?
Even if it’s true (at least until transhumanism really gets going) that the best mathematicians will always be men, it’s not as though second rank mathematicians are useless.
Yes. In general, I recommend that people try to do the best they can with themselves, and not feel guilty about relative performance unless that guilt is motivating for them. If gatekeepers want to use this sort of effect in their reasoning, they should make it quantitative, rather than a verbal justification for a bias.
It’s not clear how desirable accurate expectations of future success are. To use startups as an example, 10% of startups succeed, but founders seem to put their chance of success at over 90%, and this may be better than more realistic expectations and fewer startups. For clever women, though, there seems to be a significant amount of pressure to go into STEM fields, followed by high rates of burnout and transfer away from STEM work. What rate of burnout would be strong evidence for overencouragement? I’m not sure.
Having to deal with biased gatekeepers isn’t the same thing as feeling guilty about relative ability, even if some of the same internal strategies would help with both.
How likely is this?
Agreed; that phrase was more appropriate in an earlier draft of the comment, and became less appropriate when I deleted other parts which mused about how much people should expect themselves to regress towards the population mean. They have a lot of private information about themselves, but it’s not clear to me that they have good information about the rest of the population, and so it seems easier to judge one’s absolute than one’s relative competence.
On topic to dealing with biased gatekeepers, it seems self-defeating to use the presence of obstacles as a discouraging rather than encouraging factor, conditioned on the opportunity being worth pursuing. Since the probability of success is an input to the calculation of whether or not an opportunity is worth pursuing, it’s not clear when and how much accuracy in expectations is desirable.
I don’t know enough about the population of gatekeepers to comment on the likelihood of finding it in the field, but I am confident in it as a prescription.
Burnout might be related to factors other than not being able to do the work well enough. It could be a matter of hostile work environment.
From what I’ve read, women are apt to do more housework and childcare than their spouses, so there might be a matter of total work hours—or that one might be balanced out by men taking jobs with longer commutes.
I find it interesting that you cite evidence that is exactly what traditionalist theories of gender would predict, and don’t even mention them as a possible explanation.
I’m less and less surprised to see interesting comments like this at 0 karma.
I took your “apt” at first to mean “more able to”!
As this sort of thing becomes more common, it will be necessary to take into account the fact that others are also doing this when making these calculations.
And once transhumanism gets going it will be the case that the best mathematicians will be the people who received intelligence upgrade “Euler” as children. My point is that if you’re hoping for transhumanism because it will solve problems with inequality of ability, you should be careful what you wish for.
I just threw in the bit about transhumanism for completeness.
Needing to get the implants in childhood is probably an early phase—I’m expecting that more and better plasticity for adults will also get developed.
Well, unconstrained self-modification can have even more unpleasant results.
It seems to me that, given people are already sexist, and given that telling someone their group has a lower average directly lowers their performance, such a re-weighting should never ever be used.
I’m not sure you’re using the right numbers for the variability. The material I’m finding online indicates that ’30 points with 67% confidence’ is not the meaningful number, but simply the r correlation between 2 administrations of the SAT: the percent of regression is 100*(1-r).
The 2011 SAT test-retest reliabilities are all around 0.9 (the math section is 0.91-0.93), so that’s 10%.
Using your female math mean of 499, a female score of 800 would be regressed to 800 − ((800 − 499) * 0.1) = 769.9. Using your male math mean of 532, a male score of 800 would regress down to 800 − ((800 − 532) * 0.1) = 773.2.
Hmm. You’re right that test-retest reliability typically refers to a correlation coefficient, and I was using the standard error of measurement. I’ll edit the grandparent to use the correct terms.
I’m not sure I agree with your method because it seems odd to me that the standard deviation doesn’t impact the magnitude of the regression to the mean effect. It seems like you could calculate the test-retest reliability coefficient from the population mean, population std, and standard measurement error std, and there might be different reliability coefficients for male and female test-takers, and then that’d probably be the simpler way to calculate it.
Well, it delivers reasonable numbers, it seems to me that one ought to employ reliability somehow, is supported by the two links I gave, and makes sense to me: standard deviation doesn’t come into it because we’ve already singled out a specific datapoint; we’re not asking how many test-scorers will hit 800 (where standard deviation would be very important) but given that a test scorer has hit 800, how will they fall back?
Now that I’ve run through the math, I agree with your method. Supposing the measurement error is independent of score (which can’t be true because of the bounds, and in general probably isn’t true), we can calculate the reliability coefficient by (pop var)/(pop var + measurement var)=.93 for women and .94 for men. The resulting formulas are the exact same, and the difference between the numbers I calculated and the numbers you calculated comes from our differing estimates of the reliability coefficient.
In general, the reliability coefficient doesn’t take into account extra distributional knowledge. If you knew that scores were power-law distributed in the population but the test error were normally distributed, for example, then you would want to calculate the posterior the long way: with the population data as your prior distribution and the measurement distribution as your likelihood ratio distribution, and the posterior is the renormalized product of the two. I don’t think that using a linear correction based on the reliability coefficient would get that right, but I haven’t worked it out to show the difference.
That makes sense, but I think the SAT is constructed like IQ tests to be normally rather than power-law distributed, so in this case we get away with a linear correction like reliability.
Yes; “extraordinary claims require extraordinary evidence, but ordinary claims require only ordinary evidence.” If a random person tells me that they are a Rhodes Scholar and a certified genius, I will be more skeptical than if they told me they merely went to Harvard, and more skeptical of that than if they told me they went to community college. And at some level of ‘better’ I will stop believing them entirely.
To go back to the multilevel model framework: a single high data point/group will be pulled back down to the mean of the population data points/group (how much will depend on the quality of the test), while the combined mean will slightly increase.
However, this increase may be extremely small, as makes sense. If you know from the official SAT statistics that 3 million women took the SAT last year and scored an average of 1200 (or whatever a medium score looks like these days, they keep changing the test), then that’s an extremely informative number which will be hard to change since you already know of how millions of women have done in the past: so whatever you learn from a single random woman scoring 800 this year will be diluted like 1 in 3 million...
Nifty: I’ve found an explanation of Stein’s paradox, and it turns out to be basically shrinkage!
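For the curious, here is a toy sketch of the connection: the (positive-part) James-Stein estimator just shrinks each group’s observed value toward the grand mean, much like the score shrinkage discussed above. The data below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng()

# Made-up true group means, each observed once with known noise.
true_means = np.array([490.0, 500.0, 510.0, 520.0, 530.0, 540.0])
sigma = 30.0
obs = true_means + rng.normal(0, sigma, true_means.size)

# Positive-part James-Stein: shrink each observation toward the grand mean.
grand = obs.mean()
k = obs.size
factor = max(0.0, 1 - (k - 3) * sigma**2 / ((obs - grand) ** 2).sum())
js = grand + factor * (obs - grand)

def total_sq_err(est):
    return float(((est - true_means) ** 2).sum())

# In expectation (for k >= 4) the shrunk estimates beat the raw observations.
print(total_sq_err(obs), total_sq_err(js))
```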
Ahh… “Expect regression to the mean.”
The funny thing is this kind of discrimination can lead to (or appear to lead to) the average elite woman being MORE qualified than the average man at a similar level.
Only if you overdo it.
What are the odds?
Also, do you apply a downwards adjustment to your evaluation of a woman’s original mathematics?
As randomness* would have it, I just ran into an example of women doing that to a woman for her fiction.
*On the radio as I was catching up on the thread.
Just read the article. Given the information presented my prior is that Jamaica Kincaid got her job due to (possibly informal) affirmative action, i.e., the New Yorker felt like they needed a black female writer to be “diverse”.
You don’t know how many black female authors they’ve got, and you haven’t read any of her work.
True. This is my prior for “black female author gets extremely fast tracked” and the article didn’t say anything that would make me update away from it.
Depends on what other evidence I have.
It occurs to me that from Vaniver’s explanation one could also derive the sentence “So the better a man does, the less you believe he can actually do it.” As far as I can tell, the processes of drawing either of the two conclusions are isomorphic. For that matter, the same reasoning would also lead to the derivation “So the worse a woman does, the more you believe she is actually better.” (With an analogous statement for men. This is explicitly pointed out in the explanation.)
The difference between the men and the women is point where we switch from “better/less” to “worse/more”, and the magnitude of the effect as we get further away from that point. (That is, the mean and the standard deviation.)
I can’t figure out a way of saying this without making me sound bad even to myself, but it seems… I don’t know, annoying at least, that you picked a logical conclusion that applies exactly the same to both genders, but applied it only to women; that you don’t mention at all what appears to be the only factual assertion of an actual difference between the abilities of women and men (and which I haven’t seen actually contested in either this or the earlier discussion on the subject); that you did not in fact criticise Vaniver’s explanation (which, by the way, as far as I can tell from his post, is just an explanation for beo’s benefit; I can’t deduce from its text that he’s actually endorsing using the procedure); and that at the same time you manage to make both him and me, even before I participate, seem as if we should be ashamed of ourselves, by sort of implying that he’ll also do something else not mentioned by him, and not logically implied by the explanation, and that would have a bad consequence if done very badly. (Well, it feels that way to me; I can’t tell if Vaniver took umbrage, nor if I’m actually reading correctly the society around me with respect to which the shame relates.)
I’m not sure if I have a point, exactly, I’m sort of just sharing my feelings in case it generates some insight. I don’t think you did this as an intentional dishonesty. It’s weird, it looks like there’s a blind spot exactly in the direction you’re looking at (after all, this is exactly the topic of the discussion).
But then again I also feel like I have such a blind spot, like it’s impolite that I should have noticed this, or even that I’m a bad person for not agreeing with your connotation, and I can’t tell why. (And that I’m some sort of misogynistic pig because I can’t see it.)
I seem to have that reaction quite often around this kind of discussion. I usually get sort of angry, go away, and dismiss the particular person that caused the reaction, but (I like to think) that’s only because I have low priors on people in general, which doesn’t apply here, and it seems worse somehow.
As far as I can tell I actually like men much less than women (in the “being around them” sense), and it feels as if I’m very inclined to equality, but somehow this kind of feminism seems very annoying. (I’m not exactly sure what I mean when I say “this kind of feminism”. The kind that argues for better women’s rights in some Islamic countries isn’t annoying, except in the sense that it gets me angry at humanity, but then again that’s kind of expected in my society, so it doesn’t say much.)
Shouldn’t it be possible to estimate the magnitude of this effect by comparing score distributions on tests with differently sized question pools, or write-in versus multiple choice, or which are otherwise more or less susceptible to luck?
You’d need a model of how much luck depends on those factors. Test-retest variability gives a good measure of how much one person’s scores vary from test to test; apparently for the SAT the test-retest standard deviation is about 30 points. (We can’t quite apply this number, since it might not be independent of score, but it’s better than nothing.)
That’s part of the whole “getting more information” thing.
I think.
The regression to the mean adjustment can be seen as a limited form of hierarchical/multilevel models with a fixed population mean, so any one score gets shrunk toward the population mean.
(I was reading about them because apparently the pooling eliminates multiple comparison problems, and Gelman is a big fan of them.)
Sorry, but no. I was hoping that someone with a better memory and/or better search skills would be able to find the links.
I attempted a few searches with things like “test results, luck, lucky, group, prior, blues, gatekeeper, good day, second test”, etc.
Found nothing that fits what you were describing, unfortunately. Perhaps a few less-common terms from the discussion, if you remember any, or (even better) any specific sentence or formulation used there, might help when combined.