These sorts of people are not very interested in actually developing substantive theory or testing their claims in strong ways which might disprove them.
Instead they are mainly interested in providing a counternarrative to progressive theories.
They often use superficial or invalid psychometric methods.
They often make insinuations that they have some deep theory or deep studies, but really actually don’t.
These things are bad, but, apart from point 2, I would ask: how do they compare to the average quality of social science research? Do you have high standards, or do you just have high standards for one group? I think most of us spend at least some time in environments where the incentive gradients point towards the latter. Beware isolated demands for rigor.
Research quality being what it is, I would recommend against giving absolute trust to anyone, even if they appear to have earned it. If there’s a result you really care about, it’s good to pick at least one study and dig into exactly what they did, and to see if there are other replications; and the prior probability of “fraud” probably shouldn’t go below 1%.
As for point 2—if you were a researcher with heretical opinions, determined to publish research on at least some of them, what would you do? It seems like a reasonable strategy is to pick something heretical that you’re confident you can defend, and do a rock-solid study on it, and brace for impact. Is it still the case that disproving the blank-slate hypothesis would constitute progress in some academic subfields? If so, then expect people to continue trying it.
The study says there was “a meta-analysis concluding that small monetary incentives could improve test scores by 0.64 SDs” (roughly 10 IQ points); looks to be Duckworth et al. 2011. The guy says it seemed sketchy—the studies had small N, weird conditions, and/or fraudulent researchers. Looking at table S1 from Duckworth, indeed, N is <100 on most of the studies; “Bruening and Zella (1978)” sticks out as having a large effect size and a large N, and, when I google for more info about that, I find that Bruening was convicted by an NIMH panel of scientific fraud. Checks out so far.
The guy ran a series of studies, the last of which offered incentives of nil, £2, and £5-£10 for test performance, with the smallest subgroup being N=150, taken from the adult population via “prolific academic”. He found that £2 and £5-£10 had similar effects, those being apparently 0.2 SD and 0.15 SD respectively, which would be 3 IQ points or a little less. (Were the “small monetary incentives” from Duckworth of that size? The Duckworth table shows most of the studies as being in the $1-$9 or <$1 range; looks like yes.) So, at least as a “We suspected these results were bogus, tried to reproduce them, and got a much smaller effect size”, this seems all in order.
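(For reference, the SD-to-IQ-point conversions above are nothing more than multiplying by the IQ scale’s standard deviation of 15; a trivial sketch:)

```python
# IQ scores are conventionally scaled to mean 100, SD 15, so an effect in SD
# units converts to IQ points by multiplying by 15.
IQ_SD = 15
for label, d in [("Duckworth meta-analytic estimate", 0.64),
                 ("replication, £2 incentive", 0.20),
                 ("replication, £5-£10 incentive", 0.15)]:
    print(f"{label}: {d} SD ~ {d * IQ_SD:.1f} IQ points")
```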
Now, you say:
IQ test effort correlates with IQ scores, and they investigate whether it is causal using incentives. However, as far as I can tell, their data analysis is flawed, and when performed correctly the conclusion reverses.
[...] Incentives increase effort, but they only have marginal effects on performance. Does this show that effort doesn’t matter? No, because incentives also turn out to only have marginal effects on effort! Surely if you only improve effort a bit, you wouldn’t expect to have much influence on scores. We can solve this by a technique called instrumental variables. Basically, we divide the effect of incentives on scores by the effect of incentives on effort.
Your analysis essentially proposes that, if there were some method of increasing effort by 3-4x as much as he managed to increase it, then maybe you could in fact increase IQ scores by 10 points. This assumes that the effort-to-performance causation would stay constant as you step outside the tested range. That’s possible, but… I’m quite confident there’s a limit to how much “effort” can increase your results on a timed multiple-choice test, that you’ll hit diminishing marginal returns at some point (probably even negative marginal returns, if the incentive is strong enough to make many test-takers nervous), and extrapolating 3-4x outside the achieved effect seems dubious. (I also note that the 1x effect here means increasing your self-evaluated effort from 4.13 to 4.28 on a scale that goes up to 5, so a 4x effect would mean going to 4.73, approaching the limits of the scale itself.)
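To make the quoted instrumental-variables logic concrete, here is a minimal sketch of the Wald-style estimator it describes (divide the incentive-to-score shift by the incentive-to-effort shift), run on simulated, made-up data rather than the study’s numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated, illustrative data only: a binary incentive nudges self-rated
# effort up slightly, and effort in turn causally raises the (standardized)
# test score. None of these numbers come from the actual study.
incentive = rng.integers(0, 2, n)                              # Z: 0/1 incentive
effort = 4.1 + 0.15 * incentive + rng.normal(0, 0.5, n)        # X: effort, 1-5 scale
true_effect = 0.4                                              # per-unit effect of effort
score = true_effect * effort + rng.normal(0, 1, n)             # Y: standardized score

reduced_form = score[incentive == 1].mean() - score[incentive == 0].mean()   # Z -> Y
first_stage = effort[incentive == 1].mean() - effort[incentive == 0].mean()  # Z -> X
wald = reduced_form / first_stage  # the "divide one effect by the other" step

print(f"first stage (effort shift):  {first_stage:.3f}")
print(f"reduced form (score shift):  {reduced_form:.3f}")
print(f"Wald/IV estimate:            {wald:.2f}  (true per-unit effect: {true_effect})")
```

The division is exactly where the extrapolation worry bites: the ratio treats the per-unit effect of effort as constant, even though the incentive only moved effort by a fraction of a scale point.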
You say, doing your analysis:
For study 2, I get an effect of 0.54. For study 3, I get an effect of 0.37. For study 4, I get an effect of 0.39. The numbers are noisy for various reasons, but this all seems to be of a similar order of magnitude to the correlation in the general population, so this suggests the correlation between IQ and test effort is due to a causal effect of test effort increasing IQ scores.
That is interesting… Though the correlation between test effort and test performance in the studies is given as 0.27 and 0.29 in different samples, so, noise notwithstanding, your effects are consistently larger by a decent margin. That would suggest that there’s something more going on than the simple causal story.
The authors say:
6.1. Correlation and direction of causality
Across all three samples and cognitive ability tests (sentence verification, vocabulary, visual-spatial reasoning), the magnitude of the association between effort and test performance was approximately 0.30, suggesting that higher levels of motivation are associated with better levels of test performance. Our results are in close accord with existing literature [...]
As is well-known, the observation of a correlation is a necessary but not sufficient condition for causality. The failure to observe concomitant increases in test effort and test performance, when test effort is manipulated, suggests the absence of a causal effect between test motivation and test performance.
That last sentence is odd, since there was in fact an increase in both test effort and test performance. Perhaps they’re equivocating between “low effect” and “no effect”? (Which is partly defensible in that the effect was not statistically significant in most of the studies they ran. I’d still count it as a mark against them.) The authors continue:
Consequently, the positive linear association between effort and performance may be considered either spurious or the direction of causation reversed – flowing from ability to motivation. Several investigations have shown that the correlation between test-taking anxiety and test performance likely flows from ability to test-anxiety, not the other way around (Sommer & Arendasy, 2015; Sommer, Arendasy, Punter, Feldhammer-Kahr, & Rieder, 2019). Thus, if the direction of causation flows from ability to test motivation, it would help explain why effort is so difficult to shift via incentive manipulation.
6.2. Limitations & future research
We acknowledge that the evidence for the causal direction between effort and ability remains equivocal, as our evidence rests upon the absence of evidence (absence of experimental incentive effect). Ideally, positive evidence would be provided. Indirect positive evidence may be obtained by conducting an experiment, whereby half the subjects are given a relatively easy version of the paper folding task (10 easiest items) and the other half are given a relatively more difficult version (10 most difficult items). It is hypothesized that those given the relatively easier version of the paper folding task would then, on average, self-report greater levels of test-taking effort. Partial support for such a hypothesis is apparent in Table 1 of this investigation. Specifically, it can be seen that there is a perfect correspondence between the difficulty of the test (synonyms mean 73.4% correct; sentence verification mean 53.8% correct; paper folding mean 43.3%) and the mean level of reported effort (synonyms mean effort 4.42; sentence verification mean 4.11; paper folding mean 3.83).
That is a pretty interesting piece of evidence for the “ability leads to self-reported effort” theory.
Overall… The study seems to be a good one: doing a large replication study on prior claims. The presentation of it… The author on Twitter said “testing over N= 4,000 people”, which is maybe what you get if you add up the N from all the different studies, but each study is considerably smaller; I found that somewhat misleading, but suspect that’s a common thing when authors report multiple studies at once. On Twitter he says “We conclude that effort has unequivocally small effects”, which omits caveats like “our results are accurate to the degree that alternative incentives do not yield appreciably larger effects” which are in the paper; this also seems like par for the course for science journalism (not to mention Twitter discourse). And they seem to have equivocated in places between “low effect” and “no effect”. (Which I suspect is also not rare, unfortunately.)
Now. You presented this as:
Here’s a classical example; an IQ researcher who is so focused on providing a counternarrative to motivational theories that he uses methods which are heavily downwards biased to “prove” that IQ test scores don’t depend on effort.
The “focused on providing a counternarrative” part is plausibly correct. However, the “uses methods which are heavily downwards biased to “prove” [...]” is not. The “downwards biased methods” are “offering a monetary incentive of £2-£10, which turned out to be insufficient to change effort much”. The authors were doing a replication of Duckworth, in which most of the cited studies had a monetary incentive of <$10—so that part is correctly matched—and they used high enough N that Duckworth’s claimed effect size should have shown up easily. They also preregistered the first of their incentive-based studies (with the £2 incentive), and the later ones were the same but with increased sample size, then increased incentive. In other words, they did exactly what they should have done in a replication. To claim that they chose downwards-biased methods for the purpose of proving their point seems quite unfair; those methods were chosen by Duckworth.
This seems to be a data point of the form “your priors led you to assume bad faith (without having looked deeply enough to discover this was unjustified), which then led you to take this as a case to justify those priors for future cases”. (We will see more of these later.) Clearly this could be a self-reinforcing loop that, over time, could lead one’s priors very far astray. I would hope anyone who posts here would recognize the danger of such a trap.
Second example. “Simon Baron-Cohen playing Motte-Bailey with the “extreme male brain” theory of autism.” Let’s see… It seems uncontroversial (among the participants in this discussion) that there are dimensions on which male and female brains differ (on average), and on which autists are (on average) skewed towards the male side, and that this includes the empathizing and systematizing dimensions.
You quote Baron-Cohen as saying “According to the ‘extreme male brain’ theory of autism, people with autism or AS should always fall in the [extreme systematizing range]”, and say that this is obviously false, since there exist autists who are not extreme systematizers—citing a later study coauthored by Baron-Cohen himself, which puts only ~10% of autists into the “Extreme Type S” category. You say he’s engaging in a motte-and-bailey.
After some reading, this looks to me like a case of “All models are wrong, but some are useful.” The same study says “Finally, we demonstrate that D-scores (difference between EQ and SQ) account for 19 times more of the variance in autistic traits (43%) than do other demographic variables including sex. Our results provide robust evidence in support of both the E-S and EMB theories.” So, clearly he’s aware that 57% of the variance is not explained by empathizing-systematizing. I think it would be reasonable to cast him as saying “We know this theory is not exactly correct, but it makes some correct predictions.” Indeed, he counts the predictions made by these theories:
An extension of the E-S theory is the Extreme Male Brain (EMB) theory (11). This proposes that, with regard to empathy and systemizing, autistic individuals are on average shifted toward a more “masculine” brain type (difficulties in empathy and at least average aptitude in systemizing) (11). This may explain why between two to three times more males than females are diagnosed as autistic (12, 13). The EMB makes four further predictions: (vii) that more autistic than typical people will have an Extreme Type S brain; (viii) that autistic traits are better predicted by D-score than by sex; (ix) that males on average will have a higher number of autistic traits than will females; and (x) that those working in science, technology, engineering, and math (STEM) will have a higher number of autistic traits than those working in non-STEM occupations.
Note also that he states the definition of EMB theory as saying “autistic individuals are on average shifted toward a more “masculine” brain type”. You say “Sometimes EMB proponents say that this isn’t really what the EMB theory says. Instead, they make up some weaker predictions, that the theory merely asserts differences “on average”.” This is Baron-Cohen himself defining it that way.
Would it be better if he used a word other than “theory”? “Model”? You somewhat facetiously propose “If the EMB theory had instead been named the “sometimes autistic people are kinda nerdy” theory, then it would be a lot more justified by the evidence”. How about, say, the theory that “There are processes that masculinize the brain in males; and some of those processes going into overdrive is a thing that causes autism”? (Which was part of the original paper: “What causes this shift remains unclear, but candidate factors include both genetic differences and prenatal testosterone.”) That is, in fact, approximately what I found when I googled for people talking about the EMB theory—and note that the article is critical of the theory:
This hypothesis, called the ‘extreme male brain’ theory, postulates that males are at higher risk for autism as a result of in-utero exposure to steroid hormones called androgens. This exposure, the theory goes, accentuates the male-like tendency to recognize patterns in the world (systemizing behavior) and diminishes the female-like capacity to perceive social cues (socializing behavior). Put simply, boys are already part way along the spectrum, and if they are exposed to excessive androgens in the womb, these hormones can push them into the diagnostic range.
That is the sense in which an autistic brain is, hypothetically, an “extreme male brain”. I guess “extremely masculinized brain” would be a bit more descriptive to someone who doesn’t know the context.
The problem with a motte-and-bailey is that someone gets to go around advancing an extreme position, and then, when challenged by someone who would disprove it, he avoids the consequences by claiming he never said that, he only meant the mundane position. According to you, the bailey is “they want to talk big about how empathizing-systematizing is the explanation for autism”. According to the paper, it was 43% of the explanation for autism, and the biggest individual factor? Seems pretty good.
Has Baron-Cohen gone around convincing people that empathizing-systematizing is the only factor involved in autism? I suspect that he doesn’t believe it, he didn’t mean to claim it, almost no one (except you) understood him as claiming it, and pretty much no one believes it. Maybe he picked a suboptimal name, which lent itself to misinterpretation. Do you have examples of Baron-Cohen making claims of that kind, which aren’t explainable as him taking the “This theory is not exactly correct, but it makes useful predictions” approach?
The context here is explaining why you’ve “become horrified at what [you] once trusted”, which you now call “supposed science”. I’m… underwhelmed by what I’ve seen.
Back to Damore...
I think Damore’s point, in bringing it up, was that the stress in (some portion of) tech jobs may be a reason there are fewer women than men in tech.
You may or may not be right that this is what he meant.
...I thought it was overkill to cite four quotes on that issue, but apparently not. Such priors!
(I think it’s a completely wrong position, because the sex difference in neuroticism is much smaller (by something like 2x) than the sex difference in tech interests and tech abilities, and presumably the selection effect for neuroticism on career field is also much smaller than that of interests. So I’m not sure your reading on it is particularly more charitable, only uncharitable in a different direction; assuming a mistake rather than a conflict.)
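(To put rough numbers on why the size of the mean difference matters so much for selection: under a simple normal model with equal variances, which is of course an idealization, a given mean gap produces much larger over-representation far out in the tail, and halving the gap shrinks that over-representation a lot. A quick sketch:)

```python
from scipy.stats import norm

# Ratio of the two groups' shares above a cutoff when one group's mean is
# shifted up by d standard deviations (pooled SD = 1). Purely illustrative
# normal-model arithmetic, not an estimate from any dataset.
for d in (0.25, 0.5, 1.0):
    ratios = [norm.sf(z - d) / norm.sf(z) for z in (1.0, 2.0, 3.0)]
    print(f"d = {d:.2f}: over-representation at +1/+2/+3 SD = "
          + ", ".join(f"{r:.1f}x" for r in ratios))
```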
It seems you’re saying Damore mentions A but not B, and B is bigger, therefore Damore’s “comprehensive” writeup is not so, and this omission is possibly ill-motivated. But, erm, Damore does mention B, twice:
[Women, on average have more] Openness directed towards feelings and aesthetics rather than ideas. Women generally also have a stronger interest in people rather than things, relative to men (also interpreted as empathizing vs. systemizing).
○ These two differences in part explain why women relatively prefer jobs in social or artistic areas. More men may like coding because it requires systemizing and even within SWEs, comparatively more women work on front end, which deals with both people and aesthetics.
[...]
Women on average show a higher interest in people and men in things
○ We can make software engineering more people-oriented with pair programming and more collaboration. Unfortunately, there may be limits to how people-oriented certain roles at Google can be and we shouldn’t deceive ourselves or students into thinking otherwise (some of our programs to get female students into coding might be doing this).
This suggests that casting aspersions on Damore’s motives is not gated by “Maybe I should double-check what he said to see if this is unfair”.
I think the anxiety/stress thing is more relevant for top executive roles than for engineer roles; a population-level difference is more important at the extremes. Damore does talk about leadership specifically:
We always ask why we don’t see women in top leadership positions, but we never ask why we see so many men in these jobs. These positions often require long, stressful hours that may not be worth it if you want a balanced and fulfilling life.
Next:
(Incidentally, imagine if Damore had claimed the opposite—”Women are less prone to anxiety and can handle stress more easily.” Wouldn’t that also lead to accusations that Damore was saying we can ignore women’s problems?)
The correct thing to claim is “We should investigate what people are anxious/stressed about”. Jumping to conclusions that people’s states are simply a reflection of their innate traits is the problem.
Well, he lists one source of stress above, and he does recommend to “Make tech and leadership less stressful”.
I don’t think this is at the heart of Zack’s adventure? Zack’s issues were mainly about leading rationalists jumping in to rationalize things in the name of avoiding conflicts.
And why would these rationalists care so much about avoiding these conflicts, to the point of compromising the intellectual integrity that seems so dear to them? Fear that they’d face the kind of hostility and career-ruining accusations directed at Damore, and things downstream of fears like that, seems like a top candidate explanation.
Anyway, making weighty claims about people is core to what differential psychology is about.
Um. Accusations are things you make about individuals, occasionally organizations. I hope that the majority of differential psychology papers don’t consist of “Bob Jones has done XYZ bad thing”.
It’s possible that some of my claims about Damore are false, in which case we should discuss that and fix the mistakes. However, the position that one should just keep quiet about claims about people simply because they are weighty would also seem to imply that we should keep quiet about claims about trans people and masculinity/femininity, or race and IQ, or, to make the Damore letter more relevant, men/women and various traits related to performance in tech.
You are equivocating between reckless claims of misconduct / malice by an individual, and heavily cited claims about population-level averages that are meant to inform company policy. Are you seriously stating an ethical principle that anyone who makes the latter should expect to face the former and it’s justified?
Somewhat possible this is true. I think nerdy communities like LessWrong should do a better job at communicating the problems with various differential psychology findings and communicating how they are often made by conservatives to promote an agenda. If they did this, perhaps Damore would not have been in this situation.
I think Damore was aware that there are people who use population-level differences to justify discriminating against individuals, and that’s why he took pains to disavow that. As for “the problems with various differential psychology findings”—do you think that some substantial fraction, say at least 20%, of the findings he cited were false?
Second example. “Simon Baron-Cohen playing Motte-Bailey with the “extreme male brain” theory of autism.” Let’s see… It seems uncontroversial (among the participants in this discussion) that there are dimensions on which male and female brains differ (on average), and on which autists are (on average) skewed towards the male side, and that this includes the empathizing and systematizing dimensions.
Quick update!
I found that OpenPsychometrics has a dataset for the EQ/SQ tests. Unfortunately, there seems to be a problem with the data for the EQ items, but I just ran a factor analysis for the SQ items to take a closer look at your claims here.
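Roughly, the analysis was of this kind; this is only a sketch, and the file name, factor count, rotation, and item counts here are placeholders rather than my exact script:

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical export of item-level SQ responses: one row per respondent,
# one column per item. "sq_items.csv" is a placeholder name.
items = pd.read_csv("sq_items.csv")

# Exploratory factor analysis with an oblique rotation, so the factors are
# allowed to correlate with each other (plausible for interest-type items).
fa = FactorAnalyzer(n_factors=4, rotation="oblimin")
fa.fit(items)
loadings = pd.DataFrame(fa.loadings_, index=items.columns)

# For each factor, take its highest-loading items, average them into a rough
# subscale (ignoring reverse-keying for brevity), and correlate the subscales.
subscales = pd.DataFrame({
    f"factor_{k}": items[loadings[k].abs().nlargest(5).index].mean(axis=1)
    for k in loadings.columns
})
print(subscales.corr().round(2))
```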
There appeared to be 3 or 4 factors underlying the correlations on the SQ test, which I’d roughly call “Technical interests”, “Nature interests”, “Social difficulties” and “Jockyness”. I grabbed the top loading items for each of the factors, and got this correlation matrix:
The correlation between the technical interests and the nature interests plausibly reflects the notion that Systematizing is a thing, though I suspect that it could also be found to correlate with all sorts of other things that would not be considered Systematizing? Like non-Systematizing ways of interacting with nature. Idk though.
The sex differences in the items were limited to the technical interests, rather than also covering the nature interests. This does not fit a simple model of a sex difference in general Systematizing, but it does fit a model where the items are biased towards men but there is not much sex difference in general Systematizing.
I would be inclined to think that the Social difficulties items correlate negatively with Empathizing Quotient or positively with Autism Spectrum Quotient. If we are interested in the correlations between general Systematizing and these other factors, then this could bias the comparisons. On the other hand, the Social difficulties items were not very strongly correlated with the overall SQ score, so maybe not.
I can’t immediately think of any comments for the Jockyness items.
Overall, I strongly respect the fact that he made many of the items very concrete, but I now also feel like I have proven that the gender differences on Systematizing are driven by psychometric shenanigans, and I strongly expect to find that many of the other associations are also driven by psychometric shenanigans.
I’ve sent an email asking OpenPsychometrics to export the Empathizing Quotient items too. If he does so, I hope to write a top-level post explaining my issues with the psychometrics here.
Hm, actually I semi-retract this; the OpenPsychometrics data seems to be based on the original Systematizing Quotient, whereas there seems to be a newer one called Systematizing Quotient-Revised, which is supposedly more gender-neutral. Not sure where I can get data on this, though. Will go looking.
Edit: Like I am still pretty suspicious about the SQ-R. I just don’t have explicit proof that it is flawed.
Oops, upon reading more about the SQ, I should correct myself:
Some of the items, such as S16, are “filler items” which are not counted as part of the score; these are disproportionately part of the “Social difficulties” and “Jockyness” factors, so that probably reduces the amount of bias that can be introduced by those items, and it also explains why they don’t correlate very much with the overall SQ scores.
But some of the items for these factors, such as S31, are not filler items, and instead get counted for the test, presumably because they have cross-loadings on the Systematizing factor. So the induced bias is probably not zero.
If I get the data from OpenPsychometrics, I will investigate in more detail.
Since I don’t have data on the EQ, here’s a study where someone else worked with it. They found that the EQ had three factors, which they named “Cognitive Empathy”, “Emotional Empathy” and “Social Skills”. The male-female difference was driven by “Emotional Empathy” (d=1), whereas the autistic-allistic difference was driven by “Social Skills” (d=1.3). The converse differences were much smaller, 0.24 and 0.66. As such, it seems likely that the EQ lumps together two different kinds of “empathizing”, one of which is feminine and one of which is allistic.
As for point 2—if you were a researcher with heretical opinions, determined to publish research on at least some of them, what would you do? It seems like a reasonable strategy is to pick something heretical that you’re confident you can defend, and do a rock-solid study on it, and brace for impact. Is it still the case that disproving the blank-slate hypothesis would constitute progress in some academic subfields? If so, then expect people to continue trying it.
I should also say, in the context of IQ and effort, some of the true dispute is about whether effort differences can explain race differences in scores. And for that purpose, what I would do is to go more directly into that.
In fact, I have done so. Quoting some discussion I had on Discord:
Me: Oh look at this thing I just saw
(correlation matrix with 0 correlation between race and test effort highlighted)
Other person: That is a really good find. Where’s it from?
Me: from the supplementary info to one of the infamous test motivation studies:
Me: Despite implying that test motivation explains racial gaps in the study text:
On the other hand, test motivation may be a serious confound in studies including participants who are below-average in IQ and who lack external incentives to perform at their maximal potential. Consider, for example, the National Longitudinal Survey of Youth (NLSY), a nationally representative sample of more than 12,000 adolescents who completed an intelligence test called the Armed Forces Qualifying Test (AFQT). As is typical in social science research, NLSY participants were not rewarded in any way for higher scores. The NLSY data were analyzed in The Bell Curve, in which Herrnstein and Murray (44) summarily dismissed test motivation as a potential confound in their analysis of black–white IQ disparities.
(This was way after I became critical of differential psychology btw. Around 2 months ago.)
These things are bad, but, apart from point 2, I would ask: how do they compare to the average quality of social science research? Do you have high standards, or do you just have high standards for one group? I think most of us spend at least some time in environments where the incentive gradients point towards the latter. Beware isolated demands for rigor.
I don’t know for sure as I am only familiar with certain subsets of social science, but a lot of it is in fact bad. I also often criticize normal social science, but in this context it was this specific area of social science that came up.
As for point 2—if you were a researcher with heretical opinions, determined to publish research on at least some of them, what would you do? It seems like a reasonable strategy is to pick something heretical that you’re confident you can defend, and do a rock-solid study on it, and brace for impact. Is it still the case that disproving the blank-slate hypothesis would constitute progress in some academic subfields? If so, then expect people to continue trying it.
I would try to perform studies that yield much more detailed information. For instance, mixed qualitative and quantitative studies where one qualitatively inspects the data points that are above-average or below-average for the regressions, to see whether there are identifiable missing factors.
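As a sketch of what I mean by the quantitative half of that (dataset and column names here are hypothetical): fit the usual regression, then pull out the cases it most under- and over-predicts and follow those up qualitatively.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical dataset: an outcome plus the predictors a study would normally
# regress it on. "study_data.csv" and the column names are placeholders.
df = pd.read_csv("study_data.csv")
X = sm.add_constant(df[["predictor_1", "predictor_2"]])
model = sm.OLS(df["outcome"], X).fit()

# Flag the cases the regression most under- and over-predicts; these are the
# ones to inspect qualitatively (interviews, open-ended responses, case notes)
# for identifiable missing factors.
df["residual"] = model.resid
print(df.nlargest(10, "residual"))   # doing much better than the model expects
print(df.nsmallest(10, "residual"))  # doing much worse than the model expects
```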
So, at least as a “We suspected these results were bogus, tried to reproduce them, and got a much smaller effect size”, this seems all in order.
If he had phrased his results purely as disproving the importance of incentives, rather than effort, I think it would have been fine.
Your analysis essentially proposes that, if there were some method of increasing effort by 3-4x as much as he managed to increase it, then maybe you could in fact increase IQ scores by 10 points. This assumes that the effort-to-performance causation would stay constant as you step outside the tested range. That’s possible, but… I’m quite confident there’s a limit to how much “effort” can increase your results on a timed multiple-choice test, that you’ll hit diminishing marginal returns at some point (probably even negative marginal returns, if the incentive is strong enough to make many test-takers nervous), and extrapolating 3-4x outside the achieved effect seems dubious. (I also note that the 1x effect here means increasing your self-evaluated effort from 4.13 to 4.28 on a scale that goes up to 5, so a 4x effect would mean going to 4.73, approaching the limits of the scale itself.)
I prefer to think of it as “if you increase your effort from being one of the lowest-effort people to being one of the highest-effort people, you can increase your IQ score by 17 IQ points”. This doesn’t seem too implausible to me, though admittedly I’m not 100% sure what the lowest-effort people are doing.
It’s valid to say that extrapolating outside of the tested range is dubious, but IMO this means that the study design is bad.
I think it’s likely that the limited returns to effort would be reflected in the limited bounds of the scale. So I don’t think my position is in tension with the intuition that there’s limits on what effort can do for you. Under this model, it is also worth noting that the effort scores were negatively skewed, so this implies that lack of effort is a bigger cause of low scores than extraordinary effort is of high scores.
That is interesting… Though the correlation between test effort and test performance in the studies is given as 0.27 and 0.29 in different samples, so, noise notwithstanding, your effects are consistently larger by a decent margin. That would suggest that there’s something more going on than the simple causal story.
I don’t think my results are statistically significantly different from 0.3ish; in the ensuing discussion, people pointed out that the IV results had huge error bounds (because the original study was only barely significant).
But also if there is measurement error in the instrument (effort), then that would induce an upwards bias in the IV estimated effect. So that might also contribute.
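For what it’s worth, here is a toy simulation of one way a mismeasured effort variable can push the ratio up. It assumes the mismeasurement acts like a ceiling effect on the bounded 1-5 self-report scale (purely classical noise in measured effort would mostly wash out of the ratio), so treat it as an illustration of a possible mechanism, not a claim about the actual data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Illustrative setup: the incentive raises true effort by 0.3; scores respond
# to true effort; but the self-report scale is capped at 5, so the *measured*
# effort shift is smaller than the true one, which inflates the ratio.
incentive = rng.integers(0, 2, n)
true_effort = 4.1 + 0.3 * incentive + rng.normal(0, 0.5, n)
score = 0.4 * true_effort + rng.normal(0, 1, n)
reported_effort = np.clip(true_effort, 1, 5)  # ceiling at the top of the scale

def mean_diff(x):
    return x[incentive == 1].mean() - x[incentive == 0].mean()

print("true first stage:     ", round(mean_diff(true_effort), 3))      # ~0.30
print("measured first stage: ", round(mean_diff(reported_effort), 3))  # smaller
print("Wald ratio using measured effort:",
      round(mean_diff(score) / mean_diff(reported_effort), 2), "(true effect: 0.4)")
```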
However, the “uses methods which are heavily downwards biased to “prove” [...]” is not. The “downwards biased methods” are “offering a monetary incentive of £2-£10, which turned out to be insufficient to change effort much”. The authors were doing a replication of Duckworth, in which most of the cited studies had a monetary incentive of <$10—so that part is correctly matched—and they used high enough N that Duckworth’s claimed effect size should have shown up easily. They also preregistered the first of their incentive-based studies (with the £2 incentive), and the later ones were the same but with increased sample size, then increased incentive. In other words, they did exactly what they should have done in a replication. To claim that they chose downwards-biased methods for the purpose of proving their point seems quite unfair; those methods were chosen by Duckworth.
Shitty replications of shitty environmentalist research are still shitty.
Like this sort of thing makes sense to do as a personal dispute between the researchers, but for all of us who’d hope to actually use or build on the research for substantial purposes, it’s no good if the researchers use shitty methods because they are trying to build a counternarrative against other researchers using shitty methods.
Let’s see… It seems uncontroversial (among the participants in this discussion) that there are dimensions on which male and female brains differ (on average), and on which autists are (on average) skewed towards the male side, and that this includes the empathizing and systematizing dimensions.
I wouldn’t confidently disagree with this, but I do have some philosophical nitpicks/uncertainties.
(“Brain” connotes neurology to me, yet I am not sure if empathizing and especially systematizing are meaningful variables on a neurological level. I would also need to double-check whether EQ/SQ are MI for sex and autism because I don’t remember whether they are. I suspect in particular the EQ is not, and it is the biggest driver of the EQ/SQ-autism connection, so it is pretty important to consider. But for the purposes of the Motte-Bailey situation, we can ignore that. Just tagging it as a potential area of disagreement.)
Would it be better if he used a word other than “theory”? “Model”? You somewhat facetiously propose “If the EMB theory had instead been named the “sometimes autistic people are kinda nerdy” theory, then it would be a lot more justified by the evidence”. How about, say, the theory that “There are processes that masculinize the brain in males; and some of those processes going into overdrive is a thing that causes autism”? (Which was part of the original paper: “What causes this shift remains unclear, but candidate factors include both genetic differences and prenatal testosterone.”)
I think what would be better would be if he clarified his models and reasoning. (Not positions, as that opens up the whole Motte-Bailey thing and also is kind of hard to engage with.) What is up with the original claim about autists always being extreme type S? Was this just a mistake that he would like to retract? If he only considers it to be a contributor that leads to half the variance, does he have any opinion on the nature of the other contributors to autism? Does he have any position on the relationship between autistic traits as measured by the AQ, and autism diagnosis? What should we make of the genetic contributors to autism being basically unrelated to the EQ/SQ? (And if the EQ/SQ are not MI for sex/autism, what does he make of that?)
Do you have examples of Baron-Cohen making claims of that kind, which aren’t explainable as him taking the “This theory is not exactly correct, but it makes useful predictions” approach?
This is part of the trouble: these areas do not have proper discussions.
It seems you’re saying Damore mentions A but not B, and B is bigger, therefore Damore’s “comprehensive” writeup is not so, and this omission is possibly ill-motivated.
...
This suggests that casting aspersions on Damore’s motives is not gated by “Maybe I should double-check what he said to see if this is unfair”.
No, I meant that under your interpretation, Damore mentions A when A is of negligible effect, and so that indicates a mistake. I didn’t mean to imply that he didn’t mention B, and I read this part of his memo multiple times prior to sending my original comment, so I was fully aware that he mentioned B.
Well, he lists one source of stress above, and he does recommend to “Make tech and leadership less stressful”.
But again the “Make tech and leadership less stressful” point boiled down to medicalizing it.
And why would these rationalists care so much about avoiding these conflicts, to the point of compromising the intellectual integrity that seems so dear to them? Fear that they’d face the kind of hostility and career-ruining accusations directed at Damore, and things downstream of fears like that, seems like a top candidate explanation.
Valid point.
Um. Accusations are things you make about individuals, occasionally organizations. I hope that the majority of differential psychology papers don’t consist of “Bob Jones has done XYZ bad thing”.
Differential psychology papers tend to propose ways to measure traits that they consider important, to extend previously created measures with new claims of importance, and to rank demographics by importance.
You are equivocating between reckless claims of misconduct / malice by an individual, and heavily cited claims about population-level averages that are meant to inform company policy. Are you seriously stating an ethical principle that anyone who makes the latter should expect to face the former and it’s justified?
I think in an ideal world, the research and the discourse would be more rational. For people who are willing to discuss and think about these matters rationally, it seems inappropriate to accuse them of misconduct/malice simply for agreeing with them. However if people have spent a long time trying to bring up rational discussion and failed, then it is reasonable for these people to assume misconduct/malice.
I think Damore was aware that there are people who use population-level differences to justify discriminating against individuals, and that’s why he took pains to disavow that.
Using population-level differences to justify discriminating against individuals can be fine and is not what I have been objecting to.
As for “the problems with various differential psychology findings”—do you think that some substantial fraction, say at least 20%, of the findings he cited were false?
I don’t know. My problem with this sort of research typically isn’t that it is wrong (though it sometimes may be) but instead that it is of limited informative value.
I should probably do a top-level review post where I dig through all his cites to look at which parts of his memo are unjustified and which parts are wrong. I’ll tag you if I do that.