The Jussim et al. review of that literature is worth reading. Expectations do seem to have a causal impact, but the effect is usually small relative to measures of past performance and ability, and teacher expectations tend to reflect past performance more than they cause future performance.
The review covers some serious challenges to the effect sizes claimed by Rosenthal and coauthors, such as effect sizes declining with sample size and publication bias. Or, regarding the original Pygmalion/Oak School experiment:
Snow (1995) also pointed out that the intelligence test used in Pygmalion was only normed for scores between 60 and 160. If one excluded all scores outside this range, the expectancy effect disappeared. Moreover, there were five “bloomers” with wild IQ score gains: 17-110, 18-122, 133-202, 111-208, and 113-211. If one simply excluded these five bizarre gains, the difference between the bloomers and the controls evaporated.
As an aside, Rosenthal pioneered meta-analysis in psychology because the effect only replicated a third of the time in the published literature (despite the presence of publication bias and QRPs). In doing so he promulgated a test for publication bias which implicitly assumed the absence of any publication bias, and so almost always output the conclusion that no publication bias was present. These methods were eagerly adopted by the parapsychology community, as the same methodology that appeared to show strong expectancy effects also appeared to show ESP in the ganzfeld psychic experiment, as Rosenthal (1986) agreed.
Since I think that the ESP literature reflects the scale of apparent effect that can be shown in the absence of a real effect, purely through publication bias, experimenter bias, optional stopping, and other questionable research practices, this makes me suspicious of the stronger claims about expectation effects.
I don’t think the sample of experiments reviewed is large enough to evaluate sample size versus effect size; throw out the outliers and there’s nothing left.
I’m now heavily concerned about the validity of the IQ test used; however, that’s more due to the 8 point increase in the control group, when no increase is expected. I’ll have to dig further, exclude any of the controls with out-of-band scores and redo the math.
One result of the meta-analysis, however, is that experimentally-induced changes to teacher expectation have a small causal effect on student performance; another result is that non-induced teacher expectations correlate well with performance in the same year, and less well with long term performance. I would rephrase that as ‘Teacher expectations of student performance in their class tend to be accurate, but correlate poorly with student performance in other classes.’
In any case, thanks for the link. I’m going to have to spend some time determining how much I should change my mind with this new evidence, but my gut feeling is that the objectively worst possible data (my own experience with performing well when expected to perform well, and performing poorly when expected to perform poorly), will continue to dominate my personal opinion on the matter.
I’m going to have to spend some time determining how much I should change my mind with this new evidence, but my gut feeling is that the objectively worst possible data (my own experience with performing well when expected to perform well, and performing poorly when expected to perform poorly), will continue to dominate my personal opinion on the matter.
I don’t think the sample of experiments reviewed is large enough to evaluate sample size versus effect size; throw out the outliers and there’s nothing left.
The first Rosenthal meta-analysis used 345 studies. That is pretty big. And the individual studies listed in table 17.1 have large n, ranging from 79 to 5000+.
I’m now heavily concerned about the validity of the IQ test used; however, that’s more due to the 8 point increase in the control group, when no increase is expected.
No, that’s not a problem that should concern you. Children’s IQ scores are less stable than older people’s scores, test-retest effects will give you a number of IQ points (that’s why one uses controls), and children are constantly growing.
What should concern you is that the researchers involved were willing to pass on and champion a result driven solely by obviously impossible nonsensical meaningless data. A kid going from 18 IQ to 122? or 113 to 211? This can’t even be explained by incompetence in failing to exclude scores from kids refusing to cooperate, because tests in general (much less the specific test they used!) are never normed from 18 to 211. (How do you get a sample big enough to norm as high as 7.4 standard deviations?)
Worrying about the controls’ gains and not the actual data is like reading a physics paper reporting that they measured the speed of several neutrinos at 50 hogsheads per millifortnight, and saying ‘Hm, yes, but are they sure they properly corrected for GPS clock skew and accurately recorded the flight time of their control photons?’
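To make the “7.4 standard deviations” arithmetic concrete, here is a minimal sketch. It assumes the conventional norm of mean 100 and standard deviation 15 and a normal distribution of scores; the constant and function names are illustrative, not anything from the Pygmalion study or the test manual.

```python
from math import erfc, sqrt

# Assumed norming convention (not taken from the test manual).
MEAN_IQ = 100.0
SD_IQ = 15.0


def z_score(iq):
    """Distance of an IQ score from the assumed mean, in standard deviations."""
    return (iq - MEAN_IQ) / SD_IQ


def upper_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2.0))


z_high = z_score(211)            # the reported post-test score of 211
z_low = z_score(18)              # the reported pre-test score of 18
p_high = upper_tail(z_high)      # chance of a score at least that high
people_needed = 1.0 / p_high     # people per one expected score that high

print(f"z(211) = {z_high:.2f}, z(18) = {z_low:.2f}")
print(f"P(score >= 211) = {p_high:.3g}")
print(f"people per expected case = {people_needed:.3g}")
```

Under those assumptions, the number of people needed to expect even one score of 211 is far larger than the world population, which is why no norming sample could cover that range.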
Unstable IQ scores should provide a net zero; an average increase of half a standard deviation across the entire population already means that the norms are fucked.
Therefore, the IQ test used simply wasn’t properly normed; if we assume that it was equally improperly normed for all students in the study, we still see an increase of 4 points based on teachers being told to expect more. Whether an increase of 4 points is statistically significant on that (improperly normed) test is a new question.
Unstable IQ scores should provide a net zero; an average increase of half a standard deviation across the entire population already means that the norms are fucked.
Only if you make the very strong assumptions that there is no systematic bias or selection effect or regression to the mean or anything which might cause the instability to favor an increase.
Plus you ignored my other points.
Plus we already know from the pairs of before-afters that these researchers are either incredibly incompetent or actively dishonest.
Plus we already know biases in analysis or design or data collection can be introduced much more subtly. Gould’s brainpacking problems are only the latest example.
Therefore, the IQ test used simply wasn’t properly normed; if we assume that it was equally improperly normed for all students in the study,
Which claim and assumption we will make because we are terminally optimistic, and to borrow from the ’90s, “I want to believe!”
we still see an increase of 4 points based on teachers being told to expect more. Whether an increase of 4 points is statistically significant on that (improperly normed) test is a new question.
Wow, you still aren’t giving up on the Pygmalion study? Just let it go already. You don’t even have to give up on your wish for self-fulfilling expectations—there are plenty of followup studies which turned in your desired significant effects.
Only if you make the very strong assumptions that there is no systematic bias or selection effect or regression to the mean or anything which might cause the instability to favor an increase.
What effects could cause an increase of 8 points on a properly normed test across the board? Why would there be a significant benefit to being in the control group of this study?
Plus we already know from the pairs of before-afters that these researchers are either incredibly incompetent or actively dishonest.
You can rule out that they were using a test which produced the scores that they recorded, perhaps by using raw score rather than normed output. You can rule out every other explanation for why the recorded results aren’t valid scores. You can even rule out that they were competently dishonest, since competent dishonesty would be nontrivial to detect; your only possible conclusion is incompetence, which isn’t evidence which should change your priors.
Incompetence is the social equivalent of the null hypothesis, and there is very rarely any significant evidence against it.
Therefore, the IQ test used simply wasn’t properly normed; if we assume that it was equally improperly normed for all students in the study,
Which claim and assumption we will make because we are terminally optimistic, and to borrow from the ’90s, “I want to believe!”
Assuming only incompetence as you have, the expected result would be equally erratic for all students. You can assign any likelihood to the assumption that the incompetence was the primary factor and that dishonesty doesn’t modify it significantly, but you have already concluded systemic incompetent dishonesty across a large number of studies.
Wow, you still aren’t giving up on the Pygmalion study? Just let it go already. You don’t even have to give up on your wish for self-fulfilling expectations—there are plenty of followup studies which turned in your desired significant effects.
As you say, it’s been confirmed by other studies. I’m not insisting that a particular study was done correctly, I’m explaining why their conclusions being true is consistent with the errors in their study. (Which means that a study with those flaws would be expected to reach the same conclusions, if those conclusions were true)
What effects could cause an increase of 8 points on a properly normed test across the board? Why would there be a significant benefit to being in the control group of this study?
I already gave you three separate explanations for why an increase is possible, even in controls.
your only possible conclusion is incompetence, which isn’t evidence which should change your priors. Incompetence is the social equivalent of the null hypothesis, and there is very rarely any significant evidence against it.
I have no idea what you mean by this, and I think that if one accepts their incompetence, the best thing to do is to ignore their data as having been poisoned in unknown ways—maliciousness, ideology, and stupidity often being difficult to tell apart.
Assuming only incompetence as you have, the expected result would be equally erratic for all students.
Why is that? IQ interventions almost universally fail (our prior for any result like ‘we increased IQ by 8 points’ ought to be very low, as in, well below 1%, because hundreds of interventions have failed to pan out and 8 points is astounding and practically on the level of iodization), and the followups confirm that there is only a much, much smaller effect. So the competent result is no effect or a small one. Any incompetence is going to lead to an extreme result. Like what they found.
As you say, it’s been confirmed by other studies.
‘Confirmed’? Well, this is an active debate as to what counts as a replication. Near the same magnitude or just having the same sign? If someone publishes a study claiming to find a weight loss drug that will drop 100 pounds, and exhaustive replications find that the true estimate is actually 1 pound, has the original claim been “confirmed”? After all, both estimates are non-zero and both estimates have the same sign...
So, “systematic bias or selection effect or regression to the mean” can result in average properly normed IQ scores increasing by 8 points? Doesn’t the normalizing process (when done properly) force the average score to remain constant?
Doesn’t the normalizing process (when done properly) force the average score to remain constant?
What normalizing process? You mean the one the paid psychometricians go through years before any specific test is purchased by researchers like the ones doing the Pygmalion study? Yeah, I suppose so, but that’s irrelevant to the discussion.
Right- because the entire population going up half a SD in a year isn’t unusual at all, and the test purchased for use in this study was normalized the way one would expect it to be, despite the fact that it had results that are impossible if it was normalized in that manner.
Alright, I have to admit I have no idea what test you are now referring to. I thought we were discussing the Pygmalion results in which a small sample of elementary school students turned in increased IQ scores, which could be explained by a number of well-known and perfectly ordinary processes.
But it seems like you’re talking about something else completely and may be thinking of country-level Flynn effects or something, I have no idea what.
The PitC study showed an 8 point IQ increase in the control group. You offered those three explanations and said that they explained why that wasn’t particularly unusual, and my understanding of normed IQ tests is that they are expected to remain constant over short times.
If the control group isn’t at least representative, there is a different methodology flaw. If the confounding factor of prior IQ tests wasn’t measured, given that there is apparently a significant increase in scores on the first retest (and presumably a diminishing increase in scores at some point; the expected result of taking the test very many times isn’t to become the highest scorer ever), there is an unaccounted confounding factor.
I’m still trying to figure out what questions to ask before I dig up as much primary source as I can. Is “points of normed IQ” the right thing to measure? That would imply that going from an IQ of 140 to 152 is equally as much a gain as going from 94 to 106. Is raw score the right thing to measure? That would imply that going from being able to answer 75% of the questions accurately to 80% is equally as much gain as going from 25% to 30%. Is the percentage decrease in incorrect answers the correct metric? 75%-80% would be the same as 25%-40%. The percentage increase in correct answers? 25%-30% (20% increase) would be equivalent to 75%-90%.
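The candidate metrics in that paragraph can be checked with a few lines of arithmetic. This is only a sketch: scores are treated as the fraction of questions answered correctly, and the function names are made up for illustration.

```python
from math import isclose


def point_gain(before, after):
    """Raw gain in the fraction of questions answered correctly."""
    return after - before


def relative_gain_in_correct(before, after):
    """Percentage increase in correct answers, as a fraction."""
    return (after - before) / before


def relative_drop_in_errors(before, after):
    """Percentage decrease in incorrect answers, as a fraction."""
    return ((1.0 - before) - (1.0 - after)) / (1.0 - before)


# The score pairs mentioned in the comment above.
pairs = [(0.75, 0.80), (0.25, 0.30), (0.25, 0.40), (0.75, 0.90)]
for before, after in pairs:
    print(
        f"{before:.2f} -> {after:.2f}: "
        f"points {point_gain(before, after):+.2f}, "
        f"correct {relative_gain_in_correct(before, after):+.1%}, "
        f"errors {-relative_drop_in_errors(before, after):+.1%}"
    )
```

Each metric ranks the same four changes differently, which is the point of the question: the choice of scale decides which gains count as “equal”.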
I’m still reluctant to accept class grades and state-mandated graduation test scores as measuring primarily intelligence or even mastery of the material, rather than the specific skill of taking the test. That makes my error bars larger than those of someone who does accept them as accurate measurements of something important.
Is “points of normed IQ” the right thing to measure?
No, usually in these cases you will be using an effect size like Cohen’s d: expressing the difference in standard deviations (on the raw score) between the two groups. You can convert it back to IQ points if you want; if you discover a d of 1.0, that’s boosting scores by 1 standard deviation which is usually defined as something like 15 IQ points, and so on.
So if you have your standard paradigmatic experiment (an equal number of controls and experimentals, the two groups having exactly the same beginning mean IQ and standard deviation of the scores), you’d do your intervention, do a retest of IQ, and your effect size would be ‘(IQ(bigger) - IQ(smaller)) / standard deviation of experimentals & controls’. Some of the things that these approaches do:
test-retest concerns disappear, because you’re not looking at the difference between the first test and the second test within groups, but just the difference in the second test between groups. Did the practice effect give them all 1 point, 5 points, 10 points? Doesn’t matter, as long as it applies to both groups equally and their pre-tests were also equal. The first test is there to make sure you aren’t accidentally picking a group of geniuses and a group of dunces and that the two groups started off equivalent. (Fun fact: the single strongest effect in my n-back meta-analysis is from when a group on the pre-test answered like 4 questions more than any of the others; even though their score dropped on the post-test, because the assumption that the groups were equivalent is built into the meta-analysis, they still look like n-back had an effect size of like d=3 or something crazy like that.)
you’re not converting to IQ points, but using the raw score. This avoids the discreteness issue (suppose the test has 10 questions on it. What does it then mean to convert scores on it to its normed range of 70-130 IQ or whatever? Getting even a single additional question right is worth something like 6 points!)
you avoid the issues of IQ points being ‘worth’ different amounts at different parts of the range. Suppose you took a bunch of IQ 130 kids and did something to boost their scores by 5 points. Is this easier, as hard, or harder than taking a bunch of IQ 100 kids and boosting them 5 points? If there’s any differences, we might expect to see them reflected in the standard deviation being larger or narrower, and so this’ll be reflected in our effect size.
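The effect-size calculation described above can be sketched as follows. The raw scores are invented for illustration, and the pooled-standard-deviation form of Cohen’s d is one common choice rather than necessarily the one any particular paper used.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(experimental, control):
    """Difference in post-test means, in units of the pooled standard deviation."""
    n_e, n_c = len(experimental), len(control)
    pooled_var = (
        (n_e - 1) * variance(experimental) + (n_c - 1) * variance(control)
    ) / (n_e + n_c - 2)
    return (mean(experimental) - mean(control)) / sqrt(pooled_var)


# Hypothetical post-test raw scores (not data from any study).
experimental = [24, 27, 30, 31, 33]
control = [22, 25, 27, 29, 32]

d = cohens_d(experimental, control)

# A practice effect that adds the same number of raw points to everyone
# shifts both means equally and leaves d unchanged.
practice = 5
d_with_practice = cohens_d(
    [x + practice for x in experimental],
    [x + practice for x in control],
)

# Converting back to IQ points, assuming 1 SD = 15 points.
iq_points = d * 15

print(f"d = {d:.3f}, with practice effect = {d_with_practice:.3f}")
print(f"equivalent IQ points = {iq_points:.1f}")
```

Because only the between-group difference on the second test enters the formula, the size of any shared test-retest gain drops out, as the first point in the list says.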
Effect sizes are also the sine qua non of meta-analyses, so by thinking in effect sizes, you can more easily run a meta-analysis yourself if you want (like my own dual n-back meta-analysis on a widely-touted intervention which is supposed to increase IQ), you can interpret meta-analyses better, and you can draw on previous meta-analyses as priors (example: Jaeggi et al 2008 found n-back had an effect size on IQ of something like d=0.8. If one had seen one of the psychology-wide compilations of previous meta-analyses, one would know that replicated & verified effect sizes that large are pretty rare in every area of psychology, and so it was highly likely that their result was being overstated somehow, as indeed it turned out to have been overstated due to a use of passive control groups, and the current best estimate is closer to half that size or d=0.4).
I’m still reluctant to accept class grades and state-mandated graduation test scores as measuring primarily intelligence or even mastery of the material, rather than the specific skill of taking the test.
If IQ is the main cause of getting high class grades and passing cutoffs on tests and being able to learn test-taking skills (like learning any other skill), then couldn’t the tests be measuring all of them simultaneously?
… For some reason I thought the first test was used to evenly distribute performance on the pretest between the two groups. Aren’t the control and experimental groups supposed to be as close to identical as possible, and to help analysis identify which subgroups, if any, had effects different from other subgroups? If an intervention showed significantly different results for tall people than for short people, then a study of that intervention on people based on height may be indicated.
Aren’t the control and experimental groups supposed to be as close to identical as possible, and to help analysis identify which subgroups, if any, had effects different from other subgroups?
Ideally, yes, but if you shuffle people around, you’re not necessarily doing yourself any favors. (I think. This seems to be related to an old debate in experimental design going back to Gosset and Fisher over ‘balanced’ versus ‘randomized’ designs, which I don’t understand very well.)
If an intervention showed significantly different results for tall people than for short people, then a study of that intervention on people based on height may be indicated.
This is part of the randomized vs balanced design debate. Suppose tall people did better, but you just randomly allocated people; with a small sample like, say, 10 total and 5 in each, you would expect to wind up with different numbers of tall people in your control and experimentals (eg a 4-1 split of 5 tall people), and now that may be driving the difference. If you were using a large sample like 5000 people, then you’d expect the random allocation to be very even between the two groups of 2500.
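How often plain randomization produces a lopsided split can be computed exactly from the hypergeometric distribution. In this sketch the 10-person case follows the example above (5 tall, 5 short); the assumption that half of the 5000-person sample is tall is mine, and the function name is illustrative.

```python
from math import comb


def prob_split_at_least(n_tall, n_short, group_size, majority):
    """Probability that one of the two groups receives at least `majority`
    of the tall people when `group_size` people are drawn at random for
    the experimental group and everyone else becomes a control."""
    favourable = 0
    lowest = max(0, group_size - n_short)
    highest = min(n_tall, group_size)
    for k in range(lowest, highest + 1):  # k = tall people in experimental
        if k >= majority or n_tall - k >= majority:
            favourable += comb(n_tall, k) * comb(n_short, group_size - k)
    return favourable / comb(n_tall + n_short, group_size)


# 10 people, 5 tall: chance of a 4-1 (or 5-0) split of the tall people.
small = prob_split_at_least(5, 5, 5, 4)

# 5000 people, assumed 2500 tall: chance of an equally lopsided 80/20 split.
large = prob_split_at_least(2500, 2500, 2500, 2000)

print(f"small sample: {small:.3f}")
print(f"large sample: {large:.3g}")
```

With the small sample, a split at least that uneven turns up in roughly one allocation in five; with the large sample it is negligible.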
If you specify in advance that tall people are a possibility, you can try to ‘balance’ the groups by additional steps: for example, you might randomize short people as usual, but block (randomize) pairs of tall people—if heads, the guy on the left is in the experimental and right in control, if tails, other way around—where by definition you get an even split of tall people (and maybe 1 guy left over). This is fine, sensible, and efficient use of your sample, and if you’re testing additional hypotheses like ‘tall people score better, even on top of the intervention’, you’ll take appropriate measures like increasing your sample size to reach your desired statistical power / alpha parameters. No problems there.
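A minimal sketch of the pair-blocking idea described above: tall people are paired off and a coin flip decides which member of each pair is treated. Assigning the short people and any leftover tall person by independent coin flips is my assumption for “randomize as usual”, and all names are hypothetical.

```python
import random


def blocked_assignment(tall, short, rng):
    """Assign people to (experimental, control), blocking on height."""
    experimental, control = [], []
    tall = list(tall)
    rng.shuffle(tall)  # random pairing of tall people

    # Each pair of tall people contributes one member to each group.
    for i in range(0, len(tall) - 1, 2):
        left, right = tall[i], tall[i + 1]
        if rng.random() < 0.5:  # heads: left is treated
            experimental.append(left)
            control.append(right)
        else:                   # tails: the other way around
            experimental.append(right)
            control.append(left)

    # An odd tall person out, and all short people, get plain coin flips.
    leftovers = tall[len(tall) - len(tall) % 2:]
    for person in leftovers + list(short):
        (experimental if rng.random() < 0.5 else control).append(person)
    return experimental, control


tall_people = [f"T{i}" for i in range(1, 6)]   # 5 tall people
short_people = [f"S{i}" for i in range(1, 6)]  # 5 short people
experimental, control = blocked_assignment(
    tall_people, short_people, random.Random(0)
)

tall_in_exp = sum(p in tall_people for p in experimental)
tall_in_ctl = sum(p in tall_people for p in control)
print(f"tall in experimental: {tall_in_exp}, tall in control: {tall_in_ctl}")
```

Whatever the coin flips do, the tall people can differ between groups by at most the one leftover person, which is the “even split by definition” mentioned above.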
But any post hoc analysis can be abused. If after you run your study you decide to look at how tall people did, you may have an unbalanced split driving any result, you’re increasing how many hypotheses you’re testing, and so on. Post hoc analyses are untrustworthy and suspicious; here’s an example where a post hoc analysis was done: http://lesswrong.com/lw/68k/nback_news_jaeggi_2011_or_is_there_a/
I was saying that if there was any reason to suspect height might be a factor, then height should be added to the factors considered when trying to make the groups indistinguishable from each other. If height isn’t suspected to be a factor, adding height to those factors with a low weight does almost no harm to the rest of the distribution.
Is there any excuse for the measured variable to notably differ between the control and experimental groups in a well-executed experiment?
I was saying that if there was any reason to suspect height might be a factor, then height should be added to the factors considered when trying to make the groups indistinguishable from each other.
In a perfect world, perhaps. But every variable is more effort, and you need to do it from the start or else you might wind up screwing things up (imagine processing people one by one over a few weeks and starting their intervention, and half-way through, noticing that height is differing between the groups...?)
Is there any excuse for the measured variable to notably differ between the control and experimental groups in a well-executed experiment?
If you didn’t balance them, it may easily happen. And the more variables that describe each person, the more likely the groups will be unbalanced by some variable. People are complex like that. If you’re interested in the topic, I’ve already pointed you at the Wikipedia articles, but you could also check out Ziliak’s papers.
I see where gathering information about all participants before starting the intervention might not be possible. It should still be possible to maximize balance with each batch added, but that means a tradeoff between balancing each batch and balancing the experiment as a whole. For a given experiment, we would have to decide the relative likelihood of a confounding variable in the batches versus a confounding variable in the demographics.
The undetected confounding variable is always a possibility. That doesn’t mean that we can’t or shouldn’t do as much about it as the expected gains justify against the expected costs, and doing some really complicated math to divide the sample into two groups isn’t much more expensive than collecting the data to go into it.
Upvoted for candor.
...‘entire population’?
Alright, I have to admit I have no idea what test you are now referring to. I thought we were discussing the Pygmalion results in which a small sample of elementary school students turned in increased IQ scores, which could be explained by a number of well-known and perfectly ordinary processes.
But it seems like you’re talking about something else entirely and may be thinking of country-level Flynn effects or something, I have no idea what.
The PitC study showed an 8 point IQ increase in the control group. You offered those three explanations and said that they explained why that wasn’t particularly unusual, and my understanding of normed IQ tests is that they are expected to remain constant over short times.
Over the general average population when tested once, yes. But the control group is neither general nor average nor the population nor tested once.
If the control group isn’t at least representative, there is a different methodological flaw. If the confounding factor of prior IQ tests wasn’t measured, given that there is apparently a significant increase in scores on the first retest (and presumably a diminishing increase in scores at some point; the expected result of taking the test very many times isn’t to become the highest scorer ever), there is an unaccounted-for confounding factor.
I’m still trying to figure out what questions to ask before I dig up as much primary source as I can. Is “points of normed IQ” the right thing to measure? That would imply that going from an IQ of 140 to 152 is equally as much a gain as going from 94 to 106. Is raw score the right thing to measure? That would imply that going from being able to answer 75% of the questions accurately to 80% is equally as much gain as going from 25% to 30%. Is the percentage decrease in incorrect answers the correct metric? 75%-80% would be the same as 25%-40%. The percentage increase in correct answers? 25%-30% (20% increase) would be equivalent to 75%-90%.
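To make the comparison concrete, here is a minimal sketch that computes the candidate metrics for the score pairs mentioned above. The function name `gain_metrics` and the dictionary keys are made up for illustration; nothing here says which metric is the right one.

```python
def gain_metrics(before, after):
    """before/after are the fraction of questions answered correctly (0 to 1)."""
    wrong_before = 1 - before
    wrong_after = 1 - after
    return {
        # raw score: percentage points gained
        "raw_gain": after - before,
        # percentage increase in correct answers
        "relative_increase_in_correct": (after - before) / before,
        # percentage decrease in incorrect answers
        "relative_decrease_in_incorrect": (wrong_before - wrong_after) / wrong_before,
    }

# The four score pairs from the paragraph above.
for before, after in [(0.75, 0.80), (0.25, 0.30), (0.25, 0.40), (0.75, 0.90)]:
    m = gain_metrics(before, after)
    print(f"{before:.0%} -> {after:.0%}: "
          f"raw {m['raw_gain']:+.2f}, "
          f"correct {m['relative_increase_in_correct']:+.0%}, "
          f"incorrect {-m['relative_decrease_in_incorrect']:+.0%}")
```

The same pair of scores can look like a large or a small gain depending on which row of the dictionary you read, which is the ambiguity the question is about.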
I’m still reluctant to accept class grades and state-mandated graduation test scores as measuring primarily intelligence or even mastery of the material, rather than the specific skill of taking the test. That makes my error bars larger than those of someone who does accept them as accurate measurements of something important.
No, usually in these cases you will be using an effect size like Cohen’s d: expressing the difference in standard deviations (on the raw score) between the two groups. You can convert it back to IQ points if you want; if you discover a d of 1.0, that’s boosting scores by 1 standard deviation which is usually defined as something like 15 IQ points, and so on.
So if you have your standard paradigmatic experiment (an equal number of controls and experimentals, the two groups having exactly the same beginning mean IQ and standard deviation of the scores), you’d do your intervention, do a retest of IQ, and your effect size would be ‘(IQ(bigger) - IQ(smaller)) / standard deviation of experimentals & controls’. Some of the things that these approaches do:
test-retest concerns disappear, because you’re not looking at the difference between the first test and the second test within groups, but just the difference in the second test between groups. Did the practice effect give them all 1 point, 5 points, 10 points? Doesn’t matter, as long as it applies to both groups equally and their pre-tests were also equal. The first test is there to make sure you aren’t accidentally picking a group of geniuses and a group of dunces and that the two groups started off equivalent. (Fun fact: the single strongest effect in my n-back meta-analysis is from when a group on the pre-test answered like 4 questions more than any of the others; even though their score dropped on the post-test, because the assumption that the groups were equivalent is built into the meta-analysis, they still look like n-back had an effect size of like d=3 or something crazy like that.)
you’re not converting to IQ points, but using the raw score. This avoids the discreteness issue (suppose the test has 10 questions on it. What does it then mean to convert scores on it to its normed range of 70-130 IQ or whatever? getting even a single additional question right is worth 10 points!)
you avoid the issues of IQ points being ‘worth’ different amounts at different parts of the range. Suppose you took a bunch of IQ 130 kids and did something to boost their scores by 5 points. Is this easier, as hard, or harder than taking a bunch of IQ 100 kids and boosting them 5 points? If there are any differences, we might expect to see them reflected in the standard deviation being larger or narrower, and so this’ll be reflected in our effect size.
Effect sizes are also the sine qua non of meta-analyses, so by thinking in effect sizes, you can more easily run a meta-analysis yourself if you want (like my own dual n-back meta-analysis on a widely-touted intervention which is supposed to increase IQ), you can interpret meta-analyses better, and you can draw on previous meta-analyses as priors (example: Jaeggi et al 2008 found n-back had an effect size on IQ of something like d=0.8. If one had seen one of the psychology-wide compilations of previous meta-analyses, one would know that replicated & verified effect sizes that large are pretty rare in every area of psychology, and so it was highly likely that their result was being overstated somehow, as indeed it turned out to have been overstated due to a use of passive control groups, and the current best estimate is closer to half that size or d=0.4).
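As a rough sketch of the effect size calculation described above: the raw post-test scores below are invented for illustration, and the pooled standard deviation is one common convention for "standard deviation of experimentals & controls", assumed here rather than taken from any particular study.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(experimental, control):
    """Difference in post-test means, in units of the pooled standard deviation."""
    n1, n2 = len(experimental), len(control)
    pooled_var = ((n1 - 1) * variance(experimental)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(experimental) - mean(control)) / sqrt(pooled_var)

# Hypothetical raw post-test scores (number of questions answered correctly).
control = [20, 22, 25, 27, 30]
experimental = [23, 25, 28, 30, 33]

d = cohens_d(experimental, control)
print(f"d = {d:.2f}, or about {d * 15:.1f} IQ points if 1 SD is taken as 15")

# A practice effect that adds the same amount to everyone leaves d unchanged.
practice = 5
d_with_practice = cohens_d([x + practice for x in experimental],
                           [x + practice for x in control])
print(f"d with a uniform practice effect = {d_with_practice:.2f}")
```

The second calculation is the point about test-retest concerns: a bump shared equally by both groups cancels out of the between-group difference.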
If IQ is the main cause of getting high class grades and passing cutoffs on tests and being able to learn test-taking skills (like learning any other skill), then couldn’t the tests be measuring all of them simultaneously?
… For some reason I thought the first test was used to evenly distribute performance on the pretest between the two groups. Aren’t the control and experimental groups supposed to be as close to identical as possible, and to help analysis identify which subgroups, if any, had effects different from other subgroups? If an intervention showed significantly different results for tall people than for short people, then a study of that intervention on people based on height may be indicated.
That’s carryover from a different branch, sorry.
Ideally, yes, but if you shuffle people around, you’re not necessarily doing yourself any favors. (I think. This seems to be related to an old debate in experimental design going back to Gosset and Fisher over ‘balanced’ versus ‘randomized’ designs, which I don’t understand very well.)
This is part of the randomized vs balanced design debate. Suppose tall people did better, but you just randomly allocated people; with a small sample like, say, 10 total and 5 in each, you would expect to wind up with different numbers of tall people in your control and experimentals (eg a 4-1 split of 5 tall people), and now that may be driving the difference. If you were using a large sample like 5000 people, then you’d expect the random allocation to be very even between the two groups of 2500.
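A small sketch of how likely such a lopsided split is under plain random allocation, computed exactly from the hypergeometric distribution. The function name and the choice of "at least as lopsided as 4-1" as the threshold are assumptions made for illustration.

```python
from math import comb

def prob_lopsided(n_total, n_tall, n_experimental, min_gap):
    """Probability that the tall-person counts in the two groups differ by at
    least min_gap when n_experimental people are drawn at random from n_total."""
    n_short = n_total - n_tall
    ways = comb(n_total, n_experimental)
    p = 0.0
    lowest = max(0, n_experimental - n_short)
    highest = min(n_tall, n_experimental)
    for tall_in_exp in range(lowest, highest + 1):
        tall_in_ctrl = n_tall - tall_in_exp
        if abs(tall_in_exp - tall_in_ctrl) >= min_gap:
            p += (comb(n_tall, tall_in_exp)
                  * comb(n_short, n_experimental - tall_in_exp)) / ways
    return p

# 10 people, 5 of them tall, 5 per group: a 4-1 split or worse (gap >= 3).
small = prob_lopsided(10, 5, 5, 3)
print(f"small sample: {small:.3f}")

# 5000 people, 2500 tall, 2500 per group: the same 80/20 proportion (gap >= 1500).
large = prob_lopsided(5000, 2500, 2500, 1500)
print(f"large sample: {large:.3g}")
```

With the small sample the lopsided split happens roughly one time in five; with the large sample the same proportional imbalance is so improbable that it is effectively never seen.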
If you specify in advance that tall people are a possibility, you can try to ‘balance’ the groups by additional steps: for example, you might randomize short people as usual, but block (randomize) pairs of tall people—if heads, the guy on the left is in the experimental and right in control, if tails, other way around—where by definition you get an even split of tall people (and maybe 1 guy left over). This is fine, sensible, and efficient use of your sample, and if you’re testing additional hypotheses like ‘tall people score better, even on top of the intervention’, you’ll take appropriate measures like increasing your sample size to reach your desired statistical power / alpha parameters. No problems there.
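The pairing scheme can be sketched as follows. This is only an illustration: the participant labels are made up, and "randomize short people as usual" is implemented here as a shuffle followed by an even split, which is one of several reasonable readings.

```python
import random

def blocked_allocation(tall, short, rng):
    """Coin-flip within pairs of tall people; randomize everyone else as usual."""
    tall = list(tall)
    rng.shuffle(tall)
    # An odd tall person out is randomized along with the short people.
    leftover = [tall.pop()] if len(tall) % 2 else []
    experimental, control = [], []
    for left, right in zip(tall[0::2], tall[1::2]):
        if rng.random() < 0.5:          # heads: left goes to experimental
            experimental.append(left)
            control.append(right)
        else:                           # tails: the other way around
            experimental.append(right)
            control.append(left)
    rest = list(short) + leftover
    rng.shuffle(rest)
    half = len(rest) // 2
    experimental.extend(rest[:half])
    control.extend(rest[half:])
    return experimental, control

tall = [f"tall_{i}" for i in range(6)]
short = [f"short_{i}" for i in range(4)]
exp_group, ctrl_group = blocked_allocation(tall, short, random.Random(0))
print("experimental:", sorted(exp_group))
print("control:     ", sorted(ctrl_group))
```

Whatever the coin flips turn out to be, each group ends up with the same number of tall people (give or take the one left over), which is the balance that plain randomization cannot guarantee in a small sample.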
But any post hoc analysis can be abused. If after you run your study you decide to look at how tall people did, you may have an unbalanced split driving any result, you’re increasing how many hypotheses you’re testing, and so on. Post hoc analyses are untrustworthy and suspicious; here’s an example where a post hoc analysis was done: http://lesswrong.com/lw/68k/nback_news_jaeggi_2011_or_is_there_a/
I was saying that if there was any reason to suspect height might be a factor, then height should be added to the factors considered when trying to make the groups indistinguishable from each other. If height isn’t suspected to be a factor, adding height to those factors with a low weight does almost no harm to the rest of the distribution.
Is there any excuse for the measured variable to notably differ between the control and experimental groups in a well-executed experiment?
In a perfect world, perhaps. But every variable is more effort, and you need to do it from the start or else you might wind up screwing things up (imagine processing people one by one over a few weeks and starting their intervention, and half-way through, noticing that height is differing between the groups...?)
If you didn’t balance them, it may easily happen. And the more variables that describe each person, the more likely the groups will be unbalanced by some variable. People are complex like that. If you’re interested in the topic, I’ve already pointed you at the Wikipedia articles, but you could also check out Ziliak’s papers.
I see where gathering information about all participants before starting the intervention might not be possible. It should still be possible to maximize balance with each batch added, but that means a tradeoff between balancing each batch and balancing the experiment as a whole. For a given experiment, we would have to decide the relative likelihood that there would be a confounding variable in the batches versus a confounding variable in the demographics.
The undetected confounding variable is always a possibility. That doesn’t mean that we can’t or shouldn’t do as much about it as the expected gains justify given the expected costs, and doing some really complicated math to divide the sample into two groups isn’t much more expensive than collecting the data to go into it.