My primary objection is: perhaps some of the students in both groups got smarter (these are 8-9-year-olds, still developing) for reasons independent of the interventions, which caused them to improve on the n-back training task AND on the other intelligence tests (fluid intelligence, Gf). If you separated the “active control” group into high and low improvers post hoc, just as was done for the n-back group, you might see that the active control “high improvers” are even smarter than the n-back “high improvers”. We should expect some 8-9-year-olds to improve in intelligence or motivation over the course of a month or two, without any intervention.
Basically, this result sucks, because of the artificial post-hoc division into high- and low-responders to n-back training, which was needed to show a strong “effect”. I’m not certain the effect is artificial; I’d have to spend a lot of time doing some kind of sampling to show how well the data is explained by my alternative hypothesis.
It’s definitely legitimate to look at the whole n-back group vs. the whole active control group. Those results aren’t impressive at all. I just can’t give any credit for the post-hoc division, because I don’t know how to properly penalize it and it’s clearly self-serving for Jaeggi. It’s borderline deceptive that the graphs don’t show the unsplit n-back population.
It’s unsurprising (and probably offers no evidence against my explanation) that the initial average n-back score for the low improvers is higher than the initial average for the high improvers; this is ordinary regression toward the mean, exactly what you’d expect if you split a set of paired samples drawn from the same distribution with no change at all.
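A quick simulation (my own sketch, not the paper’s data) shows the selection effect: with pre and post scores drawn independently from the same distribution, a post-hoc split on “improvement” makes the low improvers start higher and hands the high improvers an apparent gain.

```python
import random

random.seed(0)
n = 200  # hypothetical group size; the effect holds at any size

# No true change: pre and post are independent draws from the same distribution.
pairs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

# Post-hoc split on improvement (post - pre), mimicking the high/low-responder split.
pairs.sort(key=lambda p: p[1] - p[0])
low, high = pairs[:n // 2], pairs[n // 2:]

def mean(xs):
    return sum(xs) / len(xs)

low_pre, high_pre = mean([p[0] for p in low]), mean([p[0] for p in high])
low_gain = mean([p[1] - p[0] for p in low])
high_gain = mean([p[1] - p[0] for p in high])

# The "low improvers" start higher, and the "high improvers" show a positive
# "gain", purely from sorting on noise.
print(f"low:  pre={low_pre:+.2f}  gain={low_gain:+.2f}")
print(f"high: pre={high_pre:+.2f}  gain={high_gain:+.2f}")
```

With no intervention anywhere, the split manufactures both the “responder” effect and the low improvers’ higher starting scores.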
Also, on pg 2/6, I don’t understand how the t statistics line up with the group sizes.
The groups are ((16 high improvement + 16 low improvement) + 30 control), so why is it (15), t(15), t(30), and then later t(16)? Does t(n) not mean that it’s a t statistic over a population of n? I’m guessing it does. I assume the t is an unpaired Student’s t-test, which of course assumes the distributions being compared are normal. I’m not sure that’s demonstrated, but it may be obvious to experts (it’s not to me).
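For reference, this is the textbook unpaired equal-variance (Student’s) t statistic the comment is assuming; a generic formula, not code from the paper.

```python
import math

def unpaired_t(xs, ys):
    """Student's two-sample t statistic (equal variances assumed) and its df."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    # Pool the squared deviations from both samples.
    ss = sum((x - mx) ** 2 for x in xs) + sum((y - my) ** 2 for y in ys)
    df = nx + ny - 2
    pooled_var = ss / df
    t = (mx - my) / math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, df

# Identical samples give t = 0; note the df is n1 + n2 - 2, not a sample size.
print(unpaired_t([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # (0.0, 4)
```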
Disclaimer: I did dual n-back for a month or so, and got stuck at 5. I haven’t resumed, though I may do so in the future.
You are way too underconfident. If an intervention is equally likely to raise or lower the score with respect to the control group, without increasing variation, it does nothing.
When you say that the aggregate results “aren’t impressive,” you imply that they are positive, but if I read table 1 correctly, the aggregate results are often negative.
(By the way, the “active control” group practiced vocab and trivia, which should have no overlap with what’s tested by the SPM and TONI, which are completely nonverbal.)
You’re right. I didn’t actually locate and compare the unsplit numbers from table 1; I just visually estimated (from the pretty bar chart, Fig 4) the average of the two n-back subgroups, since they’re equal-sized. It looks like the n-backers (compared to the trivia/vocab studiers) showed a non-significantly better improvement in the short term, and a non-significantly worse improvement in the long term.
I’m also puzzled as to why there’s no passive control. Even though there’s no obvious overlap between vocabulary/trivia learning and the SPM/TONI, I’d expect some generalized training effect, at least in motivation/focus.
I guess my overall view of the evidence is: don’t expect single n-back to do much better than any other form of same-effort mental exercise, for any purpose except the exact task trained.
There’s no passive control because there are only 62 kids. Only spend as many kids as it takes to publish.
I would not expect a generalized training effect. Almost nothing exhibits cross-task transfer of training. People are excited about n-back because it is the one task that is said to.
If you believed single n-back was going to definitively beat the active control, then you wouldn’t pay for a passive control. I buy that. But now that it hasn’t, it’s worth adding a passive control.
Some apparently arbitrarily chosen training task (vocabulary and trivia memorization) exhibited just as much generalized training as single n-back. In your interpretation, then, neither had any generalized benefit: the improvement is just normal development in ~9-year-old children over that timespan.
I do recall hearing some credible evidence that dual n-back (in whatever configuration was used in some older Jaeggi study) gave a boost to “fluid intelligence” (hence the interest in the topic). But now I’m inclined to mistrust Jaeggi more than I would the average influential researcher.
That’s unfair. Getting 62 kids for this study must have been difficult. You don’t know what the costs would have been to get a few dozen more.
I said “spend kids,” so the cost of acquiring them is irrelevant. I’m sure they’re expensive, so I take their number as fixed. If there were half as many studies, each with twice as many subjects, they would be much more valuable. But they wouldn’t be publishable, because they’d all have negative results.
Not usually. Numbers in brackets after a well-known statistic normally give parameters of that statistic’s distribution; for a t-test, the bracketed number is the degrees of freedom, which might be one less than the sample size (for a one-sample t-test) or two less than the sum of the sample sizes (for an equal-variance two-sample t-test).
(Disclaimer: I haven’t read the paper.)
[Edited for unambiguity.]
Yes, that sounds familiar. Thanks.
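Those degrees-of-freedom conventions account for some of the puzzling numbers. With subgroups of 16 and a control group of 30, the plausible tests give the following df (my guesses at which test produced which reported value; the paper doesn’t say):

```python
# Degrees of freedom under the conventions described above.

def df_one_sample(n):
    """One-sample or paired t-test: df = n - 1."""
    return n - 1

def df_two_sample(n1, n2):
    """Equal-variance two-sample t-test: df = n1 + n2 - 2."""
    return n1 + n2 - 2

# The study's group sizes: 16 high improvers, 16 low improvers, 30 controls.
print(df_one_sample(16))      # 15 -> e.g. a paired pre/post test within one subgroup
print(df_two_sample(16, 16))  # 30 -> e.g. comparing the two n-back subgroups
print(df_two_sample(16, 30))  # 44 -> comparing a subgroup with the controls
```

So t(15) and t(30) fall out naturally; the later t(16) doesn’t obviously fit either rule for these group sizes.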