Looking for people to help me get better psychometrics: the variance in my dataset is huge, and my tests mostly stop working at 3 SDs of IQ. I’d love to have one or two more comprehensive tests that stay sensitive up to 5 SDs.
A friend of mine made https://quantified-mind.appspot.com/ for measuring experiments like this (I helped with the website). It sounds like a good fit for what you’re doing. You can create an experiment, invite subjects to it, and have them test daily, at the same time of day, for perhaps 5-15 minutes a day, for at least a few weeks. Ideally you cycle the subjects in and out of the experimental condition multiple times, so that each subject’s off-protocol periods serve as the control, rather than using other people as controls, because the interpersonal variance is so high.
… not taking into account memorization happening on the IQ tests
Practice effects on cognitive testing are high, to the point where gains from practice usually dominate gains from interventions until tens of hours of practice for most tests. This effect is higher the more complicated the test: practice effects attenuate faster with simple reaction time than with choice reaction time than with Stroop than with matrices than with SATs. This means you typically want to test on low-level basic psychometric tests, have subjects practice all the tests quite a bit before you start measuring costly interventions, and include time or test number as one of the variables you’re analyzing.
Apart from practice, the biggest typical confounders are things like caffeine/alcohol, time of day, amount of sleep, and timing of meals, so you’d either want to hold those variables constant or make sure they’re measured as part of your experiment.
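To make that concrete, here is a minimal analysis sketch (my own, not anything QM ships): a mixed-effects regression that treats practice and the confounders above as covariates. The file and every column name are hypothetical stand-ins for whatever your export looks like.
```python
# Minimal sketch: estimate an intervention effect while controlling for
# practice and everyday confounders. File and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("sessions.csv")  # one row per subject per test session

# Practice gains are roughly logarithmic in the number of sessions taken.
df["log_session"] = np.log(df["session_number"])

# A random intercept per subject absorbs the large interpersonal variance;
# the covariates absorb practice, sleep, caffeine, and time-of-day effects.
model = smf.mixedlm(
    "score ~ on_protocol + log_session + hours_slept + caffeine_mg + C(time_of_day)",
    data=df,
    groups=df["subject_id"],
)
print(model.fit().summary())  # the on_protocol coefficient is the effect estimate
```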
These are my recollections from what we learned–my friend did most of the actual experiments and knows much more. If you want to go deep on experimental design, I can ask him.
I would love to help your friend set up a backup system at a secondary datacenter if the data is all in one datacenter!
Awesome! It’s on an old version of Google App Engine, so it’s not very vulnerable to that form of data loss, but it is very vulnerable to code rot and needs to be migrated. (It was originally running on quantified-mind.com, but he hasn’t thought about it in a long time and let the domain expire.)
Is that upgrade process something you could help with? The underlying platform is pretty good, and he put a lot of time into adapting gold-standard psychometric tests in a way that allows for easy, powerful Quantified-Self-style experimentation, but the project doesn’t have a maintainer.
Yes, I can help with that. Will DM.
The problem with DIY tests is that they have no external validation. During my initial experiment I actually had a 5-minute test I did twice a day (generated so the problems were new each time), but the results from that don’t really make sense to anyone but me, which is why I’ve chosen to forgo it.
In terms of saturating the learning effect, that’s a better approach, but getting people to put their time into doing that makes it even harder.
Right, Quantified Mind tests are not normed, so you couldn’t say “participants added 10 IQ points” or even “this participant went from 130 to 140”.
However, they do have a lot of data from other test-takers, so you can say, “participants increased 0.7 SDs [amidst the population of other QM subjects]” or “this participant went from +2.0 to +2.7 SDs”, broken down very specifically by subskill. You are not going to get any real statistical power using full IQ tests.
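For illustration, the arithmetic behind those SD statements is just standardization against the QM test-taker population; the norms below are made-up placeholders, not real QM numbers.
```python
# Sketch: expressing raw scores in SDs of the test-taker population.
# The mean/SD here are made-up placeholders, not real QM norms.
pop_mean, pop_sd = 420.0, 85.0      # hypothetical norms for one subtest

def to_sds(raw: float) -> float:
    """Standardize a raw score against the population of test-takers."""
    return (raw - pop_mean) / pop_sd

before, after = 590.0, 649.5        # hypothetical participant scores
print(f"{to_sds(before):+.1f} SDs -> {to_sds(after):+.1f} SDs")  # +2.0 -> +2.7
```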
In terms of saturating the learning effect, that’s a better approach, but getting people to put their time into doing that makes it even harder.
It sounds like the protocols involve hours of daily participant effort over multiple weeks. Compared to that, it seems doable to have them do 5-10 minutes of daily baseline psychometrics (which double as practice) for 2-4 weeks before the experimental protocols begin? That amount of practice washout might not be enough, but if your effects are strong, it may be.
In reality, that’s table stakes for measuring cognitive effects from anything short of the strongest of interventions (like giving vs. withholding caffeine for someone accustomed to having it). I recall the founder of Soylent approached us at the beginning, wanting to test whether it had cognitive benefits. When we told him how much testing he would need to have subjects do, he shelved the idea. A QM-like approach reduces the burden of cognitive testing as much as possible, but you can’t reduce it much further than this without giving up the power to detect effects.
On a more positive note, if you have a small number of participants who are willing to cycle your protocols for a long time, you can get a lot of power by comparing the on- and off-protocol time periods. So if this level of testing and implementation of protocols would be too daunting to consider for dozens of participants, but you have four hardcore people who can do it all for half a year, then you can likely get some very solid results.
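As a sketch of that comparison (reusing the same hypothetical long-format export as above), you would average each subject’s on- and off-protocol sessions and pair them within subject:
```python
# Sketch: within-subject comparison of on- vs. off-protocol periods.
import pandas as pd
from scipy import stats

df = pd.read_csv("sessions.csv")  # hypothetical long-format export

# One mean per subject per condition, paired within subject.
means = df.groupby(["subject_id", "on_protocol"])["score"].mean().unstack()
t, p = stats.ttest_rel(means[True], means[False])
print(f"paired t = {t:.2f}, p = {p:.4f}")
```
With only four subjects you would more realistically model each subject’s many on/off cycles directly (the mixed model sketched earlier does this), but the pairing logic is the same.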
If I sound skeptical about interventions producing measurable effects on cognitive tests, it’s because, as I recall, virtually none of the experiments we ran (on ourselves, with academic collaborators from Stanford, with QS volunteers, etc.) ever led to any significant increases. The exceptions were all around removing negative interventions (being tired, not having your normal stimulants, alcohol, etc.); the supposed positives (meditation, nootropics, music, exercise, specific nutrients, etc.) consistently either did roughly nothing or had a surprising negative effect (butter). What this all reinforced:
it’s easy to fool yourself with self-reports of cognitive performance (unreliable)
it’s easy to fool yourself with underpowered experiments (especially due to practice effects in longer and more complicated tests)
virtually no one does well-powered experiments (because, as above, it’s hard)
This gives me a strong prior against most of the “intervention X boosts cognition!” claims. (“How would you know?”)
Still, I’m fascinated by this area and would love to see someone do it right and find the right interventions. If you offset different interventions in your protocols, you can even start to measure which pieces of your overall cocktail work, in general and for specific participants, and which can be skipped or are even hurting performance. I have a very old and poorly recorded talk on a lazy way to do this.
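I don’t remember the talk’s exact scheme, but the basic idea can be sketched like this: cycle each component on a different period so their on/off patterns decorrelate, then let one regression attribute the effects. The component names and periods below are invented, and this ignores washout/carryover, which a real protocol would need to handle.
```python
# Sketch: offset each cocktail component on a different cycle length so
# their on/off patterns decorrelate; a single regression on the daily
# scores can then attribute effects per component. Names are invented.
import pandas as pd

periods = {"creatine": 6, "meditation": 10, "nootropic_x": 14}  # days per full cycle

days = range(180)  # half a year, as above
schedule = pd.DataFrame(
    {name: [(d // (p // 2)) % 2 for d in days] for name, p in periods.items()},
    index=pd.Index(days, name="day"),
)
print(schedule.head(8))

# After collecting daily scores, fit something like
#   "score ~ creatine + meditation + nootropic_x"
# per participant, or with subject random effects across participants.
```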
One last point: all of this kind of psychometric testing, like IQ tests, only measures subjects’ alert, “aroused” performance, which is close to peak performance and is very hard to affect. Even if you’re tired and not at your best, just plodding along, when someone puts a cognitive test in front of you: boom, let’s go, wake up, it’s time. Energy levels go up, the test goes well, and then it’s back to your slump. Most interventions that might make you generally more alert and significantly increase average, passive performance will end up having a negligible impact on the peak, active performance that the tests are measuring. If I were building more cognitive testing tools these days, I would try to build things that infer mental performance passively, without triggering this testing arousal. Perhaps that is where the real impacts from interventions are plentiful, strong, and useful.
I think it might be easier to improve on high-level IQ tests than low-level ones in a way that’s still real and valuable. I am not sure how one would design more practice-resistant high-level tests. It might be too hard.