CFAR’s 2015 Longitudinal Study measured the Big 5 and some other standard psychology metrics. It did find changes, including decreased neuroticism and increased conscientiousness.
For people ignorant of statistics (like me): I checked, and Cohen’s d measures the difference between the means of two distributions, scaled by their standard deviation (i.e. d = (mean1 − mean2) ÷ SD).
An effect size in the range 0.2–0.5 is considered ‘small’ by Cohen. Conscientiousness increased with d = 0.24 (p < 0.001), which seems notable to me for such a brief intervention (~4 days), though I haven’t got much intuition around these things and maybe it’s actually quite a big intervention. Neuroticism decreased with d = 0.17.
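To make the formula concrete, here’s a minimal sketch of computing d from pre/post scores. The function and numbers are made up for illustration (using a simple pooled-SD variant); this is not CFAR’s data or analysis code:

```python
import statistics

def cohens_d(before, after):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    mean_diff = statistics.mean(after) - statistics.mean(before)
    # Pool the two sample SDs (equal-sample-size simplification)
    pooled_sd = ((statistics.stdev(before) ** 2 + statistics.stdev(after) ** 2) / 2) ** 0.5
    return mean_diff / pooled_sd

# Hypothetical pre/post conscientiousness scores (illustrative only)
before = [3.1, 3.4, 2.9, 3.6, 3.2, 3.0, 3.5, 3.3]
after = [3.3, 3.5, 3.1, 3.7, 3.4, 3.2, 3.6, 3.5]
print(round(cohens_d(before, after), 2))  # ≈ 0.72 on these made-up numbers
```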
If you want to jump to the bit in the above article that lists all the effect sizes measured, search for ‘The table below summarizes the study results’. The biggest effect sizes were (I think) in the places where CFAR tried to come up with their own metrics to better measure what they’re trying to do.
The biggest effect size was on one of these measures, for the metric ‘Emotions Help Rather Than Hinder’, and I’ll quote that section in full below.
Emotions Help Rather Than Hinder
We created a single item measure of how participants relate to their emotions: “When you feel emotions, do they mostly help or hinder you in pursuing your goals?” The CFAR workshop emphasizes the value of emotions as sources of data and motivation, and includes techniques for increasing the alignment between one’s emotions, goals, and behavior. Research on emotion regulation provides some of the theoretical background for this approach, with the findings that it generally is not helpful to suppress emotions and it often is helpful to reframe situations so that they elicit different emotions (Gross, 2002).
There was an increase in the extent to which participants evaluated their emotions as helping them rather than hindering them, d = 0.41 (t(129) = 4.13, p < .001).
I feel like I’ll be bringing this up in conversation any time that anyone says the rationality community is too focused on hyper-logical/S2 thinking.
Anyway, this is just me highlighting some stuff I found interesting, to overcome trivial inconveniences for others, but there’s a ton more interesting stuff in the post.
I think it’s very confusing to call d = 0.2 to 0.5 ‘small’, especially in the context of a 4-day workshop. Imagine the variable is IQ. Then a ‘small’ effect increases IQ by 3 to 7.5 points. That boost in IQ would be much better described as ‘huge’. However, IQ has a relatively large standard deviation compared to its mean (roughly 15 and 100).
Let’s look at male height. In the USA, male height has a mean around 70 inches and a standard deviation around 4 inches. (Note that 4/70 is about 38% of 15/100.) A d of 0.2 to 0.5 would correspond to an increase in height of 0.8 to 2 inches. Some people are willing to undergo costly, time-consuming, and painful leg-lengthening surgery to gain 4–5 inches of height. If a four-day, $4,000 workshop increased your height by 0.8 to 2 inches, millions of men would be on the waiting list. I know I would be. That doesn’t really sound ‘small’ to me.
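The conversion in both examples is just d multiplied by the standard deviation. A quick sketch, using the rough SDs quoted above:

```python
def raw_change(d, sd):
    """Convert a standardized effect size d back into raw units."""
    return d * sd

# Rough SDs from the examples above: 15 IQ points, 4 inches of male height
for label, sd in [("IQ points", 15), ("inches of height", 4)]:
    lo, hi = raw_change(0.2, sd), raw_change(0.5, sd)
    print(f"d = 0.2 to 0.5 -> {lo:g} to {hi:g} {label}")
# d = 0.2 to 0.5 -> 3 to 7.5 IQ points
# d = 0.2 to 0.5 -> 0.8 to 2 inches of height
```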
IQ has a relatively large standard deviation compared to its mean
No, the mean here is an arbitrary convention, so 15 and 100 don’t tell us anything relevant. The appropriate comparison is to what other interventions have accomplished.
Here’s a nice visualisation of what d represents.
For an effect size of 0.4, it shows that about 66% of the study group lands above the 50th percentile of the control group.
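That figure is easy to check. A minimal sketch, assuming both groups are normally distributed with equal SDs (under which the fraction of the study group above the control median is Φ(d)):

```python
from math import erf, sqrt

def frac_above_control_median(d):
    """Fraction of the study group scoring above the control group's median,
    assuming both groups are normal with equal SDs: the standard normal CDF at d."""
    return 0.5 * (1 + erf(d / sqrt(2)))

print(f"{frac_above_control_median(0.4):.0%}")  # 66%
```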
Thanks, I didn’t know about the 2015 study :-)
Any plans to track the 2015 study measures on a rolling basis? (i.e. for every workshop cohort?) Seems useful to measure performance over time.
Why did the 2017 impact report move away from measuring Big 5 traits? (and the other measures looked at in 2015?)
Any thoughts re: using Raven’s Matrices?
Raven’s matrices would seem like an odd choice, since they measure fluid intelligence and basically every attempt to increase fluid intelligence by training has failed. Given that CFAR’s curriculum isn’t specifically targeting fluid intelligence and there’s no reason to expect their interventions to affect it in particular, it would be really unlikely for improvements to show up in Raven’s.
Another thing I’d be particularly interested in is longer term follow-up. It would be impressive if the changes to conscientiousness etc. observed in the 2015 study persist now.
(This is Dan from CFAR)
Here are relatively brief responses on these 3 particular points; I’ve made a separate comment laying out my thinking on metrics like the Big 5, which provides some context for these responses.
We have continued to collect measures like the ones in the 2015 longitudinal study. We are mainly analyzing them in large batches, rather than workshop to workshop, because the sample size isn’t big enough to distinguish signal from noise for single workshops. One of the projects that I’m currently working on is an analysis of a couple years of these data.
The 2017 impact report was not intended as a comprehensive account of all of CFAR’s metrics; it focused on CFAR’s EA impact. So it looked at the data most directly related to CFAR alums’ impact on the world, and “on average alums have some increase in conscientiousness” seemed less relevant than the information that we did include. The first few paragraphs of the report say more about this.
I’m curious why you’re especially interested in Raven’s Progressive Matrices. I haven’t looked closely at the literature on it, but my impression is that it’s one of many metrics which are loosely related to the thing that we mean by “rationality.” It has the methodological advantage of being a performance score rather than self-report (though this is partially offset by the possibility of practice effects and effort effects). The big disadvantage is the one that Kaj pointed to: it seems to track relatively stable aspects of a person’s thinking skills, and might not change much even if a person made large improvements. For instance, I could imagine a person developing MacGyver-level problem-solving ability while having little or no change in their Raven’s score.
Thanks for this comment; I found it really useful :-)
I’m curious why you’re especially interested in Raven’s Progressive Matrices.
In part interested because it’s a performance measure rather than self-report.
Also, speaking from my experience: my performance on tests like Raven’s has been heavily mediated by things that don’t seem directly related to g, and that I’d imagine could be affected by CFAR’s curriculum.
e.g. I perform better on tests like Raven’s when I’m feeling low-anxiety & emotionally cohered. (Seems plausible that CFAR could lower anxiety & increase emotional coherence.)