No standard metric for CFAR workshops?
Update: CFAR used standard metrics in its 2015 study, which I didn’t know about when drafting this post. It doesn’t appear that they tracked these metrics in their most recent impact report.
My outstanding questions are in this comment.
Update #2: CFAR replies to outstanding questions here.
It seems strange that CFAR doesn’t use a standardized metric to track the impact of its workshops over time.
From CFAR’s mission statement:
CFAR exists to try to make headway in this domain – the domain of understanding how human cognition already works, in practice, such that we can then start the process of making useful changes, such that we will be better positioned to solve the problems that really matter.
A couple of frameworks from psychology could serve as useful metrics for assessing progress towards this mission:
Was there a change in a participant’s big 5 personality traits after workshop attendance?
Was there a change in a participant’s Raven’s Progressive Matrices performance after workshop attendance?
As far as I can tell, CFAR’s thesis is that cognitive changes will drive changes in behavior (and correspondingly, impact on the world).
I’d expect big cognitive changes to result in changes in big 5 personality traits. Specifically, I’d expect improved cognition to result in decreased neuroticism & increased conscientiousness.
I’d also expect big cognitive changes to result in improved performance on the Raven’s Matrices.
In other words, if CFAR workshops drive big changes in cognition, I’d expect these changes to reflect on well-validated psychological measures. If there’s not a before-workshop/after-workshop change in these measures, that would be evidence that CFAR workshops are not causing big cognitive changes in workshop participants.
As far as I know, workshop participants aren’t being assessed on measures like this, so it’s hard to know what impact the workshops are actually having.
CFAR’s current metric is “increase in expected impact” or IEI:
In May 2016, we set out to count the number of alumni who have had an increase in expected impact due to their involvement with CFAR by sending out a survey to our alumni...
For each person’s responses, we manually coded whether it seemed like 1) their current path was high-impact, 2) their current path was substantially better than their old path, and 3) CFAR played a significant role in this change. We counted someone as having an “increase in expected impact” (IEI) if they met all three criteria.
18% of workshop participants surveyed had an IEI.
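The coding rule quoted above is a simple conjunction of three judgments. A minimal sketch (the field names are hypothetical, not CFAR’s actual survey schema):

```python
# Sketch of the IEI coding rule: a respondent counts as an "increase in
# expected impact" only if all three manually-coded criteria hold.
# Field names and sample data below are illustrative, not CFAR's.

def has_iei(high_impact_path, substantially_better, cfar_significant_role):
    """True only if all three coded criteria are met."""
    return high_impact_path and substantially_better and cfar_significant_role

# Toy coded responses: (high-impact path, better than old path, CFAR's role)
responses = [
    (True, True, True),    # meets all three -> counts as IEI
    (True, True, False),   # CFAR didn't play a significant role -> no IEI
    (False, True, True),   # current path not high-impact -> no IEI
]
iei_rate = sum(has_iei(*r) for r in responses) / len(responses)
print(round(iei_rate, 2))  # 0.33
```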
A metric like IEI is better than no metric at all, but it suffers from limitations:
IEI relies on retrospective self-report (i.e. at the time of survey, respondent thinks back to what they were doing before their workshop, what the workshop was like, and what they did afterwards. They then synthesize all this into a story about what change the workshop had.)
In contrast, a big 5 trait survey relies on immediate self-report (i.e. at time of survey, respondent answers about how things are for them right then).
This strikes me as more reliable than retrospective self-report.
A Raven’s Matrices test relies on cognitive performance at time of survey, which seems even higher signal than a big 5 survey.
IEI was developed in-house, so it is hard to compare the CFAR workshop to other interventions on the basis of IEI.
In contrast, many interventions measure change in big 5 traits & Raven’s Matrices performance.
I’m bringing this up because I found the lack of a standardized, well-validated metric surprising when I thought about it.
It seems plausible that CFAR is already tracking metrics like this privately. If that’s the case, I’m curious why they are held privately.
It could also be that CFAR isn’t tracking an outcome metric like big 5 trait change or Raven’s Matrices performance. If that’s the case, I’m curious about why not – the surveys are cheap to administer, and it seems like they would yield valuable information about how CFAR is doing.
(This is Dan from CFAR)
CFAR’s 2015 Longitudinal Study measured the Big 5 and some other standard psychology metrics. It did find changes including decreased neuroticism and increased conscientiousness.
For people ignorant of statistics (like I am): I checked, and d measures the difference in means between two distributions, scaled by the standard deviation (i.e. (mean1 − mean2) ÷ standardDeviation).
An effect size in the range 0.2−0.5 is considered ‘small’ by Cohen. Conscientiousness increased (p < 0.001) with d = 0.24, which seems notable to me for such a small intervention (~4 days), though I haven’t got much intuition around these things and maybe it’s actually quite a big intervention. Neuroticism decreased with d = 0.17.
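The arithmetic behind d is straightforward. A minimal sketch with made-up numbers (illustrative only, not CFAR’s raw data):

```python
# Cohen's d: standardized difference between two group means.
# The numbers below are invented to reproduce d = 0.24, the
# conscientiousness effect size the 2015 study reported.

def cohens_d(mean_post, mean_pre, pooled_sd):
    """Standardized mean difference: (mean1 - mean2) / SD."""
    return (mean_post - mean_pre) / pooled_sd

# Example: a trait scored on a 1-5 scale with pooled SD of 0.5.
# A post-workshop mean 0.12 points above baseline gives d = 0.24.
d = cohens_d(3.62, 3.50, 0.5)
print(round(d, 2))  # 0.24
```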
If you want to jump to the bit in the above article that lists all the effect sizes measured, search for ‘The table below summarizes the study results’. The biggest effect sizes were in the places (I think) where CFAR tried to come up with their own metrics to measure what they’re trying to do better.
The biggest effect size was on one of these measures, for the metric ‘Emotions Help Rather Than Hinder’, and I’ll quote that section in full below.
I feel like I’ll be bringing this up in conversation any time that anyone says the rationality community is too focused on hyper-logical/S2 thinking.
Anyway, this is just me highlighting some stuff I found interesting, to overcome trivial inconveniences for others, but there’s a ton more interesting stuff in the post.
I think it’s very confusing to call d = 0.2 to 0.5 ‘small’, especially in the context of a 4-day workshop. Imagine the variable is IQ. Then a ‘small’ effect increases IQ by 3 to 7.5 points. That boost in IQ would be much better described as ‘huge’. However, IQ has a relatively large standard deviation compared to its mean (roughly 15 and 100).
Let’s look at male height. In the USA male height has a mean around 70 inches and a standard deviation around 4 inches. (Note 4⁄70 is 38% of 15⁄100.) A d of 0.2 to 0.5 would correspond to an increase in height of 0.8 to 2 inches. Some people are willing to undergo costly, time-consuming and painful leg-lengthening surgery to gain 4-5 inches of height. If a four-day, $4,000 workshop increased your height by 0.8 to 2 inches, millions of men would be on the waiting list. I know I would be. That doesn’t really sound ‘small’ to me.
No, the mean here is an arbitrary convention, so 15 and 100 don’t tell us anything relevant. The appropriate comparison is to what other interventions have accomplished.
Here’s a nice visualisation of what d represents.
For an effect size of 0.4, this means that 66% of the study group will have results above the 50th percentile of the control.
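Under a normal model with equal variances, that 66% figure falls out of the standard normal CDF: if the treated distribution is shifted up by d standard deviations, the fraction of it above the control median is Φ(d). A quick check:

```python
# Sanity check of the "66%" figure: with the treated group shifted up
# by d standard deviations (normal model, equal variances), the share
# of it above the control group's median is Phi(d).
from statistics import NormalDist

d = 0.4
frac_above_control_median = NormalDist().cdf(d)
print(round(frac_above_control_median, 3))  # 0.655, i.e. about 66%
```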
Thanks, I didn’t know about the 2015 study :-)
Any plans to track the 2015 study measures on a rolling basis? (i.e. for every workshop cohort?) Seems useful to measure performance over time.
Why did the 2017 impact report move away from measuring Big 5 traits? (and the other measures looked at in 2015?)
Any thoughts re: using Raven’s Matrices?
Raven’s matrices would seem like an odd choice, since they measure fluid intelligence and basically every attempt to increase fluid intelligence by training has failed. Given that CFAR’s curriculum isn’t specifically targeting fluid intelligence and there’s no reason to expect their interventions to affect it in particular, it would be really unlikely for improvements to show up in Raven’s.
Another thing I’d be particularly interested in is longer term follow-up. It would be impressive if the changes to conscientiousness etc. observed in the 2015 study persist now.
Here are relatively brief responses on these 3 particular points; I’ve made a separate comment which lays out my thinking on metrics like the Big 5 which provides some context for these responses.
We have continued to collect measures like the ones in the 2015 longitudinal study. We are mainly analyzing them in large batches, rather than workshop to workshop, because the sample size isn’t big enough to distinguish signal from noise for single workshops. One of the projects that I’m currently working on is an analysis of a couple years of these data.
The 2017 impact report was not intended as a comprehensive account of all of CFAR’s metrics; it was just focused on CFAR’s EA impact. So it looked at the data that were most directly related to CFAR alums’ impact on the world, and “on average alums have some increase in conscientiousness” seemed less relevant than the information that we did include. The first few paragraphs of the report say more about this.
I’m curious why you’re especially interested in Raven’s Progressive Matrices. I haven’t looked closely at the literature on it, but my impression is that it’s one of many metrics which are loosely related to the thing that we mean by “rationality.” It has the methodological advantage of being a performance score rather than self-report (though this is partially offset by the possibility of practice effects and effort effects). The big disadvantage is the one that Kaj pointed to: it seems to track relatively stable aspects of a person’s thinking skills, and might not change much even if a person made large improvements. For instance, I could imagine a person developing MacGyver-level problem-solving ability while having little or no change in their Raven’s score.
Thanks for this comment; I found it really useful :-)
In part interested because it’s a performance measure rather than self-report.
Also speaking from my experience, my performance on tests like Raven’s has been heavily mediated by things that don’t seem directly related to g, and that I’d imagine could be affected by CFAR’s curriculum.
e.g. I perform better on tests like Raven’s when I’m feeling low-anxiety & emotionally cohered. (Seems plausible that CFAR could lower anxiety & increase emotional coherence.)
Here’s a sketch of my thinking about the usefulness of metrics like the Big 5 for what CFAR is trying to do.
It would be convenient if there was a definitive measure of a person’s rationality which closely matched what we mean by the term and was highly sensitive to changes. But as far as I can tell there isn’t one, and there isn’t likely to be one anytime soon. So we rely on a mix of indicators, including some that are more like systematic metrics, some that are more like individuals’ subjective impressions, and some that are in between.
I think of the established psychology metrics (Big 5, life satisfaction, general self-efficacy, etc.) as primarily providing a sanity check on whether the workshop is doing something, along with a very very rough picture of some of what it is doing. They are quantitative measures that don’t rely on staff members’ subjective impressions of participants, they have been validated (at least to some extent) in existing psychology research, and they seem at least loosely related to the effects that CFAR hopes to have. And, compared to other ways of evaluating CFAR’s impact on individuals, they’re relatively easy for an outsider to make sense of.
A major limitation of these established psychology metrics is that they haven’t been that helpful as feedback loops. One of the main purposes of a metric is to provide input into CFAR’s day-to-day and workshop-to-workshop efforts to develop better techniques and refine the workshop. That is hard to do with metrics like the ones in the longitudinal study, because of a combination of a few factors:
The results aren’t available until several months after the workshop, which would make for very slow feedback loops and iteration.
The results are too noisy to tell if changes from one workshop to the next are just random variation. It takes several workshops worth of data to get a clear signal on most of the metrics.
These metrics are only loosely related to what we care about. If a change to the workshop leads to larger increases in conscientiousness that does not necessarily mean that we want to do it, and when a curriculum developer is working on a class they are generally not that interested in these particular metrics.
These metrics are relatively general/coarse indicators of the effect of the workshop as a whole, not tied to particular inputs. So (for example) if we make some changes to the TAPs class and want to see if the new version of the class works better or worse, there isn’t a metric that isolates the effects of the TAPs class from the rest of the workshop.
I suppose of late I’ve updated toward thinking that pre-paradigmatic research is the very sort where you’re trying to come up with a metric to validate, not the type of research where you start with one. Note that if a project claims to be doing this kind of work, you need a much larger amount of evidence to determine that the project is doing useful things, and your prior will always be very low.
In the particular case, I expect that if CFAR did put in the work to measure e.g. big five changes, even if they found a notable effect size they’d say “Well, this doesn’t confirm for me that we’re on the right track, because this isn’t at all what I’m trying to measure” or something.
Added: Though I too would be really interested to know if CFAR workshops lead to any substantial changes in Raven’s Matrices, and to a lesser-but-still-large extent any changes in Big 5.
This makes sense, though even if doing the pre-paradigmatic thing it seems useful & low-cost to benchmark your performance on existing metrics.
In this specific case, I bet workshop participants would actually find it fun + worthwhile to take before/after big 5 & Raven’s surveys, so it could be a value-add in addition to a benchmarking metric.
Note that workshop participants already do a fair amount of answering questions beforehand (and a year later) to give a sense of how they progress, which I think actually ties in more with what the program is supposed to teach.
(My recollection was that the survey approximately maxed out the amount of time/attention I was willing to spend on surveys, although I’m not sure)
Huh. I’m surprised that after finding significant changes on well-validated psychological instruments in the 2015 study, CFAR didn’t incorporate these instruments into their pre- / post-workshop assessments.
Also surprised that they dropped them from the 2017 impact analysis.
The 2017 impact analysis seems to be EA safety focused. When their theory of impact is about EA safety it’s plausible to me that this made analysis by standard metrics less important for them.
Do you mean “AI safety focused”?
It’s no excuse (for not testing), but worth mentioning. The last time I looked at this (a few years ago), I noticed that it’s really hard to test.
I spent some time (hours) trying to design a test and I struggled to get anything useful.
Do you have suggestions for where to acquire either of the tests mentioned? Are there expected differences in results if the tests are administered by laypeople rather than experienced practitioners?
Just want to note that I’m sorta disappointed by CFAR’s response here.
Dan responded very quickly to my initial post (very quick = within 30 minutes), pointing to CFAR’s 2015 impact assessment. (I didn’t know about the 2015 assessment and was grateful to be pointed to it.)
But as far as I can tell, no one affiliated with CFAR engaged with the discussion that followed. A bunch of follow-up questions fell out of that discussion; I’m sad that no one from CFAR fielded them.
I’m parsing the very quick initial response + lack of follow-on engagement as CFAR acting defensively, rather than engaging in open discourse. This is disappointing.
Also this is the internet – I’m open to the possibility that I’m misinterpreting things here.
Note: I think it should be very clear that responding to a LessWrong comment does not commit you to hours of discussion that you might or might not find valuable. If every comment I make is a commitment to LessWrong to spend multiple hours on follow-up comments, then I would not participate in commenting.
I definitely reserve the right to disengage from any thread if I don’t expect further commenting to be valuable, and strongly encourage others to do the same.
Makes sense – there’s really no commitment mechanism at play here.
I still find it disappointing though.
Normal explanation for these things is simply that people are busy, and good communication takes a lot of time. Of course, with this being the internet, it’s also the case that responses get easily misinterpreted which means you have to put in more time, further disincentivising response.
Added: Note that LW isn’t a project any of the CFAR team work on, so they wouldn’t naturally be checking LW or trying to use the platform to talk about their research, in case you were expecting them to be actively interested in discussion. They’ve got jobs, and public discussion mostly isn’t one of them right now, I think.
Eh, I think good communication takes time, but not an inordinate amount of time.
For a consumer-facing organization like CFAR, being able to clearly articulate why you’re doing what you’re doing is a core competency, so I’d expect them to bring that to bear in a place like LessWrong, where a lot of CFAR stakeholders hang out.
Sure, but they replied within a half hour of me posting.
The fast response + lack of follow-up feels more defensive than if they hadn’t replied at all.
“A quick, short response that mostly consists of a link, followed by delayed/no response when asked follow-up questions that would require a more time-consuming answer” is exactly the pattern I would expect from someone who was somewhat interested in answering but was busy.
If there’s a question that you can address in two minutes by just giving a link, then you will do that. But then you don’t have the time for anything more elaborate so you either do it later or forget about it entirely (either intentionally because it doesn’t seem like a high priority or unintentionally because you just had too many other things on your mind).
Agreed.
It’s also the pattern I’d expect from someone who wasn’t interested in engaging, but wanted to give the impression that they’ve got it covered / already thought about this / have good reasons for doing what they’re doing.
I’m not sure which is closer to the truth here.
“Uninterested in engaging” and “defensive” feel very different to me. “Defensive” implies that you’re motivated to prove something, either to yourself or others. But you can be uninterested in engaging despite being confident that you’re doing the right thing, simply because it doesn’t feel like you’d get a lot of value in explaining yourself in this particular discussion. There’s a lot of things that employees for organizations like CFAR could be doing; getting into long online discussions may not be the best use of their time, even if they had perfectly good answers.
The word “defensively” usually maps to active behavior. I think it’s a stretch to use it for talking about low engagement.