Mentors rated their enthusiasm for their scholars to continue with their research at 7⁄10 or greater for 94% of scholars.
Congrats on another successful program!
What is it at 9⁄10 or greater? My understanding is that 7⁄10 and 8⁄10 are generally viewed as ‘neutral’ scores, and this is more like “6% of scholars failed” than it is “94% of scholars succeeded.” (It looks like averages of roughly 8 are generally viewed as ‘high’ in this postmortem, so this population might be tougher raters than in other contexts, in which case I’m wrong about what counts as ‘neutral’.)
Cheers, Vaniver! As indicated in the figure legend for “Mentor ratings of scholar research”, mentors were asked, “Taking the above [depth/breadth/taste ratings] into account, how strongly do you support the scholar’s research continuing?” and prompted with:
10⁄10 = Very disappointed if [the research] didn’t continue;
5⁄10 = On the fence, unsure what the right call is;
1⁄10 = Fine if research doesn’t continue.
Mentors rated 18% of scholar research projects as 10⁄10 and 28% as 9⁄10.
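(So 18% + 28% = 46% of projects were rated 9⁄10 or higher.)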
Thanks!
fwiw that’s actually not that cruxy for me – questions like this are typically framed as if a 5 is “average”, but my understanding/experience is that people still tend to give somewhat inflated scores.
(i.e. the NPS score, “on a scale of 1-10 how likely are you to recommend this to a friend?” ranking system counts 9 and 10 as positive, 7 and 8 as neutral, and 6-and-below as negative. This is a different question than the one you asked here, but I think the same general principles apply, that there’s some natural grade inflation that you probably need to counteract in some way)
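(For concreteness, here’s a minimal sketch of that standard NPS bucketing, assuming ratings on a 1-10 scale; the function name and input list are hypothetical:)

```python
def nps(ratings):
    # Standard Net Promoter bucketing: 9-10 = promoters,
    # 7-8 = passives (count only toward the denominator),
    # 6 and below = detractors.
    promoters = sum(r >= 9 for r in ratings)
    detractors = sum(r <= 6 for r in ratings)
    return 100 * (promoters - detractors) / len(ratings)
```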
For what it’s worth, as a MATS mentor, I gave a bunch of 7s and 8s to people I’m excited about, and felt bad giving people 9s or 10s unless it was super obviously justified.
That does update me a bit.
FYI, the Net Promoter score is 38%.
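(Under the standard definition above, NPS = %promoters - %detractors, so with 46% of ratings at 9 or 10, a score of 38 would imply roughly 8% of ratings at 6 or below: 46 - 8 = 38.)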
(fyi, it looks like the overall outcome here is pretty good, i.e. 46% of scholars getting a 9 or 10 seems significant. But the framing of the overview section at the beginning feels like it’s trying to oversell me on something.)
Do you think “46% of scholar projects were rated 9⁄10 or higher” is better? What about “scholar projects were rated 8.1/10 on average”?
I think the practice that’d probably make the most sense to me is just reporting the average for each thing, without making much of a claim about what it means.
Those do sound like pretty good actual numbers for the 9s and 10s, although I’m confused about how they map onto the graph.
Yeah, I just realized the graph is wrong; it seems like the 10⁄10 scores were truncated. We’ll upload a new graph shortly.
Ok, graph is updated!
We also asked mentors to rate scholars’ “depth of technical ability,” “breadth of AI safety knowledge,” “research taste,” and “value alignment.” We omitted these results from the report to prevent bloat, but your comment makes me think we should re-add them.
Ok, added!