Mentors rated their enthusiasm for their scholars to continue with their research at 7⁄10 or greater for 94% of scholars.
Congrats on another successful program!
What is it at 9⁄10 or greater? My understanding is that 7⁄10 and 8⁄10 are generally viewed as ‘neutral’ scores, and this is more like “6% of scholars failed” than it is “94% of scholars succeeded.” (It looks like averages of roughly 8 are generally viewed as ‘high’ in this postmortem, so this population might be tougher raters than in other contexts, in which case I’m wrong about what counts as ‘neutral’.)
Cheers, Vaniver! As indicated in the figure legend for “Mentor ratings of scholar research”, mentors were asked, “Taking the above [depth/breadth/taste ratings] into account, how strongly do you support the scholar’s research continuing?” and prompted with:
10⁄10 = Very disappointed if [the research] didn’t continue;
5⁄10 = On the fence, unsure what the right call is;
1⁄10 = Fine if research doesn’t continue.
Mentors rated 18% of scholar research projects as 10⁄10 and 28% as 9⁄10.
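(So 18% + 28% = 46% of projects were rated 9⁄10 or higher.)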
Thanks!
fwiw that’s actually not that cruxy for me – questions like this are typically framed as if a 5 is “average”, but my understanding/experience is that people still tend to give somewhat inflated scores.
(i.e. the NPS score, “on a scale of 1-10 how likely are you to recommend this to a friend?” ranking system counts 9 and 10 as positive, 7 and 8 as neutral, and 6-and-below as negative. This is a different question than the one you asked here, but I think the same general principles apply, that there’s some natural grade inflation that you probably need to counteract in some way)
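(For concreteness, here’s a minimal sketch of that standard NPS bucketing, assuming ratings on a 1-10 scale; the function name and input list are hypothetical:)

```python
def nps(ratings):
    # Standard Net Promoter bucketing: 9-10 = promoters,
    # 7-8 = passives (count only toward the denominator),
    # 6 and below = detractors.
    promoters = sum(r >= 9 for r in ratings)
    detractors = sum(r <= 6 for r in ratings)
    return 100 * (promoters - detractors) / len(ratings)
```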
For what it’s worth, as a MATS mentor, I gave a bunch of 7s and 8s to people I’m excited about, and felt bad giving people 9s or 10s unless it was super obviously justified.
That does update me a bit.
FYI, the Net Promoter score is 38%.
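(Under the standard definition above, NPS = %promoters - %detractors, so with 46% of ratings at 9 or 10, a score of 38 would imply roughly 8% of ratings at 6 or below: 46 - 8 = 38.)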
(fyi, it looks like the overall outcome here is pretty good, i.e. 46% of scholars getting a 9 or 10 seems significant. But the framing of the overview section at the beginning feels like it’s trying to oversell me on something.)
Do you think “46% of scholar projects were rated 9⁄10 or higher” is better? What about “scholar projects were rated 8.1/10 on average”?
I think the practice that’d probably make the most sense to me is just reporting the average for each thing, without making much of a claim about what it means.
Those do sound like pretty good actual numbers for the 9s and 10s, although I’m confused about how they map onto the graph.
Yeah, I just realized the graph is wrong; it seems like the 10⁄10 scores were truncated. We’ll upload a new graph shortly.
Ok, graph is updated!
We also asked mentors to rate scholars’ “depth of technical ability,” “breadth of AI safety knowledge,” “research taste,” and “value alignment.” We omitted these results from the report to prevent bloat, but your comment makes me think we should re-add them.
Ok, added!