Sam Bowman comments on Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

Sam Bowman Nov 23, 2022, 11:02 PM
LW: 8 AF: 5
13
+1. The combination of the high dollar amount, the subjective criteria, and the panel drawn from the relatively small/insular ‘core’ AI safety research community means that I expect this to look pretty fishy to established researchers. Even if the judgments are fair (I think they probably will be!) and the contest yields good work (it might!), I expect the benefit of that to be offset to a pretty significant degree by the red flags this raises about how the AI safety scene deals with money and its connection to mainstream ML research.
(To be fair, I think the Inverse Scaling Prize, which I’m helping with, raises some of these concerns, but the more precise/partially-quantifiable prize rubric, bigger/more diverse panel, and use of additional reviewers outside the panel mitigate them at least partially.)
- Akash Nov 24, 2022, 12:05 AM
  LW: 23 AF: 6
  8
  Hastily written; may edit later
  Thanks for mentioning this, Jan! We’d be happy to hear suggestions for additional judges. Feel free to email us at akash@alignmentawards.com and olivia@alignmentawards.com.
  Some additional thoughts:
  1. We chose judges primarily based on their expertise and (our perception of) their ability to evaluate submissions about goal misgeneralization and corrigibility. Lauro, Richard, Nate, and John are some of the few researchers who have thought substantially about these problems. In particular, Lauro first-authored the first paper about goal misgeneralization and Nate first-authored a foundational paper about corrigibility.
  2. We think the judges do have some reasonable credentials (e.g., Richard works at OpenAI, Lauro is a PhD student at the University of Cambridge, Nate Soares is the Executive Director of a research organization & he has an h-index of 12, as well as 500+ citations). I think the contest meets the bar of “having reasonably well-credentialed judges” but doesn’t meet the bar of “having extremely well-credentialed judges” (e.g., well-established professors with thousands of citations). I think that’s fine.
  3. We got feedback from several ML people before launching. We didn’t get feedback that this looks “extremely weird” (though I’ll note that research competitions in general are pretty unusual).
  4. I think it’s plausible that some people will find this extremely weird (especially people who judge things primarily based on the cumulative prestige of the associated parties & don’t think that OpenAI/500 citations/Cambridge are enough), but I don’t expect this to be a common reaction.
  Some clarifications + quick thoughts on Sam’s points:
  1. The contest isn’t aimed primarily/exclusively at established ML researchers (though we are excited to receive submissions from any ML researchers who wish to participate).
  2. We didn’t optimize our contest to attract established researchers. Our contests are optimized to take questions that we think are at the core of alignment research and present them in a (somewhat less vague) format that gets more people to think about them.
  3. We’re excited that other groups are running contests that are designed to attract established researchers & present different research questions.
  4. All else equal, we think that precise/quantifiable grading criteria & a diverse panel of reviewers are preferable. However, in our view, many of the core problems in alignment (including goal misgeneralization and corrigibility) have not been sufficiently well-operationalized to have precise/quantifiable grading criteria at this stage.
  - JanB Nov 24, 2022, 2:47 PM
    LW: 8 AF: 4
    10
    This response does not convince me.
    
    Concretely, I think that if I showed the prize to people in my lab and they actually looked at the judges (and I had some way of eliciting honest responses from them), >60% would react along the lines of what Sam and I described (i.e. seeing this prize as evidence that AI alignment concerns are mostly endorsed by (sometimes rich) people who have no clue about ML; or that the alignment community is dismissive of academia/peer-reviewed publishing/mainstream ML/default ways of doing science; or … ).
    
    Your point 3.) about the feedback from ML researchers could convince me that I’m wrong, depending on whom exactly you got feedback from and what that looked like.
    
    By the way, I’m highlighting this point in particular not because it’s highly critical (I haven’t thought much about how critical it is), but because it seems relatively easy to fix.