I’ve been trying to articulate some thoughts since Rohin’s original comment, and maybe going to just rant-something-out now.
On one hand: I don’t have a confident belief that writing in-depth reviews is worth Buck or Rohin’s time (or their immediate colleagues’ time, for that matter). It’s a lot of work, and there’s a lot of other stuff worth doing. And I know at least Buck and Rohin have already spent quite a lot of time arguing about the deep conceptual disagreements for many of the top-voted posts.
On the other hand, the combination of “there’s stuff epistemically wrong or confused or sketchy about LW” and “I don’t trust a review process to actually work, because I don’t believe it’ll produce better epistemics than what has already been demonstrated” seems a combination of “self-defeatingly wrong” and “also just empirically (probably) wrong”.
Presumably Rohin and Buck and similar colleagues think they have at least (locally) better epistemics than the writers they’re frustrated by.
I’m guessing your take is like “I, Buck/Rohin, could write a review that was epistemically adequate, but I’m busy and don’t expect it to accomplish anything that useful.” Assuming that’s a correct characterization, I don’t necessarily disagree (at least not confidently). But something about the phrasing feels off.
Some reasons it feels off:
Even if there are clusters of research that seem too hopeless to be worth engaging with, I’d be very surprised if there weren’t at least some clusters of research that Rohin/Buck/etc are more optimistic about. If what happens is “people write reviews of the stuff that feels real/important enough to be worth engaging with”, that still seems valuable to me.
It seems like people are sort of treating this like a stag-hunt, and it’s not worth participating if a bunch of other effort isn’t going in. I do think there are network effects that make it more valuable as more people participate. But I also think “people incrementally do more review work each year as it builds momentum” is pretty realistic, and I think individual thoughtful reviews are useful in isolation for building clarity on individual posts.
The LessWrong/Alignment Review process is pretty unopinionated at the moment. If you think a particular type of review is more valuable than other types, there’s nothing stopping you from doing that type of review.
If the highest review-voted work is controversial, I think it’s useful for the field orienting to know that it’s controversial. It feels pretty reasonable to me to publish an Alignment Forum Journal-ish-thing that includes the top-voted content, with short reviews from other researchers saying “FYI I disagree conceptually here about this post being a good intellectual output.”
(or, stepping out of the LW-review frame: if the alignment field is full of controversy and people who think each other are confused, I think this is a fairly reasonable fact to come out of any kind of review process)
I’m skeptical that the actual top-voted posts trigger this reaction. At the time of this post, the top voted posts were:
ARC’s first technical report: Eliciting Latent Knowledge
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
Another (outer) alignment failure story
Finite Factored Sets
Ngo and Yudkowsky on alignment difficulty
Of those, my guess is that only “Ngo and Yudkowsky on alignment difficulty” and maybe “Another (outer) alignment failure story” have the qualities I understood Rohin to be commenting on. (Which isn’t to say I expect him to think the other posts are all valuable, I’m not sure, just that they don’t seem to have the particular failure modes he was frustrated by.)
I do think a proper alignment review should likely have more content that wasn’t published on the Alignment Forum. This was technically possible this year (we allowed people to submit non-LW content during the nomination phase), but we didn’t promote it very heavily, and we didn’t frame it to various researchers as a “please submit all alignment progress you think was particularly noteworthy” request.
I don’t know that the current review process is great, but, again, it’s fairly unopinionated and leaves plenty of room to be-the-change-you-want-to-see in the alignment scene meta-reflection.
(aside: I apologize for picking on Rohin and Buck when they bothered to stick their neck out and comment, presumably there are other people who feel similarly who didn’t even bother commenting. I appreciate you sharing your take, and if this feels like dragging you into something you don’t wanna deal with, no worries. But, I think having concrete people/examples is helpful. I also think a lot of what I’m saying applies to people I’d characterize as “in the MIRI camp”, who also haven’t done much reviewing, although I’d frame my response a bit differently)
Sorry, didn’t see this until now (didn’t get a notification, since it was a reply to Buck’s comment).
I’m guessing your take is like “I, Buck/Rohin, could write a review that was epistemically adequate, but I’m busy and don’t expect it to accomplish anything that useful.”
In some sense yes, but also, looking at posts I’ve commented on in the last ~6 months, I have written several technical reviews (and nontechnical reviews). And these are only the cases where I wrote a comment that engaged in detail with the main point of the post; many of my other comments review specific claims and arguments within posts.
(I would be interested in quantifications of how valuable those reviews are to people other than the post authors. I’d think it is pretty low.)
I’d be very surprised if there weren’t at least some clusters of research that Rohin/Buck/etc are more optimistic about.
Yes, but they’re usually papers, not LessWrong posts, and I do give feedback to their authors—it just doesn’t happen publicly.
(And it would be maybe 10x more work to make it public, because (a) I have to now write the review to be understandable by people with wildly different backgrounds and (b) I would hold myself to a higher standard (imo correctly).)
(Indeed if you look at the reviews I linked above one common thread is that they are responding to specific people whose views I have some knowledge of, and the reviews are written with those people in mind as the audience.)
I also think “people incrementally do more review work each year as it builds momentum” is pretty realistic
I broadly agree with this and mostly feel like it is the sort of thing that is happening amongst the folks who are working on prosaic alignment.
If the highest review-voted work is controversial, I think it’s useful for the field orienting to know that it’s controversial. [...] if the alignment field is full of controversy and people who think each other are confused, I think this is a fairly reasonable fact to come out of any kind of review process
We already know this though? You have to isolate particular subclusters (Nate/Eliezer, shard theory folks, IDA/debate folks, etc) before it’s even plausible to find pieces of work that might be uncontroversial. We don’t need to go through a review effort to learn that.
(This is different from beliefs / opinions that are uncontroversial; there are lots of those.)
(And when I say that they are controversial, I mean that people will disagree significantly on whether it makes progress on alignment, or what the value of the work is; often the work will make technical claims that are uncontroversial. I do think it could be good to highlight which of the technical claims are controversial.)
I’m skeptical that the actual top-voted posts trigger this reaction.
What is “this reaction”?
If you mean “work being conceptually confused, or simply stating points rather than arguing for them, or being otherwise epistemically sketchy”, then I agree there are posts that don’t trigger this reaction (but that doesn’t seem too relevant to whether it is good to write reviews).
If you mean “reviews of these posts by a randomly selected alignment person would not be very useful”, then I do still have that reaction to every single one of those posts.