Arthur Conmy comments on IAPS: Mapping Technical Safety Research at AI Companies

Arthur Conmy 2 Nov 2024 11:32 UTC
6 points
0
- Here are the other GDM mech interp papers missed:
- We have some blog posts of comparable standard to the Anthropic circuit updates listed:
  - https://www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg/full-post-progress-update-1-from-the-gdm-mech-interp-team
  - https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
- You use a very wide scope for the “enhancing human feedback” category (basically any post-training paper mentioning ‘align’-ing anything). So I will use a wide scope for what counts as mech interp and also include:
  - https://arxiv.org/abs/2401.06102
  - https://arxiv.org/abs/2304.14767
  - There are a few other papers from the PAIR group, Mor Geva, and Been Kim, but these mostly have Google Research affiliations, so it seems fine not to include them: IIRC you weren’t counting pre-GDM-merger Google Research/Brain work.
- Oscar 5 Nov 2024 11:39 UTC
  3 points
  0
  Thanks for that list of papers/posts. Most of the papers you linked are not included because they did not turn up under either of our search strategies: (1) the title contains one of the specific keywords we searched for on arXiv; (2) the paper is linked on the company’s website. I agree this is a limitation of our methodology. We won’t add these papers now, as that would be somewhat ad hoc and inconsistent between the companies.
  Re the blog posts from Anthropic and what counts as a paper, I agree this is a tricky demarcation problem. We included the ‘Circuit Updates’ because they were linked to as ‘papers’ on the Anthropic website. Even if GDM has a higher bar for what counts as a ‘paper’ than Anthropic, I don’t think we really want to be adjudicating this, so I feel comfortable just deferring to each company about what counts as a paper for them.
  - nc 5 Nov 2024 11:49 UTC
    1 point
    0
    I would have found it helpful in your report for there to be a ROSES-type diagram or other flowchart showing the steps in your paper collation. This would bring it closer in line with other scoping reviews and would have made it easier to understand your methodology.