Do you intend for the comments section to be a public forum on the papers you collect?
I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.
They do not seem to claim “changing facts in a generalizable way” (it’s likely not robust to synonyms at all). I am also wary of “editing just one MLP for a given fact” being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers. I’ll point to a writeup by Thibodeau et al. once it’s out.
That being said, if you are into interpretability, you have to at least skim the paper. It has a whole bunch of very cool stuff in it, from the causal tracing to testing whether making Einstein a physician changes the meaning of the word “physics” itself. Just don’t overfit on the methods there being exactly the methods that will solve interpretability of reasoning in transformers.
I welcome any discussion of the linked papers in the comments section.
I agree that the ROME edit method itself isn’t directly that useful. I think it matters more as a validation of how the ROME authors interpreted the structure / functions of the MLP layers.
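For intuition on that interpretation: ROME treats an MLP projection matrix as an associative key→value memory, and the edit is a rank-one update that rewrites the value stored at one key. Here is a minimal sketch of that idea (heavily simplified: the paper uses a covariance-weighted update, and the key/value vectors are derived from the model rather than random stand-ins as below):

```python
import numpy as np

# Toy rank-one "fact edit" on a single MLP weight matrix, in the spirit of ROME.
# All vectors here are random placeholders, not actual model activations.
rng = np.random.default_rng(0)
d_in, d_out = 8, 8
W = rng.normal(size=(d_out, d_in))   # MLP projection, viewed as a key->value memory
k = rng.normal(size=d_in)            # key vector selecting the subject
v_new = rng.normal(size=d_out)       # desired new value (the edited "fact")

# Rank-one update so that W_new @ k == v_new exactly, changing W only
# along the direction of k.
W_new = W + np.outer(v_new - W @ k, k) / (k @ k)

assert np.allclose(W_new @ k, v_new)
```

The update only guarantees the new output for that exact key, which is one way to see why such an edit need not be robust to synonyms or paraphrases of the prompt.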
Which writeup is this? Have a link?
Hey, I’ve finally written it up here.