Do you intend for the comments section to be a public forum on the papers you collect?
I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.
They do not seem to claim “changing facts in a generalizable way” (it’s likely not robust to synonyms at all). I am also wary of “editing just one MLP for a given fact” being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers. I’ll point to a writeup by Thibodeau et al. once it’s out.
That being said, if you are into interpretability, you have to at least skim the paper. It has a whole bunch of very cool stuff in it, from the causal tracing to testing whether making Einstein a physician changes the meaning of the word “physics” itself. Just don’t overfit on the methods there being exactly the methods that will solve interpretability of reasoning in transformers.
I welcome any discussion of the linked papers in the comments section.
I agree that the ROME edit method itself isn’t directly that useful. I think it matters more as a validation of how the ROME authors interpreted the structure / functions of the MLP layers.
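For intuition on that interpretation: ROME treats an MLP projection matrix as an associative key→value memory, and the edit is a rank-one update that rewrites the value stored at one key. Here is a minimal sketch of that idea (heavily simplified: the paper uses a covariance-weighted update, and the key/value vectors are derived from the model rather than random stand-ins as below):

```python
import numpy as np

# Toy rank-one "fact edit" on a single MLP weight matrix, in the spirit of ROME.
# All vectors here are random placeholders, not actual model activations.
rng = np.random.default_rng(0)
d_in, d_out = 8, 8
W = rng.normal(size=(d_out, d_in))   # MLP projection, viewed as a key->value memory
k = rng.normal(size=d_in)            # key vector selecting the subject
v_new = rng.normal(size=d_out)       # desired new value (the edited "fact")

# Rank-one update so that W_new @ k == v_new exactly, changing W only
# along the direction of k.
W_new = W + np.outer(v_new - W @ k, k) / (k @ k)

assert np.allclose(W_new @ k, v_new)
```

The update only guarantees the new output for that exact key, which is one way to see why such an edit need not be robust to synonyms or paraphrases of the prompt.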
Which writeup is this? Have a link?
Hey, I’ve finally written it up here.