I expect to have more detailed thoughts worth sharing as I spend more time with this content, but one thing stands out immediately: this is, head and shoulders, the best language model interpretability work to date. I'm impressed by the thoroughness of the theory combined with detailed real examples.
This also seems like good motivation to go back and study layer reordering (à la Sandwich Transformers) as a treatment affecting the induced circuits of a model.
(h/t Kevin Wang for pointing out the sandwich transformer paper to me recently)