Arthur Conmy comments on Transcoders enable fine-grained interpretable circuit analysis for language models

Arthur Conmy 17 May 2024 23:33 UTC
LW: 3 AF: 1
"they [transcoders] take as input the pre-MLP activations, and then aim to represent the post-MLP activations of that MLP sublayer"
I assumed this meant the activations just before and just after the GELU, but looking at the code I think I was wrong. Could you rephrase it to something like:
"they take as input MLP block inputs (just after LayerNorm) and they output MLP block outputs (what is added to the residual stream)"
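
To make the two readings concrete, here is a minimal sketch of a pre-LayerNorm MLP block (GPT-2 style is an assumption here) that labels each activation. The names `resid_mid`, `mlp_in`, `pre_gelu`, `post_gelu`, `mlp_out` and `resid_post` are illustrative labels of mine, not identifiers from the transcoder code, and the LayerNorm omits its learned scale and bias for brevity. Under the reading I am suggesting, the transcoder maps `mlp_in` to `mlp_out`, not `pre_gelu` to `post_gelu`.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(W, x, b):
    """Compute W @ x + b, where W is a list of rows with shape (out, in)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def mlp_block(resid_mid, W_in, b_in, W_out, b_out):
    """Run one MLP sublayer and return every intermediate activation by name."""
    acts = {}
    acts["mlp_in"] = layer_norm(resid_mid)                     # suggested transcoder input
    acts["pre_gelu"] = matvec(W_in, acts["mlp_in"], b_in)      # what I first assumed was the input
    acts["post_gelu"] = [gelu(v) for v in acts["pre_gelu"]]    # what I first assumed was the target
    acts["mlp_out"] = matvec(W_out, acts["post_gelu"], b_out)  # suggested transcoder target
    # The MLP output is what gets added back into the residual stream.
    acts["resid_post"] = [r + m for r, m in zip(resid_mid, acts["mlp_out"])]
    return acts


def rand_matrix(rows, cols, rng):
    """Small random weight matrix, purely for illustration."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_mlp = 4, 16  # toy sizes, not taken from any real model
W_in, b_in = rand_matrix(d_mlp, d_model, rng), [0.0] * d_mlp
W_out, b_out = rand_matrix(d_model, d_mlp, rng), [0.0] * d_model
resid_mid = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

acts = mlp_block(resid_mid, W_in, b_in, W_out, b_out)
for name, value in acts.items():
    print(f"{name}: dimension {len(value)}")
```

Note that `mlp_in` and `mlp_out` both have dimension `d_model`, whereas `pre_gelu` and `post_gelu` have dimension `d_mlp`, so the shapes of the transcoder's input and output already distinguish the two readings.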