Arthur Conmy comments on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Arthur Conmy 2 Feb 2024 16:35 UTC
Thanks for the first sentence. I appreciate you stating a position clearly.
“measured over a single token the network layers will have representation rank 1”
I don’t follow this. Are you saying that the residual stream at position 0 in a transformer is a function of the first token only, or something like this?
If so, I agree—but I don’t see how this applies to much SAE^[1] or mech interp^[2] work. Where do we disagree?
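For concreteness, here is a minimal sketch of the position-0 claim: with a causal attention mask, the residual stream at position 0 depends only on the first token. This is a toy single-head layer in pure Python, not GPT2-Small; the function `causal_attention`, the random weights, and the dimension `d = 4` are all illustrative assumptions. Layer norm and MLPs are omitted because they act on each position separately and so would not change the conclusion.

```python
import math
import random


def causal_attention(x, wq, wk, wv):
    """Toy single-head causal self-attention with a residual connection.

    x is a list of d-dimensional token vectors. Position i can only
    attend to positions 0..i (the causal mask).
    """
    def matvec(w, v):
        return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

    q = [matvec(wq, t) for t in x]
    k = [matvec(wk, t) for t in x]
    v = [matvec(wv, t) for t in x]
    d = len(x[0])
    out = []
    for i in range(len(x)):
        # Causal mask: only keys at positions <= i are visible to query i.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        attended = [
            sum(weights[j] / z * v[j][c] for j in range(i + 1))
            for c in range(d)
        ]
        # Residual stream update: layer input plus attention output.
        out.append([xi + ai for xi, ai in zip(x[i], attended)])
    return out


rng = random.Random(0)
d = 4  # hypothetical toy dimension


def rand_vector():
    return [rng.gauss(0, 1) for _ in range(d)]


def rand_matrix():
    return [rand_vector() for _ in range(d)]


wq, wk, wv = rand_matrix(), rand_matrix(), rand_matrix()

# Two sequences that share the first token but differ everywhere else.
first = rand_vector()
seq_a = [first] + [rand_vector() for _ in range(3)]
seq_b = [first] + [rand_vector() for _ in range(3)]

out_a = causal_attention(seq_a, wq, wk, wv)
out_b = causal_attention(seq_b, wq, wk, wv)

same_at_pos0 = out_a[0] == out_b[0]      # position 0 sees only the first token
differs_at_pos1 = out_a[1] != out_b[1]   # position 1 sees the differing token
print(same_at_pos0, differs_at_pos1)
```

At later positions the output depends on the whole prefix, which is the regime that most SAE and mech interp analyses look at.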
1. E.g. in this post we show in detail how an “inside a question beginning with which” SAE feature is computed from the token “which” and predicts question marks (I helped with this project but didn’t personally find this feature).
2. More generally, in narrow-distribution mech interp work such as the IOI paper, I don’t think it makes sense to reduce the explanation to single-token perfect-accuracy probes, since our explanation generalises fairly well (e.g. the “Adversarial examples” in Section 4.4 that Alexandre found).