Thanks for the first sentence; I appreciate you clearly stating a position.
“measured over a single token the network layers will have representation rank 1”
I don’t follow this. Are you saying that the residual stream at position 0 in a transformer is a function of the first token only, or something like this?
If so, I agree—but I don’t see how this applies to much SAE[1] or mech interp[2] work. Where do we disagree?
E.g. in this post here we show in detail how an “inside a question beginning with which” SAE feature is computed from the token “which” and predicts question marks (I helped with this project but didn’t personally find this feature).
More generally, in narrow-distribution mech interp work such as the IOI paper, I don’t think it makes sense to reduce the explanation to single-token, perfect-accuracy probes, since our explanation generalises fairly well (e.g. the “Adversarial examples” in Section 4.4 that Alexandre found).
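To make concrete the reading I'm asking about above (that the residual stream at position 0 depends on the first token only), here is a toy sketch. Everything in it is an assumption for illustration: a single randomly initialised causal attention layer in NumPy, no positional embeddings or MLP, and made-up sizes. It is not any of the models discussed here. It holds the first token fixed, randomises the rest of the context, and compares the rank of the collected residuals at position 0 with those at the last position.

```python
import numpy as np

def causal_attention_layer(x, w_q, w_k, w_v, w_o):
    """Single-head causal self-attention plus residual connection.

    x has shape (seq_len, d_model). Toy layer only: no layer norm, no MLP.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    seq_len = x.shape[0]
    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return x + (weights @ v) @ w_o

# Hypothetical toy sizes and random weights (assumptions, not a real model).
rng = np.random.default_rng(0)
vocab, d_model, seq_len, n_contexts = 50, 16, 8, 20
embed = rng.normal(size=(vocab, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

first_token = 7
pos0, last = [], []
for _ in range(n_contexts):
    tokens = rng.integers(0, vocab, size=seq_len)
    tokens[0] = first_token  # same first token, random continuation
    resid = causal_attention_layer(embed[tokens], w_q, w_k, w_v, w_o)
    pos0.append(resid[0])
    last.append(resid[-1])
pos0, last = np.stack(pos0), np.stack(last)

# Position 0 never sees later tokens, so every row of pos0 is the same vector.
max_deviation = np.abs(pos0 - pos0[0]).max()
rank_pos0 = np.linalg.matrix_rank(pos0)
rank_last = np.linalg.matrix_rank(last)
print(max_deviation, rank_pos0, rank_last)
```

In this toy setup the position-0 residuals collapse to a single vector, while the last-position residuals vary with context. That contrast is why I don't see the single-token claim carrying over to features computed from earlier context.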