Have you looked at how the dictionaries represent positional information? I worry that the SAEs will learn semi-local codes that intermix positional and semantic information in a way that makes things less interpretable.
To investigate this, one could take each feature and calculate the fraction of variance in its activations that is explained by position. If this variance-explained is either ~0% or ~100% for every feature, I'd be satisfied that positional and semantic information are being well separated into separate features.
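A minimal sketch of this check, using hypothetical activations rather than a real SAE (the array shapes and the two toy features are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_seqs, n_pos = 200, 128

# Hypothetical feature activations: rows are sequences, columns are positions.
# A "positional" feature whose activation depends almost entirely on position:
positional = np.sin(np.arange(n_pos) / 10) + 0.05 * rng.standard_normal((n_seqs, n_pos))
# A "semantic" feature whose activation is independent of position:
semantic = rng.standard_normal((n_seqs, n_pos))

def variance_explained_by_position(acts):
    """Fraction of activation variance explained by position alone:
    the R^2 of a predictor that outputs the per-position mean activation."""
    per_pos_mean = acts.mean(axis=0)   # best position-only predictor
    residual = acts - per_pos_mean     # what position cannot explain
    return 1 - residual.var() / acts.var()

print(variance_explained_by_position(positional))  # close to 1
print(variance_explained_by_position(semantic))    # close to 0
```

Features scoring near 0 or 1 are cleanly semantic or cleanly positional; scores in between would indicate the intermixed codes I'm worried about.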
In general, I think it makes sense to special-case positional information. Even if positional information is well separated, I expect that converting it into SAE features hurts interpretability. This is easy to do in shortformers[1] or rotary models (since positional information isn't added to the residual stream); one would have to work a bit harder for GPT-2, but it still seems worthwhile imo.
We might also expect these circuits to take into account relative position rather than absolute position, especially when using sinusoidal rather than learned positional encodings.
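One concrete reason to expect this with sinusoidal encodings: the inner product of two standard sinusoidal positional encodings depends only on the offset between the positions, since sin(iω)sin(jω) + cos(iω)cos(jω) = cos((i−j)ω). A quick numerical check (the `d_model` and positions are arbitrary choices):

```python
import numpy as np

def sinusoidal_pe(pos, d_model=64):
    """Standard sinusoidal positional encoding (Vaswani et al. 2017)."""
    k = np.arange(d_model // 2)
    freqs = 1.0 / 10000 ** (2 * k / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

# Both pairs below have relative offset 3, so their inner products agree,
# even though the absolute positions differ.
a = sinusoidal_pe(10) @ sinusoidal_pe(13)
b = sinusoidal_pe(50) @ sinusoidal_pe(53)
print(np.isclose(a, b))  # True
```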
An interesting approach would be to encode the key and query values in a way that deliberately removes positional dependence (for example, run the base model twice with randomly offset positional encodings and train the key/query values to approximate one encoding from the other), then incorporate a relative positional dependence into the learned large QK pair dictionary.
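One way the relative-position term might slot into such a QK dictionary, sketched with random stand-ins for the learned parts (the dictionary matrices, the T5-style bias table, and all dimensions here are hypothetical, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n_feats, d, max_rel = 8, 16, 32

# Hypothetical learned dictionaries of query and key feature directions.
Q_dict = rng.standard_normal((n_feats, d))
K_dict = rng.standard_normal((n_feats, d))
# Hypothetical learned relative-position bias per feature pair, so the
# semantic dictionaries never need to encode absolute position.
rel_bias = 0.1 * rng.standard_normal((n_feats, n_feats, max_rel))

def qk_score(q_coeffs, k_coeffs, rel_offset):
    """Attention score = position-free semantic term from the sparse
    feature coefficients, plus a relative-position bias term."""
    semantic = (q_coeffs @ Q_dict) @ (k_coeffs @ K_dict)
    off = min(abs(rel_offset), max_rel - 1)  # clip offset to the bias table
    positional = q_coeffs @ rel_bias[:, :, off] @ k_coeffs
    return semantic + positional

# One-hot coefficients: score between query feature 0 and key feature 1.
print(qk_score(np.eye(n_feats)[0], np.eye(n_feats)[1], rel_offset=5))
```

The point of the split is that interpreting a feature pair's interaction then factors into a semantic part and an explicit positional part, rather than both being tangled into the dictionary directions.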
[1] Position embeddings are trained but only added to the key and query calculation; see Section 5 of this paper.