the best vector for probing is not the best vector for steering
I don’t understand this. If a feature is represented by a direction $\vec{v}$ in the activations, surely the best probe for that feature will also be $\vec{v}$, because then $\langle \vec{v}, \vec{v} \rangle$ is maximized.
AKA the predict/control discrepancy, from Section 3.3.1 of Wattenberg and Viégas, 2024.
Also related to the idea that the best linear SAE encoder is not the transpose of the decoder.
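A toy linear illustration of that encoder/decoder point (my own sketch, ignoring sparsity and overcompleteness): when the decoder’s columns are not orthogonal, reading codes off with the decoder transpose gives the wrong answer, while a different linear encoder, here the pseudoinverse, recovers them exactly.

```python
import numpy as np

# Toy sketch (mine, not from the linked post): a "decoder" D whose two
# columns are correlated feature directions in R^3. The point is only
# that with non-orthogonal columns, the transpose is a bad encoder.
D = np.array([[1.0, 0.6],
              [0.0, 0.8],
              [0.0, 0.0]])      # columns = feature directions, <d1, d2> = 0.6
f = np.array([2.0, -1.0])       # true codes
a = D @ f                       # "activation" reconstructed by the decoder

print(D.T @ a)                  # transpose-as-encoder: [1.4, 0.2], wrong
print(np.linalg.pinv(D) @ a)    # pseudoinverse encoder: [2., -1.], exact
```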
[edit: I’m now thinking that actually the optimal probe vector is also orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$, so maybe the point doesn’t stand. In general, I think it is probably a mistake to talk about activation vectors as linear combinations of feature vectors, rather than as vectors that can be projected onto a set of interpretable readoff directions. See here for more.]
Yes, I’m calling the representation vector the same as the probing vector. Suppose my activation vector can be written as $\vec{a} = \sum_i f_i \vec{v}_i$, where the $f_i$ are feature values and the $\vec{v}_i$ are feature representation vectors. Then the probe vector which minimises MSE (explains the most variance) is just $\vec{v}_i$. To avoid off-target effects, the vector $\vec{s}_i$ you want to steer with for feature $i$ might instead be the vector that is most ‘surgical’: it changes the value of this feature and no others. In that case it should be the vector that lies orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$, which is the same as $\vec{v}_i$ only if the set $\{\vec{v}_i\}$ is orthogonal.
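Here’s a small numpy sketch of both points (my own construction, not from the thread’s links): the dual-basis vectors, i.e. the rows of $\mathrm{pinv}(V)$, are exactly the vectors orthogonal to the other feature directions, so they steer one feature’s readoff without touching the rest, while steering along $\vec{v}_i$ itself bleeds into correlated features; and, per the edit above, a least-squares probe trained to predict $f_i$ also recovers this dual direction rather than $\vec{v}_i$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two non-orthogonal unit feature directions in R^3 (columns of V).
V = np.array([[1.0, 0.6],
              [0.0, 0.8],
              [0.0, 0.0]])
v0, v1 = V[:, 0], V[:, 1]

# The rows of pinv(V) are the dual vectors with <s_i, v_j> = delta_ij,
# so s_0 is orthogonal to v_1: the "surgical" steering vector for feature 0.
S = np.linalg.pinv(V)
s0 = S[0]                                 # = [1, -0.75, 0]

a = V @ np.array([0.5, 0.5])              # some activation
# Steering with v0 changes feature 1's readoff <v1, a>; steering with s0 doesn't.
print(v1 @ (a + v0) - v1 @ a)             # 0.6: off-target change (= <v1, v0>)
print(v1 @ (a + s0) - v1 @ a)             # ~0: surgical

# The edit's point: a least-squares probe for f_0 also recovers s0, not v0.
F = rng.standard_normal((10000, 2))       # uncorrelated feature values
A = F @ V.T                               # activations a = V f
probe, *_ = np.linalg.lstsq(A, F[:, 0], rcond=None)
print(probe / np.linalg.norm(probe))      # ~ s0 / ||s0||, orthogonal to v1
```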
Obviously I’m working with a non-overcomplete basis of feature representation vectors here. If we’re dealing with the overcomplete case, then it’s messier. People normally talk about ‘approximately orthogonal vectors’, in which case the most surgical steering vector $\vec{s}_i \approx \vec{v}_i$; but (handwaving) you can also talk about something like ‘approximately linearly independent vectors’, in which case I think my point stands (note that SAE decoder directions are definitely not approximately orthogonal). For something less handwavey, see this appendix.
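And a quick numerical check of the approximately-orthogonal regime (again my own sketch, using random directions rather than an actual SAE decoder): random unit vectors in high dimensions have pairwise cosines of order $1/\sqrt{d}$, and correspondingly the dual/surgical vectors come out nearly parallel to the $\vec{v}_i$ themselves, though as noted above, real SAE decoder directions need not look like random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4096, 512                     # ambient dim, number of feature directions
V = rng.standard_normal((d, n))
V /= np.linalg.norm(V, axis=0)       # unit columns

G = V.T @ V                          # Gram matrix of pairwise cosines
off = G - np.eye(n)
print(np.abs(off).max())             # max |cosine| is small, ~ a few / sqrt(d)

# Because the Gram matrix is close to the identity, the dual (surgical)
# vectors, the rows of pinv(V), are nearly parallel to the v_i:
S = np.linalg.pinv(V)
cos = np.sum(S.T * V, axis=0) / np.linalg.norm(S, axis=1)
print(cos.min())                     # close to 1: s_i ≈ v_i
```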
Makes sense—agreed!