the best vector for probing is not the best vector for steering
I don’t understand this. If a feature is represented by a direction $\vec{v}$ in the activations, surely the best probe for that feature will also be $\vec{v}$, because then $\langle \vec{v}, \vec{v} \rangle$ is maximized.
AKA the predict/control discrepancy, from Section 3.3.1 of Wattenberg and Viégas, 2024.
Also related to the idea that the best linear SAE encoder is not the transpose of the decoder.
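A toy linear illustration of that encoder/decoder point (my own sketch, ignoring sparsity and overcompleteness): when the decoder’s columns are not orthogonal, reading codes off with the decoder transpose gives the wrong answer, while a different linear encoder, here the pseudoinverse, recovers them exactly.

```python
import numpy as np

# Toy sketch (mine, not from the linked post): a "decoder" D whose two
# columns are correlated feature directions in R^3. The point is only
# that with non-orthogonal columns, the transpose is a bad encoder.
D = np.array([[1.0, 0.6],
              [0.0, 0.8],
              [0.0, 0.0]])      # columns = feature directions, <d1, d2> = 0.6
f = np.array([2.0, -1.0])       # true codes
a = D @ f                       # "activation" reconstructed by the decoder

print(D.T @ a)                  # transpose-as-encoder: [1.4, 0.2], wrong
print(np.linalg.pinv(D) @ a)    # pseudoinverse encoder: [2., -1.], exact
```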
[edit: I’m now thinking that actually the optimal probe vector is also orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$, so maybe the point doesn’t stand. In general, I think it is probably a mistake to talk about activation vectors as linear combinations of feature vectors, rather than as vectors that can be projected onto a set of interpretable readoff directions. See here for more.]
Yes, I’m calling the representation vector the same as the probing vector. Suppose my activation vector can be written as $\vec{a} = \sum_i f_i \vec{v}_i$, where the $f_i$ are feature values and the $\vec{v}_i$ are feature representation vectors. Then the probe vector which minimises MSE (explains the most variance) is just $\vec{v}_i$. To avoid off-target effects, the vector $\vec{s}_i$ you want to steer with for feature $i$ might instead be the vector that is most ‘surgical’: it changes the value of this feature and no others. In that case it should be the vector that lies orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$, which is the same as $\vec{v}_i$ only if the set $\{\vec{v}_i\}$ is orthogonal.
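Here’s a small numpy sketch of both points (my own construction, not from the thread’s links): the dual-basis vectors, i.e. the rows of $\mathrm{pinv}(V)$, are exactly the vectors orthogonal to the other feature directions, so they steer one feature’s readoff without touching the rest, while steering along $\vec{v}_i$ itself bleeds into correlated features; and, per the edit above, a least-squares probe trained to predict $f_i$ also recovers this dual direction rather than $\vec{v}_i$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two non-orthogonal unit feature directions in R^3 (columns of V).
V = np.array([[1.0, 0.6],
              [0.0, 0.8],
              [0.0, 0.0]])
v0, v1 = V[:, 0], V[:, 1]

# The rows of pinv(V) are the dual vectors with <s_i, v_j> = delta_ij,
# so s_0 is orthogonal to v_1: the "surgical" steering vector for feature 0.
S = np.linalg.pinv(V)
s0 = S[0]                                 # = [1, -0.75, 0]

a = V @ np.array([0.5, 0.5])              # some activation
# Steering with v0 changes feature 1's readoff <v1, a>; steering with s0 doesn't.
print(v1 @ (a + v0) - v1 @ a)             # 0.6: off-target change (= <v1, v0>)
print(v1 @ (a + s0) - v1 @ a)             # ~0: surgical

# The edit's point: a least-squares probe for f_0 also recovers s0, not v0.
F = rng.standard_normal((10000, 2))       # uncorrelated feature values
A = F @ V.T                               # activations a = V f
probe, *_ = np.linalg.lstsq(A, F[:, 0], rcond=None)
print(probe / np.linalg.norm(probe))      # ~ s0 / ||s0||, orthogonal to v1
```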
Obviously I’m working with a non-overcomplete basis of feature representation vectors here. If we’re dealing with the overcomplete case, then it’s messier. People normally talk about ‘approximately orthogonal vectors’, in which case the most surgical steering vector $\vec{s}_i \approx \vec{v}_i$; but (handwaving) you can also talk about something like ‘approximately linearly independent vectors’, in which case I think my point stands (note that SAE decoder directions are definitely not approximately orthogonal). For something less handwavey, see this appendix.
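And a quick numerical check of the approximately-orthogonal regime (again my own sketch, using random directions rather than an actual SAE decoder): random unit vectors in high dimensions have pairwise cosines of order $1/\sqrt{d}$, and correspondingly the dual/surgical vectors come out nearly parallel to the $\vec{v}_i$ themselves, though as noted above, real SAE decoder directions need not look like random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4096, 512                     # ambient dim, number of feature directions
V = rng.standard_normal((d, n))
V /= np.linalg.norm(V, axis=0)       # unit columns

G = V.T @ V                          # Gram matrix of pairwise cosines
off = G - np.eye(n)
print(np.abs(off).max())             # max |cosine| is small, ~ a few / sqrt(d)

# Because the Gram matrix is close to the identity, the dual (surgical)
# vectors, the rows of pinv(V), are nearly parallel to the v_i:
S = np.linalg.pinv(V)
cos = np.sum(S.T * V, axis=0) / np.linalg.norm(S, axis=1)
print(cos.min())                     # close to 1: s_i ≈ v_i
```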
Makes sense—agreed!