Here is Sonnet 3.6's one-shot output (colab) and plot below. I asked for PCA for simplicity.
Looking at the PCs vs x, PC2 is kinda close to giving you x^2, but indeed this is not an especially helpful interpretation of the network.
Good post!
I played around with the x^2 example as well and got similar results. I was wondering why there are two dominant PCs: if you assume there is no bias, then the activations will all look like λ·ReLU(E) or λ·ReLU(−E), and I checked that the two directions found by the PCA approximately span the same space as ⟨ReLU(E), ReLU(−E)⟩. I suspect something similar is happening with the bias.
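To make the span claim concrete, here is a minimal numpy sketch (with a random stand-in for the learned embedding direction E, and no bias — both assumptions, not the trained model): the bias-free activations of a 1-D-input ReLU layer lie in the plane spanned by ReLU(E) and ReLU(−E), and the top two PCs recover exactly that plane:

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden = 50

# Random stand-in for the learned embedding direction E; no bias.
E = rng.normal(size=d_hidden)

# Bias-free activations for scalar inputs x:
# ReLU(E*x) = x*ReLU(E) for x > 0 and |x|*ReLU(-E) for x < 0.
xs = np.linspace(-1, 1, 201)
acts = np.maximum(E[None, :] * xs[:, None], 0.0)  # shape (201, d_hidden)

# PCA via SVD of the mean-centred activations.
centred = acts - acts.mean(axis=0)
_, S, Vt = np.linalg.svd(centred, full_matrices=False)
P = Vt[:2]  # top two principal directions (orthonormal rows)

# Check that span{PC1, PC2} = span{ReLU(E), ReLU(-E)}:
basis = np.stack([np.maximum(E, 0.0), np.maximum(-E, 0.0)])
residual = basis - (basis @ P.T) @ P  # project basis onto the PC plane
print(np.linalg.norm(residual) / np.linalg.norm(basis))  # ≈ 0: same plane
print(S[2] / S[0])  # ≈ 0: activations are exactly rank 2
```

The third singular value vanishing is the tell that the activation cloud is genuinely two-dimensional, which is why PCA reports two dominant components.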
In this specific example there is a way to get the true direction w_out from the activations: by doing a PCA on the gradients of the output with respect to the activations. In this case it is easy to see why by computing the gradients by hand: each gradient is a multiple of w_out.
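A sketch of why this works, with random stand-ins for the trained parameters (W_in, b, w_out below are not the fitted values): the output is linear in the activations, y = w_out·h, so the gradient of the output with respect to h is w_out at every input, and an uncentred PCA of the stacked gradients returns w_out up to sign and scale:

```python
import numpy as np

rng = np.random.default_rng(1)
d_hidden = 50

# Random stand-ins for the trained parameters of y = w_out . ReLU(W_in*x + b).
W_in = rng.normal(size=d_hidden)
b = rng.normal(size=d_hidden)
w_out = rng.normal(size=d_hidden)

def downstream(h):
    # Everything after the activations; here just the linear readout.
    return w_out @ h

def grad_wrt_acts(h, eps=1e-6):
    # Finite-difference gradient of the output w.r.t. the activation vector.
    g = np.zeros_like(h)
    for i in range(len(h)):
        hp, hm = h.copy(), h.copy()
        hp[i] += eps
        hm[i] -= eps
        g[i] = (downstream(hp) - downstream(hm)) / (2 * eps)
    return g

xs = np.linspace(-1, 1, 21)
acts = np.maximum(W_in[None, :] * xs[:, None] + b[None, :], 0.0)
grads = np.stack([grad_wrt_acts(h) for h in acts])

# Uncentred PCA: every gradient equals w_out, so the top singular
# vector is w_out (up to sign) and all other singular values vanish.
_, S, Vt = np.linalg.svd(grads, full_matrices=False)
cos = abs(Vt[0] @ w_out) / np.linalg.norm(w_out)
print(cos)  # ≈ 1
```

The PCA is uncentred on purpose: all the gradients point the same way, so mean-centring would leave only numerical noise.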
See the second-to-last paragraph. The gradients of downstream quantities with respect to the activations contain information and structure that is not present in the activations themselves. So in principle there could be a general way to analyse the right gradients in the right way, on top of the activations, to find the features of the model. See e.g. this for an attempt to combine PCAs of activations and gradients together.
Thanks for the reference! I wanted to illustrate the value of gradients of activations in this toy example, as I have been thinking about similar ideas.
I personally would be pretty excited about attribution dictionary learning, but it seems like nobody has done that on bigger models yet.
In my limited experience, attribution-patching-style attributions tend to be a pain to optimise for sparsity; very brittle. I agree it seems like a good thing to keep poking at, though.
Did you use something like LSAE as described here? By brittle, do you mean w.r.t. the sparsity penalty (and other hyperparameters)?
The third term in that. Though it was in a somewhat different context related to the weight partitioning project mentioned in the last paragraph, not SAE training.
Yes, brittle in hyperparameters. It was also just very painful to train in general. I wouldn’t straightforwardly extrapolate our experience to a standard SAE setup though, we had a lot of other things going on in that optimisation.
I see, thanks for sharing!