Neel Nanda comments on Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Neel Nanda 22 Sep 2023 16:43 UTC
LW: 8 AF: 5
Cool work! I really like the ACDC on the parenthesis feature part, I’d love to see more work like that, and work digging into exactly how things compose with each other in terms of the weights.
- Logan Riggs 22 Sep 2023 18:57 UTC
  LW: 4 AF: 2
  I’ve had trouble figuring out a weight-based approach due to the non-linearity and would appreciate your thoughts actually.
  We can learn a dictionary of features at the residual stream (R_d) & another mid-MLP (MLP_d), but you can’t straightforwardly multiply the features from R_d by W_in and find the matching features in MLP_d, because of the nonlinearity, AFAIK.
  I do think you could find residual features that are sufficient to activate the MLP features^[1], but you couldn’t recover every linear combination that does so from the weights alone.
  Using a dataset-based method, you could find causal features in practice (the ACDC portion of the paper was a first attempt at that), and I would be interested in an activation*gradient method here (though I’m largely ignorant).
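To make the activation*gradient idea concrete, here is a minimal NumPy sketch, not the paper's method. Everything in it is a toy assumption: the random matrices `D_res`, `W_in` and `f_mlp` stand in for a residual dictionary, the MLP input weights and one mid-MLP dictionary direction, the MLP uses a bias-free ReLU rather than the GELU of a real model, and the MLP feature is read out with a plain dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, n_res = 16, 64, 32  # toy sizes, chosen arbitrarily

# Hypothetical stand-ins (random, not learned):
D_res = rng.normal(size=(n_res, d_model))                    # residual dictionary, rows = feature directions
W_in = rng.normal(size=(d_model, d_mlp)) / np.sqrt(d_model)  # MLP input weights
f_mlp = rng.normal(size=d_mlp)                               # one mid-MLP dictionary direction


def mlp_feature_activation(codes):
    """Activation of the MLP feature given residual feature activations `codes`."""
    x = codes @ D_res            # reconstruct the residual stream from the dictionary
    h = x @ W_in                 # MLP pre-activations (no bias in this toy)
    return np.maximum(h, 0.0) @ f_mlp


def activation_times_gradient(codes):
    """Per-residual-feature attribution: activation * d(MLP feature)/d(activation)."""
    h = codes @ D_res @ W_in
    mask = (h > 0).astype(h.dtype)           # derivative of ReLU at the current input
    grad = D_res @ W_in @ (f_mlp * mask)     # gradient w.r.t. each residual feature activation
    return codes * grad


codes = np.maximum(rng.normal(size=n_res), 0.0)  # non-negative, partly zero, like SAE activations
attr = activation_times_gradient(codes)
top = np.argsort(-np.abs(attr))[:5]              # residual features with the largest attribution
print("top residual features:", top)
```

In this bias-free ReLU toy the attributions sum exactly to the MLP feature's activation, because the function is piecewise linear through the origin. With GELU or a bias term that no longer holds, and the scores are only a first-order approximation of each feature's effect.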
  1. ^ Specifically, I think you should scale the residual stream activations by their in-distribution max-activating examples.
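As a rough illustration of the footnote, the sketch below scales each residual feature direction by its maximum activation over a dataset, pushes it through W_in and the nonlinearity on its own, and checks how strongly it drives one MLP feature. All of it is assumed for illustration: the dictionaries and weights are random stand-ins, `codes` is a fake dataset of feature activations, and the nonlinearity is a bias-free ReLU. The last lines show why results for single features do not simply add up for combinations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, n_res, n_tokens = 16, 64, 32, 500  # toy sizes, chosen arbitrarily

# Hypothetical stand-ins (random, not learned):
D_res = rng.normal(size=(n_res, d_model))                    # residual dictionary, rows = feature directions
W_in = rng.normal(size=(d_model, d_mlp)) / np.sqrt(d_model)  # MLP input weights
f_mlp = rng.normal(size=d_mlp)                               # one mid-MLP dictionary direction

# Fake "in-distribution" activations of each residual feature over a dataset.
codes = np.maximum(rng.normal(size=(n_tokens, n_res)), 0.0)
max_act = codes.max(axis=0)                                  # max activation per residual feature


def mlp_feature(x):
    """MLP feature activation for residual input(s) x (bias-free ReLU toy)."""
    return np.maximum(x @ W_in, 0.0) @ f_mlp


# Each residual feature alone, at its largest in-distribution magnitude.
scaled_dirs = max_act[:, None] * D_res
solo = mlp_feature(scaled_dirs)                              # shape (n_res,)
candidates = np.argsort(-solo)[:5]                           # features that most activate the MLP feature alone

# Nonlinearity check: two features together vs. the sum of each alone.
i, j = candidates[:2]
pair = mlp_feature(scaled_dirs[i] + scaled_dirs[j])
gap = pair - (solo[i] + solo[j])
print("candidates:", candidates, "non-additivity gap:", gap)
```

A nonzero `gap` is the obstacle described above: a feature that looks sufficient on its own says little about how combinations behave, so the weights alone do not settle which residual features matter in context.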