Good question! In the first batch of exercises (replicating toy models of interp), we play around with different importances. There are some interesting findings here (e.g. when you decrease sparsity to the point where you no longer represent all features, it’s usually the lower-importance features which collapse first). I chose not to have the SAE exercises use varying importance, although it would be interesting to play around with this and see what you get!
As for what importance represents, it’s basically a proxy for “how much a certain feature reduces loss, when it actually is present.” This can be independent from feature probability. Anthropic included it in their toy models paper in order to make those models truer to reality, in the hope that the setup could tell us more interesting lessons about actual models. From the TMS paper:
Not all features are equally useful to a given task. Some can reduce the loss more than others. For an ImageNet model, where classifying different species of dogs is a central task, a floppy ear detector might be one of the most important features it can have. In contrast, another feature might only very slightly improve performance.
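Concretely, in the toy-models setup importance enters as a per-feature weight on the reconstruction loss, so dropping a high-importance feature costs more than dropping a low-importance one. A minimal sketch (the feature count and the 0.7 decay rate are arbitrary illustrative choices, not values from the paper):

```python
import numpy as np

n_features = 5
# Geometrically decaying importances, in the spirit of the toy-models setup
importance = 0.7 ** np.arange(n_features)

def weighted_mse(x, x_hat):
    # Importance-weighted reconstruction loss: sum_i I_i * (x_i - x_hat_i)^2
    return np.sum(importance * (x - x_hat) ** 2, axis=-1).mean()

x = np.ones((1, n_features))
# Failing to reconstruct the most important feature...
loss_drop_first = weighted_mse(x, np.array([[0.0, 1.0, 1.0, 1.0, 1.0]]))
# ...costs more than failing to reconstruct the least important one
loss_drop_last = weighted_mse(x, np.array([[1.0, 1.0, 1.0, 1.0, 0.0]]))
```

This weighting is what drives the collapse behaviour mentioned above: when capacity runs out, sacrificing low-importance features is the cheapest option.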
If we’re talking features in language models, then importance would be “average amount that this feature reduces cross entropy loss”. I open-sourced an SAE visualiser which you can find here. You can navigate through it and look at the effect of features on loss. It doesn’t actually show the “overall importance” of a feature, but you should be able to get an idea of the kinds of situations where a feature is super loss-reducing and when it isn’t. Example of a highly loss-reducing feature: feature #8, which fires on Django syntax and strongly predicts the “django” token. This seems highly loss-reducing because (although sparse) it’s very often correct when it fires with high magnitude. On the other hand, feature #7 seems less loss-reducing, because a lot of the time it’s pushing for something incorrect (maybe there exist other features which balance it out).
Thanks, that makes a lot of sense. I had skimmed the Anthropic paper and saw how it was used, but not where it comes from.
If it’s the importance to the loss, then theoretically you could derive one using backprop, I guess? E.g. the accumulated gradient with respect to your activations, over a few batches.
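One way that idea could be cashed out (a sketch of the suggestion, not something from the papers): accumulate the mean absolute gradient of the loss with respect to each activation dimension over a few batches, and read the per-dimension average off as an importance score. The frozen readout matrix, data, and sizes below are all made up for illustration; the cross-entropy gradient is computed by hand so the example only needs NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
d_act, n_classes, batch = 16, 4, 32
W = rng.normal(size=(d_act, n_classes))  # frozen readout, standing in for the rest of the model

grad_accum = np.zeros(d_act)
n_batches = 8
for _ in range(n_batches):
    acts = rng.normal(size=(batch, d_act))          # activations for one batch
    labels = rng.integers(0, n_classes, size=batch)
    logits = acts @ W
    # Cross-entropy gradient: dL/dlogits = softmax(logits) - onehot(labels)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(batch), labels] -= 1.0
    grad_acts = probs @ W.T / batch                 # dL/d(acts), averaged over the batch
    # Accumulate the mean absolute gradient per activation dimension
    grad_accum += np.abs(grad_acts).mean(axis=0)

importance = grad_accum / n_batches                 # one score per activation dimension
```

Note this measures sensitivity at the current point, not the full counterfactual effect of removing a feature, so it's only a first-order approximation of "how much this direction reduces loss".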
Yep, definitely! If you’re using MSE loss then it’s pretty straightforward to use backprop to see how importance relates to the loss function. Also, if you’re interested, I think Redwood’s paper on capacity (which is the same as what Anthropic calls dimensionality) looks at the derivative of loss with respect to the capacity assigned to a given feature.
Huh, I actually tried this: training IA3, which multiplies each activation by a learned float, then using that float as the importance of that activation. It seems like a natural way to use backprop to learn an importance matrix, but it only gave small (1-2%) increases in accuracy. Strange.
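As I understand the setup being described: an IA3-style per-dimension scale is trained jointly with the task, and the learned scales are read off as importances. A rough sketch with a placeholder linear readout and synthetic data (everything here is invented for illustration; gradients are written out by hand so it only needs NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 8
X = rng.normal(size=(n, d))
y = X[:, 0] + 2 * X[:, 1]           # target depends only on the first two dims

scale = np.ones(d)                   # IA3-style per-activation multiplier, init at 1
w = rng.normal(size=d) * 0.1         # linear readout
lr = 0.05
for _ in range(300):
    pred = (X * scale) @ w           # rescale activations, then read out
    r = pred - y
    grad_w = 2 * (X * scale).T @ r / n
    grad_s = 2 * (X * w).T @ r / n   # gradient w.r.t. the per-dimension scales
    w -= lr * grad_w
    scale -= lr * grad_s

importance = np.abs(scale)           # read the learned scales off as importances
```

One caveat that might explain the "strange" result: when the downstream weights are also trainable, the scale is not identifiable (any rescaling of `scale` can be absorbed into `w`), so the learned floats need not settle at interpretable importance values.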
I also tried using a VAE, introducing sparsity by tokenizing the latent space. And this seems to work: at least, probes can overfit to complex concepts using the learned tokens.
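I'm guessing at the mechanism here, but "tokenizing the latent space" sounds like vector quantization: snap each continuous latent to its nearest entry in a codebook, so only discrete codes survive. A sketch of that lookup step with made-up sizes (in a real VQ-VAE the codebook is learned; here it's random):

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, n_codes = 4, 16
codebook = rng.normal(size=(n_codes, d_latent))   # learned in a real VQ-VAE; random here

def tokenize(z):
    # Snap each latent vector to the index of its nearest codebook entry
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)                    # discrete token ids, shape (batch,)

z = rng.normal(size=(8, d_latent))                 # continuous VAE latents
tokens = tokenize(z)
quantized = codebook[tokens]                       # what the decoder would see
```

The discrete tokens are then what a probe would train on.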
Oh, that’s very interesting, thank you.