Thanks, that makes a lot of sense. I had skimmed the Anthropic paper and saw how importance was used, but not where it comes from.
If it’s the importance to the loss, then theoretically you could derive it using backprop, I guess? E.g. the accumulated gradient with respect to your activations, over a few batches.
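Something like this is what I have in mind (a rough, untested sketch; `model`, `layer`, `loader`, and `loss_fn` are all placeholders for whatever setup you're using):

```python
# Rough sketch: accumulate |dL/d(activation)| over a few batches and treat the
# per-dimension average as an importance score for that layer's activations.
import torch

def activation_importance(model, layer, loader, loss_fn, n_batches=8):
    acts = {}

    def hook(_module, _inp, out):
        out.retain_grad()        # keep the gradient on this intermediate tensor
        acts["out"] = out

    handle = layer.register_forward_hook(hook)
    importance = None

    for i, (x, y) in enumerate(loader):
        if i == n_batches:
            break
        model.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        g = acts["out"].grad.abs()         # |dL/da| for this batch
        g = g.flatten(0, -2).mean(dim=0)   # average over batch (and positions)
        importance = g if importance is None else importance + g

    handle.remove()
    return importance / n_batches
```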
Yep, definitely! If you’re using MSE loss, it’s pretty straightforward to use backprop to see how importance relates to the loss function. Also, if you’re interested, I think Redwood’s paper on capacity (which is the same thing Anthropic calls dimensionality) looks at the derivative of the loss wrt the capacity assigned to a given feature.
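Concretely, with the importance-weighted MSE from the toy-models setup (roughly, in my notation):

$$
L \;=\; \sum_x \sum_i I_i \,\big(x_i - x'_i\big)^2
\qquad\Rightarrow\qquad
\frac{\partial L}{\partial x'_i} \;=\; -2\, I_i\,\big(x_i - x'_i\big),
$$

so the gradient on each reconstructed feature carries a factor of its importance $I_i$, which is why backprop gives you a natural handle on it.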
Huh, I actually tried this: training IA3, which multiplies each activation by a learned scalar, then using that scalar as the importance of the activation. It seems like a natural way to use backprop to learn an importance matrix, but it only gave small (1-2%) increases in accuracy. Strange.
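To be concrete, the basic shape of the idea is something like this (a simplified sketch, not my exact setup; real IA3 rescales keys, values, and FFN activations):

```python
# Simplified sketch of the IA3-style idea: a learned per-dimension scale on an
# activation vector, whose magnitude is then read off as an "importance" score.
import torch
import torch.nn as nn

class IA3Scale(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # initialised at 1.0 so training starts from the unscaled model
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, activations):
        return activations * self.scale

# After fine-tuning only the scale vectors (base model frozen),
# treat the learned scales as per-activation importance:
#   importance = ia3.scale.detach().abs()
```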
I also tried using a VAE and introducing sparsity by tokenizing the latent space, and this seems to work. At least, probes can overfit to complex concepts using the learned tokens.
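To clarify what I mean by tokenizing, it's something along these lines (a minimal sketch of one way to do it, assuming a VQ-style nearest-codebook lookup; names and sizes are placeholders):

```python
# Minimal sketch: discretize the VAE latent with a nearest-codebook lookup,
# so each latent vector becomes a discrete token id.
import torch
import torch.nn as nn

class TokenizedLatent(nn.Module):
    def __init__(self, latent_dim, codebook_size=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, z):
        # z: [batch, latent_dim] from the VAE encoder
        dists = torch.cdist(z, self.codebook.weight)  # [batch, codebook_size]
        tokens = dists.argmin(dim=-1)                 # discrete token ids
        z_q = self.codebook(tokens)                   # quantized latent
        # straight-through estimator so gradients still reach the encoder
        z_q = z + (z_q - z).detach()
        return z_q, tokens

# A probe can then be a simple linear classifier on the token one-hots
# (or on z_q) for whatever concept you're trying to recover.
```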
Oh, that’s very interesting, thank you.