CallumMcDougall comments on Interpretability with Sparse Autoencoders (Colab exercises)

CallumMcDougall 29 Dec 2023 15:49 UTC
7 points
0
Yep, definitely! If you’re using MSE loss then it’s pretty straightforward to use backprop to see how importance relates to the loss function. Also, if you’re interested, I think Redwood’s paper on capacity (which is the same as what Anthropic calls dimensionality) looks at the derivative of the loss wrt the capacity assigned to a given feature.
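
A rough sketch of that relationship, assuming the importance-weighted MSE loss L = mean over the batch of sum_i I_i * (x_i - x_hat_i)^2. Under that assumption the derivative of the loss wrt I_i is just the batch-mean squared error on feature i, which is what backprop would return. The function names and toy numbers below are made up for illustration, and the gradient is written out by hand (with a finite-difference check) rather than taken from an autograd library.

```python
def weighted_mse(importance, x, x_hat):
    """Importance-weighted MSE: batch mean of sum_i I_i * (x_i - x_hat_i)^2."""
    total = 0.0
    for row, row_hat in zip(x, x_hat):
        total += sum(w * (a - b) ** 2 for w, a, b in zip(importance, row, row_hat))
    return total / len(x)


def grad_wrt_importance(x, x_hat):
    """Analytic dL/dI_i: the batch-mean squared error on feature i."""
    n_features = len(x[0])
    return [
        sum((row[i] - row_hat[i]) ** 2 for row, row_hat in zip(x, x_hat)) / len(x)
        for i in range(n_features)
    ]


# Toy data (hypothetical): 2 samples, 3 features, plus their reconstructions.
x = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
x_hat = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.0]]
importance = [1.0, 0.5, 0.1]

grad = grad_wrt_importance(x, x_hat)

# Finite-difference check that the analytic gradient matches the loss.
eps = 1e-6
base = weighted_mse(importance, x, x_hat)
for i in range(len(importance)):
    bumped = list(importance)
    bumped[i] += eps
    numeric = (weighted_mse(bumped, x, x_hat) - base) / eps
    assert abs(numeric - grad[i]) < 1e-4
```

In this toy data the poorly reconstructed third feature has the largest gradient, so raising its importance would raise the loss the most.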
- wassname 16 Jan 2024 0:01 UTC
  1 point
  0
  Parent
  Huh, I actually tried this: training IA3, which multiplies activations by a float, then using that float as the importance of that activation. It seems like a natural way to use backprop to learn an importance matrix, but it gave only small (1-2%) increases in accuracy. Strange.
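  
  A minimal sketch of the idea, not the actual experiment: a one-layer toy where a frozen readout is kept fixed and only a per-dimension multiplier on the activations is trained with gradient descent on MSE, after which the multiplier's magnitude is read off as an importance score. The data, `frozen_w`, `scale`, and the learning rate are all hypothetical, and real IA3 rescales activations inside a transformer rather than in a single linear readout.

  ```python
  # Toy setup (all values made up): 6 samples, 3 activation dimensions.
  activations = [
      [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
      [1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0],
  ]
  # The target depends only on dimension 0.
  targets = [2.0 * h[0] for h in activations]

  frozen_w = [1.0, 1.0, 1.0]  # frozen readout weights, never updated
  scale = [1.0, 1.0, 1.0]     # learned IA3-style multipliers, initialised to 1


  def predict(h, scale):
      """Readout on rescaled activations: sum_j w_j * s_j * h_j."""
      return sum(w * s * a for w, s, a in zip(frozen_w, scale, h))


  lr = 0.5
  for _ in range(200):
      # Gradient of the batch-mean squared error wrt each multiplier.
      grad = [0.0] * len(scale)
      for h, y in zip(activations, targets):
          err = predict(h, scale) - y
          for j in range(len(scale)):
              grad[j] += 2.0 * err * frozen_w[j] * h[j] / len(activations)
      scale = [s - lr * g for s, g in zip(scale, grad)]

  # Read the magnitude of each learned multiplier as an importance score.
  importance = [abs(s) for s in scale]
  ```

  In this toy case the multiplier on dimension 0 grows and the other two shrink towards zero, since only dimension 0 carries signal about the target.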
  
  I also tried using a VAE and introducing sparsity by tokenizing the latent space, and this seems to work. At least, probes can overfit to complex concepts using the learned tokens.
- wassname 8 Jan 2024 0:09 UTC
  1 point
  0
  Parent
  Oh, that’s very interesting. Thank you.