Jordan Taylor comments on Activation Pattern SVD: A proposal for SAE Interpretability

Jordan Taylor 3 Jul 2024 19:25 UTC
1 point
This seems easy to try and a potential point to iterate from, so you should give it a go. But I worry that $U$ and $V$ will be dense and very uninterpretable:
- $A$ contains no information about which actual tokens each SAE feature activated on, right? Just the token positions? So $A$ cannot distinguish between activations in completely different contexts if the same features are active at the same token positions?
- I’m not sure why you expect $A$ to have low-rank structure. Being low-rank is often in tension with being sparse, and we know that $A$ is a very sparse matrix.
- Perhaps it would be better to exploit the fact that $A$ is a very sparse matrix with non-negative entries? Maybe permutation matrices or sparse matrices would be more apt than general orthogonal matrices (which can have negative entries)? (Then you might have to settle for something like a block-diagonal central matrix, rather than a diagonal matrix of singular values.)
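To make the sparsity worry concrete, here is a minimal sketch on a purely random stand-in for $A$. The shape (features by token positions), the 2% density, and the uniform activation values are all assumptions of mine, not anything from the post, and real SAE activations may well have structure that a random matrix lacks. It only illustrates the generic tension: the SVD of a sparse non-negative matrix tends to give dense, signed $U$ and $V$ and a slowly decaying spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for A: rows = SAE features, columns = token positions.
# Sizes and density are illustrative assumptions, not values from the post.
n_features, n_positions, density = 200, 300, 0.02
mask = rng.random((n_features, n_positions)) < density
A = mask * rng.random((n_features, n_positions))  # sparse, non-negative

# Plain SVD: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)


def frac_nonzero(M, tol=1e-8):
    """Fraction of entries whose magnitude exceeds tol."""
    return float(np.mean(np.abs(M) > tol))


# Smallest rank whose singular values capture 90% of the squared Frobenius norm.
energy = np.cumsum(S**2) / np.sum(S**2)
rank_90 = int(np.searchsorted(energy, 0.9) + 1)

print("nonzero fraction of A: ", frac_nonzero(A))
print("nonzero fraction of U: ", frac_nonzero(U))
print("nonzero fraction of Vt:", frac_nonzero(Vt))
print("fraction of negative entries in U:", float(np.mean(U < 0)))
print("rank needed for 90% of the energy:", rank_90, "of", len(S))
```

If the real $A$ shows a much faster spectral decay than this random baseline, that would be some evidence of the low-rank structure the proposal hopes for. A non-negative or sparse factorization would be the natural thing to compare against next.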
I’m keen to see stuff in this direction though! I certainly think you could construct some matrix or tensor of SAE activations such that some decomposition of it is interpretable in an interesting way.