Activation Pattern SVD: A proposal for SAE Interpretability
Epistemic status: This is a rough-draft write-up about a thought experiment I did. Reasonably confident about the broad arguments being made here. That said, I haven’t spent a lot of time rigorously polishing or reviewing my writing, so minor inaccuracies may be present.
Interpretability Illusions from Max-Activating Examples
When interpreting an SAE feature, one common technique is to look at the max-activating examples, and try to extrapolate a pattern from there. However, this approach has two flaws, namely:
Premise 1: It can be hard to extrapolate the correct pattern. Correctly identifying a pattern relies on accurate understanding of the semantics present in text. Many hypotheses may be likely, given the data, and the actual truth could be non-obvious. It seems easy to make a mistake when doing this.
Premise 2: Max activating examples may be anomalous. The most likely element of a distribution can look very different from the typical set. Conclusions drawn based on one (or a few) highly activating examples may turn out to be incorrect when evaluated against the majority of examples.
In the following discussion, I outline a proposal to interpret SAE features using the singular value decomposition of SAE activation patterns, which I think neatly addresses both of these issues.
Activation Pattern SVD
Suppose we have a set of $n$ SAE features, which we would like to interpret using a dataset of $m$ unique context windows. To do this, we compute the activation matrix $A \in \mathbb{R}^{n \times m}$, where $A_{ij}$ describes the activation of feature $i$ on (the last token of) context window $j$. We then compute the singular value decomposition $A = U \Sigma V^\top$ and take the top $k$ elements. (Assume that we have a good way of choosing $k$, e.g. by looking for an "elbow point" in a reconstruction-loss curve.)
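To make the decomposition step concrete, here is a minimal sketch in Python/NumPy. The activation matrix is a synthetic sparse non-negative stand-in (in practice it would come from running an SAE over the dataset), and the 90%-energy rule for picking $k$ is just a placeholder for eyeballing an elbow point; none of these specifics are part of the proposal itself.

```python
import numpy as np

# Stand-in for the real activation matrix: n_features x n_contexts,
# sparse and non-negative, as SAE activations typically are.
# In practice A[i, j] would be feature i's activation on the last
# token of context window j.
rng = np.random.default_rng(0)
n_features, n_contexts = 512, 2048
A = rng.random((n_features, n_contexts)) * (rng.random((n_features, n_contexts)) < 0.05)

# Full SVD; for a large A you would use a truncated solver instead
# (e.g. scipy.sparse.linalg.svds or sklearn's TruncatedSVD).
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Choose k: here, the smallest k capturing 90% of the squared Frobenius
# norm, as a crude stand-in for looking for an elbow point.
energy = np.cumsum(S**2) / np.sum(S**2)
k = int(np.searchsorted(energy, 0.90)) + 1

# Truncated factors: A ≈ U_k @ diag(S_k) @ Vt_k
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]
```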
Remark 3: The SVD defines activation prototypes. Note that $A \approx U_k \Sigma_k V_k^\top$; each column $v_i$ (in $V_k$) describes a "prototypical" activation pattern of a general SAE feature over all context windows in the dataset.
Remark 4: The activation patterns are linear combinations of prototypes. Define the coefficient matrix $C = \Sigma_k U_k^\top$. Each column $c_i$ (in $C$) contains the coefficients which are used to reconstruct the activation pattern of SAE feature $i$ as a linear combination of the prototypes.
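Continuing the sketch above, the prototypes are the rows of `Vt_k` (equivalently, the columns of $V_k$), and the coefficient matrix can be formed directly from the truncated factors; the reconstruction check at the end just confirms the bookkeeping.

```python
# Each row of Vt_k (column of V_k) is a "prototype": an activation
# pattern over all context windows.
prototypes = Vt_k                      # shape (k, n_contexts)

# Coefficient matrix C = Sigma_k U_k^T: column i holds the k coefficients
# that reconstruct feature i's activation pattern from the prototypes.
C = np.diag(S_k) @ U_k.T               # shape (k, n_features)

# Sanity check: A is (approximately) recovered as prototypes weighted by C.
A_hat = C.T @ prototypes               # shape (n_features, n_contexts)
print("relative reconstruction error:",
      np.linalg.norm(A - A_hat) / np.linalg.norm(A))
```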
Conjecture 5: $C$ is an isometric embedding. The Euclidean distance $\lVert c_i - c_j \rVert_2$ between columns $c_i, c_j$ is an approximation of the Jaccard distance between the activation patterns of features $i$ and $j$.
(Note: I am somewhat less certain about Conjecture 5 than the preceding remarks. Well-reasoned arguments for or against are appreciated.)
How does SVD help?
Now, let’s return to the problems discussed above. I will outline how this proposal solves both of them.
Re: hardness. Here, we have reduced the problem of interpreting the activation patterns of $n$ SAE features to interpreting $k \ll n$ prototypes, which we expect to be more tractable. This may also resolve issues with feature splitting. A counterpoint here is that we expect a prototype to be "broader" (e.g. activating on "sycophantic" or "hallucinatory" inputs in many contexts), and hence less interpretable, than an SAE feature, which is often highly context-specific (e.g. activating only on text where the Golden Gate Bridge is mentioned).
Re: anomalousness. Since we started out with full information about each SAE feature's activations on every input in the dataset, rather than only its max-activating examples, we expect this problem to be largely resolved.
Concrete experiments.
Some specific experiments that could be run to validate the ideas here:
Cf. Remark 3: look at the "prototypical" activation patterns (the columns of $V_k$) and see whether they're more interpretable than typical SAE features.
Cf. Conjecture 5: compute the coefficient matrix $C$ and the pairwise Euclidean distances between its columns, then correlate these with the ground-truth Jaccard distances between activation patterns (see the sketch after this list).
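A sketch of that second experiment, reusing `A` and `C` from the snippets above. The binarization threshold (any nonzero activation counts as "on") and the choice of Spearman rank correlation are my own assumptions, not part of the original proposal.

```python
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Euclidean distances between the columns of C (one column per feature).
euclid = pdist(C.T, metric="euclidean")

# Ground-truth Jaccard distances between binarized activation patterns:
# each feature's pattern is treated as the set of context windows on
# which it fires at all.
jaccard = pdist(A > 0, metric="jaccard")

# Conjecture 5 predicts these should be strongly (rank-)correlated.
rho, pval = spearmanr(euclid, jaccard)
print(f"Spearman rho = {rho:.3f} (p = {pval:.2g})")
```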
Conclusion
The proposal outlined here is conceptually simple but also pretty computationally intensive, and I'm unsure whether it's principled. Nonetheless, it seems like something simple that somebody should try. Feedback is greatly appreciated!
This seems easy to try and a potential point to iterate from, so you should give it a go. But I worry that U and V will be dense and very uninterpretable:
A contains no information about which actual tokens each SAE feature activated on, right? Just the token positions? So activations in completely different contexts, but with the same features active in the same token positions, cannot be distinguished by A?
I’m not sure why you expect A to have low-rank structure. Being low-rank is often in tension with being sparse, and we know that A is a very sparse matrix.
Perhaps it would be better to utilize the fact that A is a very sparse matrix of positive entries? Maybe permutation matrices or sparse matrices would be more apt than general orthogonal matrices (which can have negative entries)? (Then you might have to settle for something like a block-diagonal central matrix, rather than a diagonal matrix of singular values.)
I’m keen to see stuff in this direction though! I certainly think you could construct some matrix or tensor of SAE activations such that some decomposition of it is interpretable in an interesting way.
Interesting! I’d love to see a few concrete examples if it’s possible & reasonably easy to hand-compute some. I’ve recently seen it argued (can’t recall where unfortunately) that it’s easy with max-activating examples to get sensitivity, but extremely hard to also get specificity.