1/​ Sparse autoencoders trained on the embedding weights of a language model have very interpretable features! We can decompose a token into its top activating features to understand how the model represents the meaning of the token.🧵
2/​ To visualize each feature, we project the output direction of the feature onto the token embeddings to find the most similar tokens. We also show the bottom and median tokens by similarity, but they are not very interpretable.
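A minimal sketch of this visualization step, assuming we already have an SAE decoder matrix `W_dec` of shape `[n_features, d_model]`, the token embedding matrix `embeddings` of shape `[vocab, d_model]`, and a HuggingFace-style tokenizer; the function name, signature, and use of cosine similarity are illustrative assumptions, not the exact hackathon code.

```python
import torch

def top_tokens_for_feature(W_dec, embeddings, tokenizer, feature_idx, k=10):
    """Rank vocabulary tokens by cosine similarity between their embedding
    and one feature's decoder (output) direction."""
    direction = W_dec[feature_idx]                       # [d_model]
    sims = torch.nn.functional.cosine_similarity(
        embeddings, direction.unsqueeze(0), dim=-1)      # [vocab]
    order = sims.argsort(descending=True).tolist()
    top = [(tokenizer.decode([i]), sims[i].item()) for i in order[:k]]
    median_i = order[len(order) // 2]
    median = [(tokenizer.decode([median_i]), sims[median_i].item())]
    bottom = [(tokenizer.decode([i]), sims[i].item()) for i in order[-k:]]
    return top, median, bottom
```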
3/​ The token “deaf” decomposes into features for audio and disability! None of the examples in this thread are cherry-picked – they were all (really) randomly chosen.
4/ Usually, SAEs are trained on a model component's internal activations over billions of input tokens. But here we train only on the rows of the embedding weight matrix (where each row is the embedding of one token in the vocabulary).
5/​ Most SAEs have many thousands of features. But for our embedding SAE, we only use 2000 features because of our limited dataset. We are essentially compressing the embedding matrix into a smaller, sparser representation.
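As a rough illustration of points 4 and 5, here is what training such an SAE could look like, assuming a vanilla ReLU autoencoder with an L1 sparsity penalty; the GPT-2 model choice, the 2000-feature width, and all hyperparameters are placeholders rather than our exact setup.

```python
import torch
import torch.nn as nn
from transformer_lens import HookedTransformer

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts

model = HookedTransformer.from_pretrained("gpt2")        # placeholder model choice
W_E = model.W_E.detach()                                 # [vocab, d_model]: the entire dataset
sae = SparseAutoencoder(d_model=W_E.shape[1], n_features=2000)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(10_000):
    batch = W_E[torch.randint(0, W_E.shape[0], (256,))]  # sample rows of the embedding matrix
    recon, acts = sae(batch)
    mse = ((recon - batch) ** 2).mean()
    sparsity = acts.abs().sum(dim=-1).mean()
    loss = mse + 3e-4 * sparsity                         # L1 coefficient is a placeholder
    opt.zero_grad()
    loss.backward()
    opt.step()
```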
6/ The reconstructions are not highly accurate – on average ~60% of the variance is unexplained (~0.7 cosine similarity between each embedding and its reconstruction), with ~6 features active per token. So more work is needed to see how useful these features are.
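For reference, these metrics could be computed roughly as follows, continuing the variable names from the training sketch above (which are themselves assumptions):

```python
with torch.no_grad():
    recon, acts = sae(W_E)
    # Fraction of variance unexplained: residual variance over total variance.
    fvu = ((W_E - recon) ** 2).sum() / ((W_E - W_E.mean(dim=0)) ** 2).sum()
    # Mean cosine similarity between each token embedding and its reconstruction.
    cos = torch.nn.functional.cosine_similarity(W_E, recon, dim=-1).mean()
    # Mean number of active features per token (L0).
    l0 = (acts > 0).float().sum(dim=-1).mean()
print(f"FVU={fvu:.2f}, cosine sim={cos:.2f}, L0={l0:.1f}")
```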
7/​ Note that for this experiment we used the subset of the token embeddings that correspond to English words, so the task is easier—but the results are qualitatively similar when you train on all embeddings.
8/​ We also compare to PCA directions and find that the SAE directions are in fact much more interpretable (as we would expect)!
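One way to get the PCA baseline directions, reusing the hypothetical helpers from the earlier sketches: take the principal directions of the centred embedding matrix and inspect them with the same top-token visualization used for SAE features.

```python
centered = W_E - W_E.mean(dim=0)
# Rows of Vh are the principal directions of the embedding matrix.
U, S, Vh = torch.linalg.svd(centered, full_matrices=False)

# Inspect the first principal direction just like an SAE feature.
top, median, bottom = top_tokens_for_feature(Vh, W_E, model.tokenizer, feature_idx=0)
```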
9/​ I worked on embedding SAEs at an @apartresearch hackathon in April, with Sajjan Sivia and Chenxing (June) He. Embedding SAEs were also invented independently by @Michael Pearce.
Crossposted from https://x.com/JosephMiller_/status/1839085556245950552