Crossposted from https://x.com/JosephMiller_/status/1839085556245950552
1/ Sparse autoencoders trained on the embedding weights of a language model have very interpretable features! We can decompose a token into its top activating features to understand how the model represents the meaning of the token. 🧵
2/ To visualize each feature, we project the output direction of the feature onto the token embeddings to find the most similar tokens. We also show the bottom and median tokens by similarity, but they are not very interpretable.
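For concreteness, the projection step looks roughly like the sketch below (PyTorch; `W_dec`, `E`, and the tokenizer argument are placeholders for a decoder matrix of shape features × d_model, the token embedding matrix, and a Hugging Face tokenizer, not our exact code):

```python
import torch
import torch.nn.functional as F

def tokens_by_similarity(feature_idx, W_dec, E, tokenizer, k=10):
    """Rank all tokens by cosine similarity to one feature's output (decoder) direction."""
    direction = F.normalize(W_dec[feature_idx], dim=0)   # unit output direction for this feature
    sims = F.normalize(E, dim=-1) @ direction            # cosine similarity with every token embedding
    order = sims.argsort(descending=True)
    mid = len(order) // 2
    top, median, bottom = order[:k], order[mid - k // 2 : mid + k // 2], order[-k:]
    decode = lambda ids: [tokenizer.decode([i]) for i in ids.tolist()]
    return decode(top), decode(median), decode(bottom)
```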
3/ The token “deaf” decomposes into features for audio and disability! None of the examples in this thread are cherry-picked – they were all (really) randomly chosen.
4/ Usually SAEs are trained on the internal activations of a component for billions of different input strings. But here we just train on the rows of the embedding weight matrix (where each row is the embedding for one token).
5/ Most SAEs have many thousands of features. But for our embedding SAE, we only use 2000 features because of our limited dataset. We are essentially compressing the embedding matrix into a smaller, sparser representation.
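Concretely, the setup is roughly the sketch below (the choice of GPT-2, the hyperparameters, and the L1 coefficient are illustrative assumptions rather than our exact configuration): the whole "dataset" is just the rows of the embedding matrix.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2Model

# The training data is the token embedding matrix itself: one row per token (GPT-2 assumed here).
E = GPT2Model.from_pretrained("gpt2").wte.weight.detach()   # (n_vocab, d_model)
d_model, n_features = E.shape[1], 2000

encoder = torch.nn.Linear(d_model, n_features)
decoder = torch.nn.Linear(n_features, d_model)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

for step in range(10_000):
    batch = E[torch.randint(0, E.shape[0], (256,))]          # sample rows of the embedding matrix
    acts = F.relu(encoder(batch))                            # sparse feature activations
    recon = decoder(acts)
    loss = F.mse_loss(recon, batch) + 1e-3 * acts.abs().sum(dim=-1).mean()  # reconstruction + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```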
6/ The reconstructions are not highly accurate – on average we have ~60% variance unexplained (~0.7 cosine similarity) with ~6 features active per token. So more work is needed to see how useful they are.
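(Continuing the training sketch above, these numbers correspond to metrics computed roughly like this:)

```python
# Evaluate reconstruction quality over all embedding rows (reuses E, encoder, decoder from above).
with torch.no_grad():
    acts = F.relu(encoder(E))
    recon = decoder(acts)
    frac_var_unexplained = (E - recon).var(dim=0).sum() / E.var(dim=0).sum()  # ~0.6 quoted above
    mean_cos = F.cosine_similarity(E, recon, dim=-1).mean()                   # ~0.7 quoted above
    l0 = (acts > 0).float().sum(dim=-1).mean()                                # ~6 features active per token
```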
7/ Note that for this experiment we used the subset of the token embeddings that correspond to English words, so the task is easier—but the results are qualitatively similar when you train on all embeddings.
8/ We also compare to PCA directions and find that the SAE directions are in fact much more interpretable (as we would expect)!
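The baseline is simple: fit PCA to the same embedding rows and inspect the principal directions with the same top-token projection (sketch below, reusing `tokens_by_similarity`, `E`, and `decoder` from the earlier sketches; sklearn and a GPT-2 tokenizer assumed):

```python
from sklearn.decomposition import PCA
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
pca = PCA(n_components=50).fit(E.numpy())
pc_directions = torch.tensor(pca.components_, dtype=E.dtype)   # (n_components, d_model) principal directions

sae_top, _, _ = tokens_by_similarity(0, decoder.weight.T, E, tokenizer)  # top tokens for one SAE feature
pca_top, _, _ = tokens_by_similarity(0, pc_directions, E, tokenizer)     # top tokens for the first principal component
```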
9/ I worked on embedding SAEs at an @apartresearch hackathon in April, with Sajjan Sivia and Chenxing (June) He.
Embedding SAEs were also invented independently by @Michael Pearce.
BBC Tech News, as far as I can tell, has not covered any of the recent OpenAI drama about NDAs or employees leaving.
But “Scarlett Johansson ‘shocked’ by AI chatbot imitation” is now the main headline.
There are two types of people in this world.
There are people who treat the lock on a public bathroom as a tool for communicating occupancy and a safeguard against accidental attempts to enter when the room is unavailable. For these people the standard protocol is to discern the likely state of engagement of the inner room and then tentatively proceed inside if they detect no signs of human activity.
And there are people who view the lock on a public bathroom as a physical barricade with which to temporarily defend possessed territory. They start by giving the door a hearty push to test the tensile strength of the barrier. On meeting resistance they engage with full force, wringing the handle up and down and slamming into the door with their full body weight. Only once their attempts are thwarted do they reluctantly retreat to find another stall.