Nice post. I think this is a really interesting discovery.
[Copying from messages with Joseph Bloom] TLDR: I’m confused about what is different in the SAE input that causes the absorbed feature not to fire.
Me:
Summary of your findings
Say you have a “starts with s” feature and a “snake” feature.
You find that for most words that start with s, the “starts with s” feature correctly fires. But for a few words that start with s, like “snake”, it doesn’t fire.
These exceptional tokens where it doesn’t fire all have another feature that corresponds very closely to the token. For example, there is a “snake” feature that corresponds strongly to the snake token.
You say that the “snake” feature has absorbed the “starts with s” feature, because the concept of snake also contains/entails the concept of ‘starts with s’.
Most of the features that absorb other features correspond to common words, like “and”.
So why is this happening? Well, it makes sense that the SAE can do better on L1 on the snake token by just firing a single “snake” feature (rather than the “starts with s” feature and, say, the “reptile” feature). And it makes sense it would only have enough space to have these specific token features for common tokens.
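[Aside: a minimal numpy sketch of that L1 argument, with made-up directions and coefficients. The point is just that one dedicated “snake” latent reconstructs the same activation more cheaply under an L1 penalty than two general latents would.]

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Made-up feature directions (roughly orthogonal random unit vectors).
starts_with_s = rng.normal(size=d); starts_with_s /= np.linalg.norm(starts_with_s)
reptile = rng.normal(size=d); reptile /= np.linalg.norm(reptile)

# The model's activation on the "snake" token: a mix of both concepts.
snake_act = 3.0 * starts_with_s + 2.0 * reptile

# Option A: fire two general latents whose decoder rows are the concept directions.
code_general = np.array([3.0, 2.0])  # activations of "starts with s" and "reptile"

# Option B: fire a single dedicated "snake" latent whose (unit-norm) decoder row
# points exactly at snake_act -- i.e. a latent that has absorbed "starts with s".
code_dedicated = np.array([np.linalg.norm(snake_act)])

# Both options reconstruct snake_act exactly, but the dedicated latent is cheaper
# under an L1 penalty (triangle inequality), and it is sparser (1 latent vs 2).
print("L1 with two general latents:", np.abs(code_general).sum())    # 5.0
print("L1 with one 'snake' latent :", np.abs(code_dedicated).sum())  # ~3.6
```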
Joseph Bloom:
rather than the “starts with s” feature and, say, the “reptile” feature
We found cases of seemingly more general features getting absorbed in the context of spelling, but they are rarer / probably the exception. It’s worth noting that we suspect feature absorption is just easiest to find for token-aligned features, but it could conceptually occur any time a similar structure exists between features.
And it makes sense it would only have enough space to have these specific token features for common tokens.
I think this needs further investigation. We certainly sometimes see rarer tokens that get absorbed (e.g. a rare token that is a translation of a common token). I predict there is a strong density effect, but it could be non-trivial.
Me:
We found cases of seemingly more general features getting absorbed in the context of spelling
What’s an example?
We certainly sometimes see rarer tokens that get absorbed (e.g. a rare token that is a translation of a common token)
You mean like the “starts with s” feature could be absorbed into the “snake” feature on the French word for snake?
Does this only happen if the French word also starts with s?
Joseph Bloom:
What’s an example?
A latent aligned to a few words at once, e.g. one aligned to “assistance” that also fires weakly on “help”. We saw it absorb both “a” and “h”!
You mean like the “starts with s” feature could be absorbed into the “snake” feature on the French word for snake?
Yes
Does this only happen if the French word also starts with s?
More likely. I think the process is stochastic so it’s all distributions.
↓[Key point]↓
Me:
But here’s what I’m confused about. How does the “starts with s” feature ‘know’ not to fire? How is it able to fire on all words that start with s, except those tokens (like “snake”) that have a strongly correlated feature? I would assume that the token embeddings of the model contain some “starts with s” direction. And the “starts with s” feature input weights read off this direction. So why wouldn’t it also activate on “snake”? Surely that token embedding also has the “starts with s” direction?
Joseph Bloom:
I would assume that the token embeddings of the model contain some “starts with s” direction. And the “starts with s” feature input weights read off this direction. So why wouldn’t it also activate on “snake”? Surely that token embedding also has the “starts with s” direction?
I think the success of the linear probe is why we think the snake token does have the starts with s direction. The linear probe has much better recall and doesn’t struggle with obvious examples. I think the feature absorption work is not about how models really work, it’s about how SAEs obscure how models work.
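[Aside: to make the probe comparison concrete, here is a sketch with synthetic data standing in for real token embeddings (not the paper’s actual setup). A linear probe trained on “starts with s?” labels recovers the planted direction and still fires on a token whose embedding is dominated by a token-specific component.]

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 2000

# Synthetic "token embeddings": noise plus a planted "starts with s" direction.
s_dir = rng.normal(size=d); s_dir /= np.linalg.norm(s_dir)
labels = rng.integers(0, 2, size=n)                        # 1 = starts with s
embeds = 0.3 * rng.normal(size=(n, d)) + labels[:, None] * s_dir

probe = LogisticRegression(max_iter=1000).fit(embeds, labels)

# A "snake"-like embedding: mostly a token-specific direction, plus the s direction.
snake_specific = rng.normal(size=d); snake_specific /= np.linalg.norm(snake_specific)
snake_embed = 4.0 * snake_specific + 1.0 * s_dir

# The probe still says "starts with s", because its weight vector aligns with
# s_dir and largely ignores the token-specific component.
print("probe prediction for 'snake':", probe.predict(snake_embed[None])[0])   # 1
print("cosine(probe weights, s_dir):",
      float(probe.coef_[0] @ s_dir / np.linalg.norm(probe.coef_[0])))         # ~1.0
```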
But here’s what I’m confused about. How does the “starts with s” feature ‘know’ not to fire? Like what is the mechanism by which it fires on all words that start with s, except those tokens (like “snake”) that have a strongly correlated feature?
Short answer, I don’t know. Long answer—some hypotheses:
Linear probes can easily do calculations of the form “A AND B”. In large vector spaces, it may be possible to learn a direction of the form “(^S.*) AND not (snake) and not (sun) …”. Note that “snake” has a component separate from “starts with s”, so this is possible. To the extent this may be hard, that’s possibly why we don’t see more absorption, but my own intuition says that in large vector spaces this should be perfectly possible to do.
Encoder weights and decoder weights aren’t tied. If they were, you can imagine that choosing these exceptions for absorbed examples would damage reconstruction performance. Since we don’t tie the weights, the SAE can detect “(^S.*) AND not (snake) and not (sun) …” but write “(^S.*)”. I’m interested to explore this further and am sad we didn’t get to it in the project.
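[Aside: a numerical sketch of that untied-weights hypothesis, with made-up vectors. The encoder row of the “starts with s” latent reads “(^S.*) AND not (snake) and not (sun)”, while its decoder row still writes plain “starts with s”.]

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
unit = lambda v: v / np.linalg.norm(v)

s_dir   = unit(rng.normal(size=d))   # "starts with s" direction
snake_t = unit(rng.normal(size=d))   # token-specific "snake" direction
sun_t   = unit(rng.normal(size=d))   # token-specific "sun" direction

# Untied weights: the encoder subtracts the exception tokens, the decoder does not.
W_enc_s = s_dir - 2.0 * snake_t - 2.0 * sun_t   # detect "(^S.*) AND not (snake) and not (sun)"
W_dec_s = s_dir                                 # write plain "starts with s"

def starts_with_s_latent(x, bias=-0.2):
    """ReLU encoder for the 'starts with s' latent."""
    return max(0.0, float(W_enc_s @ x) + bias)

swim  = 1.0 * s_dir + 1.0 * unit(rng.normal(size=d))   # an ordinary s-word
snake = 1.0 * s_dir + 3.0 * snake_t                    # an s-word with its own dedicated latent

print("activation on 'swim' :", starts_with_s_latent(swim))   # > 0: the latent fires
print("activation on 'snake':", starts_with_s_latent(snake))  # 0.0: looks like absorption
```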
Also worth noting: in the paper we only classify something as “absorption” if the main latent doesn’t fire at all. We also saw cases I would call “partial absorption”, where the main latent fires, but weakly, and both the absorbing latent and the main latent have positive cosine sim with the probe direction, and both have an ablation effect on the spelling task.
Another intuition I have is that when the SAE absorbs a dense feature like “starts with S” into a sparse latent like “snake”, it loses the ability to adjust the levels of the component features relative to each other. So maybe the “snake” latent is 80% snakiness and 20% starts with S, but then in a real activation the SAE needs to reconstruct 75% snakiness and 25% starts with S. To do this, it might weakly fire a proper “starts with S” latent to make up the difference.
Hopefully this is something we can validate with toy models. I suspect that the specific values of L1 penalty and feature co-occurrence rates / magnitudes will lead to different levels of absorption.
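[Aside: a quick worked version of the 80% / 20% intuition above, as the smallest possible toy check. The numbers are made up; the point is just that a fixed-blend “snake” decoder row forces a weak but nonzero activation of the “starts with S” latent.]

```python
import numpy as np

# Work in a 2D (snakiness, starts-with-S) basis with made-up proportions.
d_snake = np.array([0.8, 0.2])    # "snake" latent decoder row: 80% snakiness, 20% starts-with-S
d_s     = np.array([0.0, 1.0])    # "starts with S" latent decoder row: pure starts-with-S
target  = np.array([0.75, 0.25])  # what the real activation actually contains

# Solve for the latent activations that reconstruct the target exactly.
coeffs = np.linalg.solve(np.column_stack([d_snake, d_s]), target)
print("'snake' latent activation        :", coeffs[0])  # 0.9375
print("'starts with S' latent activation:", coeffs[1])  # 0.0625 -- weak, but nonzero
```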
This thread reminds me that comparing feature absorption in SAEs with tied encoder/decoder weights and in end-to-end SAEs seems like a valuable follow-up.
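For reference, here is what the tied-weights variant looks like (my own sketch, not any particular paper’s implementation). With the decoder forced to be the encoder transposed, a latent can’t read an “AND not (snake)” direction while writing a clean “starts with s” direction, which is exactly the freedom the absorption story above relies on.

```python
import torch
import torch.nn as nn

class TiedSAE(nn.Module):
    """SAE with tied weights: the decoder matrix is the encoder matrix transposed."""
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        latents = torch.relu((x - self.b_dec) @ self.W.T + self.b_enc)  # read with W
        recon = latents @ self.W + self.b_dec                           # write with the same W
        return recon, latents

# An untied SAE would keep separate W_enc and W_dec, so a latent's encoder row
# can differ from its decoder row -- the freedom hypothesised to enable absorption.
recon, latents = TiedSAE(d_model=64, n_latents=512)(torch.randn(8, 64))
```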
Another approach would be to use a per-token decoder bias, as seen in some previous work: https://www.lesswrong.com/posts/P8qLZco6Zq8LaLHe9/tokenized-saes-infusing-per-token-biases
But this would only help when the absorbing feature is token-aligned; if it’s more abstract, this wouldn’t work as well.
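Roughly, the idea there (my own paraphrase of the mechanism, not the exact architecture from the linked post) is that the reconstruction gets a bias looked up by token id, so token-identity content like “this is the snake token” doesn’t have to be packed into a latent:

```python
import torch
import torch.nn as nn

class TokenBiasedSAE(nn.Module):
    """Sketch of an SAE whose reconstruction includes a per-token bias."""
    def __init__(self, d_model: int, n_latents: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)
        self.token_bias = nn.Embedding(vocab_size, d_model)  # hypothetical per-token decoder bias

    def forward(self, acts: torch.Tensor, token_ids: torch.Tensor):
        latents = torch.relu(self.encoder(acts))
        recon = self.decoder(latents) + self.token_bias(token_ids)
        return recon, latents

# Fake inputs, just to show the shapes: residual-stream activations plus their token ids.
sae = TokenBiasedSAE(d_model=64, n_latents=512, vocab_size=1000)
recon, latents = sae(torch.randn(8, 64), torch.randint(0, 1000, (8,)))
```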
Semi-relatedly, since most (if not all) of the SAE work since the original paper has gone into untied encoder/decoder weights, we don’t really know whether modern SAE architectures like JumpReLU or TopK suffer as large a performance hit as the original SAEs do, especially with the gains from adding token biases.