Highest-activating Features

I agree we shouldn’t interpret features by their max activation, but I think the activation magnitude really does matter. Removing smaller activations affects downstream CE loss less than removing larger activations (though this does mean the small activations matter too). A weighted percentage of feature activation captures this better, i.e. (sum of all Golden Gate activations) / (sum of all activations).
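To make the weighted-percentage idea concrete, here’s a minimal sketch (assuming you already have a matrix of SAE feature activations over some token stream; `feature_acts` and `feature_idx` are just illustrative names, not anyone’s actual code):

```python
import torch

def weighted_activation_fraction(feature_acts: torch.Tensor, feature_idx: int) -> float:
    """Fraction of total SAE activation mass carried by one feature.

    feature_acts: (n_tokens, n_features) tensor of non-negative SAE activations.
    feature_idx:  index of the feature of interest (e.g. the Golden Gate feature).
    """
    return (feature_acts[:, feature_idx].sum() / feature_acts.sum()).item()
```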
I do believe “lower-activating examples don’t fit your hypothesis” is bad because of circuits. If you find out that “Feature 3453 is a linear combination of the Golden Gate (GG) feature and the positive sentiment feature” then you do understand this feature at high GG activations, but not low GG + low positive sentiment activations (since you haven’t interpreted low GG activations).
Your “code-error” feature example is good. If it only fits “code-error” at the largest feature activations & does other things at lower activations, then if we ablate this feature, we’ll take a capabilities hit because the lower activations were used in other computations. But let’s focus on the bit about the lower activations, which we don’t understand, being used in other computations. We could also have “code-error” or “deception” being represented in the lower activations of other features which, when co-occurring, cause the model to be deceptive or write code errors.

[Although Anthropic showed evidence against this by ablating the code-error feature & running the model on errored code, after which it predicted a non-error output.]
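For reference, the kind of ablation check I mean might look roughly like the sketch below: swap the residual stream for the SAE reconstruction, optionally zero one feature, and compare next-token CE. Everything here (`model.hook_point`, the model returning logits directly, the hooked module outputting the residual-stream tensor, the SAE exposing `encoder`/`decoder`) is a hypothetical setup, not Anthropic’s actual code:

```python
import torch
import torch.nn.functional as F

def ce_with_feature_ablated(model, sae, tokens, feature_idx=None):
    """Next-token CE with the residual stream replaced by the SAE reconstruction,
    optionally with one feature zeroed out (feature_idx=None means no ablation)."""
    def hook(module, inputs, output):
        acts = torch.relu(sae.encoder(output))   # SAE feature activations
        if feature_idx is not None:
            acts[..., feature_idx] = 0.0         # ablate the chosen feature
        return sae.decoder(acts)                 # replace the module's output

    handle = model.hook_point.register_forward_hook(hook)  # hypothetical hook site
    try:
        logits = model(tokens)                   # (batch, seq, vocab), assumed
        ce = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.shape[-1]),
            tokens[:, 1:].reshape(-1),
        )
    finally:
        handle.remove()
    return ce.item()
```

Comparing the value with and without `feature_idx` set, on errored vs. clean code, is the kind of evidence I have in mind here.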
Finding Features
Anthropic suggested that if you have a feature that occurs once per billion tokens, you need 1 billion features. You also mention finding important features. I think SAEs find features based on the dataset you give them. For example, we trained an SAE on only chess data (on a chess-finetuned Pythia model) & all the features were about chess. I bet if you trained it on code, it’d find only code features (note: I do think some semantic & token-level features would generalize to other domains).

Pragmatically, if there are features you care about, then it’s important to train the SAE on many texts that exhibit those features. This is also true for safety-relevant features.
In general, I don’t think you need these 1000x feature expansions. Even a 1x feature expansion will give you sparse features (because of the L1 penalty). If you want your model to [have positive personality traits], then you only need to disentangle those features.
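A minimal sketch of what I mean by “even a 1x expansion still gives you sparse features”: the expansion factor and the L1 coefficient are separate knobs, and the L1 term on the activations is what drives sparsity. Illustrative PyTorch, not anyone’s actual training code:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: d_model -> d_model * expansion_factor -> d_model."""

    def __init__(self, d_model: int, expansion_factor: int = 1):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model * expansion_factor)
        self.decoder = nn.Linear(d_model * expansion_factor, d_model)

    def forward(self, x):
        acts = torch.relu(self.encoder(x))  # non-negative feature activations
        return self.decoder(acts), acts

def sae_loss(x, recon, acts, l1_coeff=1e-3):
    # Reconstruction term plus L1 penalty on the activations; the L1 term,
    # not the expansion factor, is what pushes the code toward sparsity.
    return ((recon - x) ** 2).mean() + l1_coeff * acts.abs().mean()
```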
[Note: I think your “SAEs don’t find all Othello board state features” point does not show that SAEs don’t find relevant features, but I’d need to think for 15 min to state it clearly, which I don’t want to do now, lol. If you think that’s a crux, though, then I’ll try to communicate it.]
Correlated Features
They said 82% of features had a maximum correlation of 0.3 with any neuron (wait, does this imply that 18% of their million billion features correlated even more???), which I agree is a lot. I think this is the strongest evidence that “the neuron basis is not as good as SAEs”, though I’m unsure who is still arguing that at this point; as a sanity check it makes sense.

However, some neurons are monosemantic, so it makes sense for SAE features to also find those (though again, 18% of a million billion have a higher correlation than 0.3?)
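For concreteness, the sanity check I have in mind is roughly the sketch below (assuming you’ve collected feature and neuron activations over the same tokens; names are mine):

```python
import torch

def max_neuron_correlation(feature_acts: torch.Tensor, neuron_acts: torch.Tensor) -> torch.Tensor:
    """For each SAE feature, the max |Pearson correlation| with any neuron.

    feature_acts: (n_tokens, n_features)
    neuron_acts:  (n_tokens, n_neurons)
    """
    f = (feature_acts - feature_acts.mean(0)) / (feature_acts.std(0) + 1e-8)
    n = (neuron_acts - neuron_acts.mean(0)) / (neuron_acts.std(0) + 1e-8)
    corr = (f.T @ n) / (f.shape[0] - 1)      # (n_features, n_neurons)
    return corr.abs().max(dim=1).values

# (max_neuron_correlation(feature_acts, neuron_acts) > 0.3).float().mean()
# then gives the fraction of features with a neuron correlate above 0.3.
```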
> We additionally confirmed that feature activations are not strongly correlated with activations of any residual stream basis direction.
I’m sure they actually found very strongly correlated features specifically for the outlier dimensions in the residual stream, which Anthropic has prior work showing are basis-aligned (unless Anthropic trains their models in a way that doesn’t produce outlier dimensions, on which there is existing literature).
[Note: I wrote a lot. Feel free to respond to this comment in parts!]
> I do believe “lower-activating examples don’t fit your hypothesis” is bad because of circuits. If you find out that “Feature 3453 is a linear combination of the Golden Gate (GG) feature and the positive sentiment feature” then you do understand this feature at high GG activations, but not low GG + low positive sentiment activations (since you haven’t interpreted low GG activations).
Yeah, this is the kind of limitation I’m worried about. Maybe for interpretability purposes, it would be good to pretend we have a gated SAE which only kicks in at ~50% of the max activation. So when you look at the active features, all the “noisy” low-activation features are hidden and you only see “the model is strongly thinking about the Golden Gate Bridge”. This ties into my question at the end about how many tokens have any high-activation feature.
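One cheap way to approximate that, without training an actual gated SAE, is a post-hoc threshold on an already-trained SAE’s activations, something like this sketch (names are mine):

```python
import torch

def high_activation_view(feature_acts: torch.Tensor, max_acts: torch.Tensor, frac: float = 0.5):
    """Zero out activations below `frac` of each feature's max activation.

    feature_acts: (n_tokens, n_features) activations on the text being inspected.
    max_acts:     (n_features,) per-feature max activation, e.g. over a large corpus.
    """
    return feature_acts * (feature_acts >= frac * max_acts)

# Counting tokens where anything survives the filter is then my question at the end:
# (high_activation_view(acts, max_acts) > 0).any(dim=1).float().mean()
```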
> Anthropic suggested that if you have a feature that occurs once per billion tokens, you need 1 billion features. You also mention finding important features. I think SAEs find features based on the dataset you give them.
This matches my intuition. Do you know if people have experimented on this and written it up anywhere? I imagine the simplest thing to do might be having corpora in different languages (e.g. English and Arabic), and training an SAE on various ratios of them until an Arabic-text-detector feature shows up.
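Concretely, the sweep might look something like this sketch, with toy placeholders standing in for real corpora and the SAE-training/feature-checking steps left as hypothetical comments:

```python
import random

def mixed_corpus(english_docs, arabic_docs, arabic_frac, n_docs, seed=0):
    """Sample a training corpus with a given fraction of Arabic documents."""
    rng = random.Random(seed)
    n_arabic = round(arabic_frac * n_docs)
    docs = rng.choices(arabic_docs, k=n_arabic) + rng.choices(english_docs, k=n_docs - n_arabic)
    rng.shuffle(docs)
    return docs

# Toy placeholders standing in for real corpora.
english_docs = ["the bridge is red"] * 100
arabic_docs = ["الجسر أحمر"] * 100

for arabic_frac in [0.0, 1e-3, 1e-2, 1e-1]:
    corpus = mixed_corpus(english_docs, arabic_docs, arabic_frac, n_docs=10_000)
    # sae = train_sae(model, corpus)                        # hypothetical training step
    # has_detector = any_feature_fires_only_on_arabic(sae)  # hypothetical check
```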
> I’m sure they actually found very strongly correlated features specifically for the outlier dimensions in the residual stream, which Anthropic has prior work showing are basis-aligned (unless Anthropic trains their models in a way that doesn’t produce outlier dimensions, on which there is existing literature).
That would make sense, assuming they have outlier dimensions!
Strong upvote, fellow co-author! lol