I do believe “lower-activating examples don’t fit your hypothesis” is bad because of circuits. If you find out that “Feature 3453 is a linear combination of the Golden Gate (GG) feature and the positive sentiment feature”, then you understand this feature at high GG activations, but not at low GG + low positive sentiment activations (since you haven’t interpreted low GG activations).
Yeah, this is the kind of limitation I’m worried about. Maybe for interpretability purposes, it would be good to pretend we have a gated SAE which only kicks in at ~50% max activation. So when you look at the active features all the “noisy” low-activation features are hidden and you only see “the model is strongly thinking about the Golden Gate Bridge”. This ties in to my question at the end of how many tokens have any high-activation feature.
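To make the "pretend it's gated at ~50% of max activation" idea concrete, here is a rough sketch of what I have in mind. It assumes you already have a matrix of non-negative SAE feature activations and a per-feature max activation estimated on some reference dataset; the function names and toy numbers are made up for illustration, and this is a post-hoc filter, not an actual gated SAE architecture.

```python
import numpy as np

def hide_low_activations(acts, max_acts, frac=0.5):
    """Zero out activations below frac * (that feature's max activation).

    acts:     (n_tokens, n_features) non-negative SAE feature activations.
    max_acts: (n_features,) per-feature max activation (assumed to be
              estimated separately on a reference dataset).
    """
    threshold = frac * max_acts  # one cutoff per feature
    return np.where(acts >= threshold, acts, 0.0)

def frac_tokens_with_strong_feature(acts, max_acts, frac=0.5):
    """Fraction of tokens with at least one feature above its cutoff."""
    strong = hide_low_activations(acts, max_acts, frac) > 0
    return strong.any(axis=1).mean()

# Toy data: 4 tokens, 3 features (hypothetical numbers).
acts = np.array([
    [10.0, 0.2, 0.0],  # feature 0 at its max
    [1.0, 0.1, 0.3],   # only weak activations
    [0.0, 4.0, 0.0],   # feature 1 at its max
    [2.0, 0.0, 0.5],   # only weak activations
])
max_acts = np.array([10.0, 4.0, 2.0])

print(hide_low_activations(acts, max_acts))
print(frac_tokens_with_strong_feature(acts, max_acts))
```

The second function is the quantity I'm asking about at the end: if it comes out small on real data, then most tokens would show no features at all under this kind of view.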
Anthropic suggested that if you have a feature that occurs once per billion tokens, you need 1 billion features. You also mention finding important features. I think SAEs find features on the dataset you give them.
This matches my intuition. Do you know if people have experimented on this and written it up anywhere? I imagine the simplest thing to do might be to take corpora in different languages (e.g. English and Arabic) and train an SAE on various ratios of them until an Arabic-text-detector feature shows up.
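Roughly, the sweep could look like the sketch below. Everything here is hypothetical scaffolding: `mix_corpora` and `first_ratio_with_detector` are names I made up, and `train_and_check` stands in for the expensive part (train an SAE on the mixed data and decide whether some feature counts as an Arabic-text detector), which I'm not attempting to specify.

```python
import random

def mix_corpora(english, arabic, arabic_frac, n_docs, seed=0):
    """Sample n_docs documents (with replacement), a fraction
    arabic_frac of which come from the Arabic corpus."""
    rng = random.Random(seed)
    n_arabic = round(arabic_frac * n_docs)
    docs = rng.choices(arabic, k=n_arabic)
    docs += rng.choices(english, k=n_docs - n_arabic)
    rng.shuffle(docs)
    return docs

def first_ratio_with_detector(english, arabic, ratios, n_docs, train_and_check):
    """Return the smallest Arabic fraction at which train_and_check
    reports a detector feature, or None if it never does.

    train_and_check: hypothetical callback taking a list of documents
    and returning True/False; in a real experiment it would train an
    SAE on model activations over those documents.
    """
    for ratio in sorted(ratios):
        if train_and_check(mix_corpora(english, arabic, ratio, n_docs)):
            return ratio
    return None
```

Sweeping the ratios in increasing order means the first hit is a (noisy) estimate of how rare a language can be in the training data before the SAE stops dedicating a feature to it.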
I’m sure they actually found very strongly correlated features specifically for the outlier dimensions in the residual stream, which Anthropic’s previous work shows are basis-aligned (unless Anthropic trains their models in ways that don’t produce outlier dimensions, which there is existing literature on).
That would make sense, assuming they have outlier dimensions!