Are the features learned by the model the same as the features learned by SAEs?
TL;DR: I want model-features to be a property of the model weights, and to be recognizable without access to the full dataset. Toy models have that property. My “poor man’s model-features” have it. I want to know whether SAE-features have this property too, or whether SAE-features do not match the model-features.
Introduction: Neural networks likely encode features in superposition. That is, features are represented as directions in activation space, and the model likely tracks many more features than dimensions in activation space. Because features are sparse, it should still be possible for the model to recover and use individual feature values.[1]
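As a toy illustration of why sparsity makes this recovery possible (not from the post; the dimensions, sparsity level, and values below are arbitrary choices), treat features as nearly orthogonal random directions: projecting an activation back onto a feature direction recovers that feature’s value up to interference terms, which stay small precisely because only a few features are active at once.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 1024                       # 256 activation dims, 1024 features in superposition
W = rng.normal(size=(d, n))
W /= np.linalg.norm(W, axis=0)         # columns = nearly orthogonal feature directions

f = np.zeros(n)                        # sparse feature vector: only 5 features active
active = rng.choice(n, size=5, replace=False)
f[active] = rng.uniform(1.0, 2.0, size=5)

x = W @ f                              # activation = sum of a few feature directions

readout = W.T @ x                      # naive linear readout per feature
# The active features dominate the readout; everything else is interference
# noise that sparsity keeps small, so a threshold / ReLU-with-bias can suppress it.
top5 = np.argsort(readout)[-5:]
print("truly active:  ", sorted(active.tolist()))
print("largest readout:", sorted(top5.tolist()))
```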
Problem statement: The prevailing method for finding these features is Sparse Autoencoders (SAEs). SAEs are well-motivated because they do recover superposed features in toy models. However, I am not certain whether SAEs recover the features of LLMs. I am worried (though not confident) that SAEs recover features of the dataset rather than features of the model, and that we are thus overconfident in how much SAEs tell us.
SAE failure mode: SAEs are trained on a certain compression[2] task: compress activations into a sparse overcomplete basis, and reconstruct the original activations from this compressed representation. The solution to this problem can be identical to what the neural network does (it wants to store & use information), but it need not be. In TMS, the network’s only objective is to compress features, so it is natural that the SAE-features match the model-features. But LLMs solve a different task (well, we don’t have a good idea what LLMs do), and training an SAE on a model’s activations might yield a basis different from the model-features (see hypothetical Example 1 below).
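For concreteness, here is a minimal sketch of that compression setup as a standard ReLU + L1 sparse autoencoder (the architecture, dimensions, and coefficient are illustrative assumptions, not a specific implementation from the post). Note that the objective only rewards reconstructing activations sparsely; nothing in it refers to how the model itself uses those activations.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Map d_model activations into an overcomplete basis of d_sae latents."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))      # sparse, non-negative feature activations
        x_hat = self.dec(f)              # reconstruction from the compressed code
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error + sparsity penalty: a pure compression objective.
    return ((x - x_hat) ** 2).sum(dim=-1).mean() + l1_coeff * f.abs().sum(dim=-1).mean()

# Usage sketch: activations collected at some residual-stream hook point.
sae = SparseAutoencoder(d_model=768, d_sae=768 * 16)
acts = torch.randn(32, 768)              # stand-in for real LLM activations
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
```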
Operationalisation of model-features (I’m tabooing “true features”): In the Toy Model of Superposition (TMS) the model’s weights are clearly adjusted to the feature directions. We can tell a feature from looking at the model weights. I want this to be a property of SAE-features as well. Then I would be confident that the features are a property of the model, and not (only) of the dataset distribution. Concrete operationalisation:
I give you 5 real SAE-features and 5 made-up features (with similar properties). Can you tell which features are the real ones, without relying on the dataset (though you may use an individual prompt)? Lindsey (2024) is some evidence, but would it distinguish the SAE-features from an arbitrary decomposition of the activations into 5 fake-features?
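A sketch of how this test could be set up (the tensors below are hypothetical stand-ins; in practice the real rows would come from a trained SAE’s decoder): build made-up directions with matched norms, shuffle them in with the real ones, and ask whether anything computable from the model weights, plus at most a single prompt, separates the two groups.

```python
import torch

def make_fake_features(real_dirs: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Made-up feature directions with the same norms as the real ones."""
    g = torch.Generator().manual_seed(seed)
    fake = torch.randn(real_dirs.shape, generator=g)
    fake = fake / fake.norm(dim=-1, keepdim=True)
    return fake * real_dirs.norm(dim=-1, keepdim=True)

# Hypothetical stand-in: 5 rows of an SAE decoder, i.e. 5 real SAE-feature directions.
real = torch.randn(5, 768)
candidates = torch.cat([real, make_fake_features(real)], dim=0)
candidates = candidates[torch.randperm(candidates.shape[0])]
# Challenge: using only the model weights (and maybe one prompt), score each row
# and recover which 5 of these 10 directions are the real SAE-features.
```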
Why do I care? I expect that the model-features are, in some sense, the computational units of the model. I expect our understanding to be more accurate (and to generalize) if we understand what the model actually does internally (see hypothetical Example 2 below).
Is this possible? Toy models of computation in superposition seem to suggest that models give special treatment to feature directions (compared to arbitrary activation directions), for example the error correction described here. This may privilege the basis of model-features over other decompositions of activations. I discuss experiment proposals at the bottom.
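One way such a privileged basis could show up empirically is sketched below. This assumes a hypothetical helper `forward_with_delta(delta)` that reruns the model with a perturbation added at one hook point and returns next-token logits; the idea is to compare how strongly the output distribution moves when you perturb along a candidate feature direction versus a norm-matched random direction.

```python
import torch
import torch.nn.functional as F

def kl_from_baseline(logits_base: torch.Tensor, logits_pert: torch.Tensor) -> float:
    """KL(base || perturbed) for a single next-token logit vector."""
    return F.kl_div(
        F.log_softmax(logits_pert, dim=-1),   # input: log-probs of perturbed run
        F.log_softmax(logits_base, dim=-1),   # target: log-probs of baseline run
        log_target=True,
        reduction="sum",
    ).item()

def sensitivity(forward_with_delta, direction: torch.Tensor, scale: float,
                logits_base: torch.Tensor) -> float:
    """Output shift caused by adding scale * unit(direction) at the hook point."""
    delta = scale * direction / direction.norm()
    return kl_from_baseline(logits_base, forward_with_delta(delta))

# If feature directions receive special treatment (error correction / noise
# suppression), their sensitivity profile over `scale` should differ
# systematically from that of norm-matched random directions:
#   sensitivity(forward_with_delta, feature_dir, s, logits_base)
#   sensitivity(forward_with_delta, torch.randn_like(feature_dir), s, logits_base)
```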
Example 1: Imagine an LLM was trained on The Pile excluding Wikipedia. Now we train an SAE on the model’s activations on a different dataset that includes Wikipedia. I expect that the SAE will find Wikipedia-related features: for example, a Wikipedia-citation-syntax feature on a low level, or a Wikipedia-style-objectivity feature on a high level. I would claim that these are not features of the model: during training the model never encountered these concepts, so it has not reserved directions in its superposition arrangement (think geometric shapes in Toy Model of Superposition) for them.
It feels like there is a fundamental distinction between a model (SGD) “deciding” whether to learn a feature (as it does in TMS) and an SAE finding a feature that was useful for compressing activations.
Example 2: Maybe an SAE trained on an LLM playing Civilization and Risk finds a feature that corresponds to “strategic deception” on this dataset. But perhaps the model does not actually use a “strategic deception” feature (instead, strategic deception originates from some other feature, say a “power dynamics” feature), and it just happens that the instances of strategic deception in those games clustered into a specific direction. If we now take this direction to monitor for strategic deception, we will fail to notice other strategic deception originating from the same “power dynamics” feature.
If we had known that the model-features active during the strategic deception instances were the “power dynamics” (+ other) features, we would have been able to choose the right, better-generalizing deception-detection feature.
Experiment proposals: I have explored the abnormal effect that “poor man’s model-features” (sampled as the difference between two independent model activations) have on model outputs, and their relation to theoretically predicted noise suppression in feature activations. Experiments in Gurnee (2024) and Lindsey (2024) suggest that SAE decoder errors and SAE-features also have an abnormal effect on the model. With the LASR Labs team I mentor I want to explore whether SAE-features match the theoretical predictions, and whether the SAE-feature effects match those expected from model-features.
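A minimal sketch of the “poor man’s model-features” construction and the kind of comparison proposed (assuming a cached matrix of activations from one layer; the random-direction baseline and the final comparison are my assumptions about how such a test could be run, not a description of the existing experiments):

```python
import torch

def poor_mans_model_features(acts: torch.Tensor, n_samples: int, seed: int = 0) -> torch.Tensor:
    """'Poor man's model-features': differences between two independently
    sampled activations from the same layer."""
    g = torch.Generator().manual_seed(seed)
    i = torch.randint(0, acts.shape[0], (n_samples,), generator=g)
    j = torch.randint(0, acts.shape[0], (n_samples,), generator=g)
    return acts[i] - acts[j]

def matched_random_directions(dirs: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Gaussian baseline directions with norms matched to the given directions."""
    g = torch.Generator().manual_seed(seed)
    r = torch.randn(dirs.shape, generator=g)
    return r / r.norm(dim=-1, keepdim=True) * dirs.norm(dim=-1, keepdim=True)

# acts: [n_tokens, d_model] activations cached at one layer (stand-in here).
acts = torch.randn(10_000, 768)
pm_feats = poor_mans_model_features(acts, n_samples=100)
baseline = matched_random_directions(pm_feats)
# Proposed comparison: inject each direction at the same hook point and compare
# the size and shape of the downstream effect (e.g. KL divergence of the logits,
# as in the earlier sketch) for pm_feats vs. baseline vs. SAE decoder directions.
```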
My core request is that I want (SAE-)features to be a property of the model, rather than the dataset.
This could be misunderstood as taking issue with “If a concept is missing from the SAE training set, the SAE won’t find the corresponding feature.”—no, this is fine: the model-feature exists but simply isn’t found by the SAE.
What I do take issue with is “SAEs find a feature only because this concept is common in the dataset, rather than because the model uses this concept.”[1] In my book this is SAEs making up features, and that won’t help us understand models.
Of course a concept being common in the model-training-data makes it likely (?) to be a concept the model uses, but I don’t think this is a 1:1 correspondence. (So just making the SAE training set equal to the model training set wouldn’t solve the issue.)
There is a view that SAE features are just a useful tool for describing activations (interpretable features) and manipulating activations (useful for steering and probing): that SAEs are simply a particularly good method within a larger class of methods, but not uniquely principled. On that view I wouldn’t expect this connection to model behaviour.
But the claim we often make is that the model sees and understands the world as a set of model-features, and that we can see the same features by looking at SAE-features of the activations. For that claim I want to see the extra evidence.
[1] I’ll ignore the “there’s more to activations than just features” point made e.g. here; that’s a separate discussion.
[2] I know the SAE basis is larger, but it is enforced to be sparse and thus cannot perfectly store the activations.