Fascinating post! I (along with Hardik Bhatnagar and Joseph Bloom) recently completed a profile of cases of SAE latent co-occurrence in GPT2-small and Gemma-2-2b (see here) and I think that this is a really compelling driver for a lot of the behaviour that we see, such as the link to SAE width. In particular, we observe a lot of cases with apparent parent-child relations between the latents (e.g. here).
We also see a similar ‘splitting’ of activation strength in cases of composition. For example, we find a case where the child latents are all days of the week (e.g. ‘Monday’), but the activation (or lack thereof) of the parent latent corresponds to whether there is a space in the token (e.g. ‘ Monday’) (see here). When the parent and child are both active, each has roughly half the activation strength that the child has when it is active alone, which I think is similar to what you observe, although it is made more complex because we do not know the underlying true features in this case. If this holds in general, perhaps it would be possible to improve your method for preventing co-occurrence/absorption by looking not only for splits in the activation density, but also for pairs of features whose activation strengths are strongly coupled/proportional in this manner?
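To make the coupling idea concrete, here is a minimal sketch of one way it could be checked. It compares a child latent's mean activation when the parent co-fires against its mean activation when it fires alone. The function name, the threshold `eps`, and the activation values are all made up for illustration; this is not the method from either piece of work.

```python
import numpy as np

def cofire_strength_ratio(parent, child, eps=1e-6):
    """Mean child activation when the parent also fires, divided by the
    mean child activation when the child fires alone (hypothetical helper)."""
    parent = np.asarray(parent, dtype=float)
    child = np.asarray(child, dtype=float)
    child_on = child > eps
    parent_on = parent > eps
    both = child_on & parent_on
    alone = child_on & ~parent_on
    if not both.any() or not alone.any():
        return float("nan")  # ratio is undefined without both regimes
    return float(child[both].mean() / child[alone].mean())

# Synthetic activations over six tokens (invented numbers, not real SAE data).
parent = np.array([0.0, 0.0, 2.0, 2.1, 0.0, 1.9])
child = np.array([4.0, 4.2, 2.0, 2.1, 3.8, 1.9])

ratio = cofire_strength_ratio(parent, child)
print(round(ratio, 3))
```

A ratio that sits near a stable value such as 0.5 across many tokens would be the kind of proportional coupling described above; a ratio near 1 would suggest the parent's presence does not change the child's strength.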
The behavior you see in your study is fascinating as well! I wonder if using a tied SAE would force these relationships in your work to be even more obvious, since if the SAE decoder in a tied SAE tries to mix co-occurring parent/child features together it has to also mix them in the encoder and thus it should show up in the activation patterns more clearly. If an underlying feature co-occurs between two latents (e.g. a parent feature), tied SAEs don’t have a good way to keep the latents themselves from firing together and thus showing up as a co-firing latent. Untied SAEs can more easily do an absorptiony thing and turn off one latent when the other fires, for example, even if they both encode similar underlying features.
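As a rough illustration of that argument, here is a two-dimensional toy sketch with a hand-picked decoder in which the second latent has absorbed the parent direction. The directions, weights, and the absence of biases are assumptions chosen for the example, not anything taken from a trained SAE.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical orthogonal underlying features.
p = np.array([1.0, 0.0])  # parent
c = np.array([0.0, 1.0])  # child (only ever active together with the parent)

# Decoder rows: latent 0 = parent, latent 1 = child mixed with parent.
W_dec = np.stack([p, (p + c) / np.sqrt(2)])

# Tied SAE: the encoder is forced to equal the decoder.
W_enc_tied = W_dec
# Untied SAE: the encoder is free, so latent 0 can subtract the child direction.
W_enc_untied = np.stack([p - c, np.sqrt(2) * c])

def encode(x, W_enc):
    return relu(W_enc @ x)

x_both = p + c  # input where parent and child are both active

tied = encode(x_both, W_enc_tied)      # both latents fire
untied = encode(x_both, W_enc_untied)  # latent 0 is switched off

print(tied, untied)
```

In this toy setup the untied encoder silences the parent latent while still reconstructing the input exactly, whereas the tied encoder has no way to stop both latents from firing together.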
I think a next step for this work is to try to do clustering of activations based on their position in the activation density histogram of latents. I expect we should see some of the same clusters being present across multiple latents, and that those latents should also co-fire together to some extent.
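A very rough sketch of what that clustering could look like is below. It splits one latent's nonzero activations into a low and a high group with a simple two-means pass (a stand-in for proper peak detection on the histogram), then measures how well the low group overlaps with the samples on which another latent fires. The helper names and the activation values are hypothetical.

```python
import numpy as np

def split_by_peak(acts, n_iter=20, eps=1e-6):
    """Two-means on a latent's nonzero activations.
    Returns (low_samples, high_samples) as sets of sample indices."""
    acts = np.asarray(acts, dtype=float)
    idx = np.flatnonzero(acts > eps)
    if idx.size == 0:
        return set(), set()
    vals = acts[idx]
    lo, hi = vals.min(), vals.max()
    is_high = np.zeros(vals.shape, dtype=bool)
    for _ in range(n_iter):
        is_high = np.abs(vals - hi) < np.abs(vals - lo)
        if is_high.all() or not is_high.any():
            break  # only one group found
        lo, hi = vals[~is_high].mean(), vals[is_high].mean()
    return set(idx[~is_high].tolist()), set(idx[is_high].tolist())

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Invented activations over eight samples for two latents.
latent_a = [0.0, 2.0, 2.1, 1.9, 4.0, 4.1, 3.9, 0.0]
latent_b = [0.0, 1.5, 1.4, 1.6, 0.0, 0.0, 0.0, 0.0]

low_a, high_a = split_by_peak(latent_a)
b_active = set(np.flatnonzero(np.asarray(latent_b) > 1e-6).tolist())

print(jaccard(low_a, b_active), jaccard(high_a, b_active))
```

Here the low-activation group of the first latent coincides with the samples where the second latent fires, which is the kind of shared cluster plus co-firing the paragraph above predicts.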
The two other things in your work that feel important are the idea of models using low activations as a form of “uncertainty”, and non-linear features like days of the week forming a circle. The toy examples in our work here assume that neither of these things happens: features basically fire with a set magnitude (maybe with some variance), and the directions of features are mutually orthogonal (or mostly mutually orthogonal). In the case of models using low activations to signal uncertainty, we won’t necessarily see a clean peak in the activation histogram for the feature activating, or the activation peak might look very wide. In the case of features forming a circle, the underlying directions are not mutually orthogonal, and this will likely also show up as more activation peaks in the activation density histograms of latents representing these circular concepts. Those peaks won’t correspond to parent/child relationships and absorption, but instead just to the fact that different vectors on a circle all project onto each other.
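The circular case can be illustrated with a small sketch, assuming seven day-of-week directions spaced evenly on a unit circle and a latent aligned exactly with the first day. Both assumptions are idealizations for the example, not measurements from a model.

```python
import numpy as np

# Seven unit vectors evenly spaced on a circle (idealized days of the week).
angles = 2 * np.pi * np.arange(7) / 7
days = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (7, 2)

# A latent whose direction matches the first day exactly.
latent_dir = days[0]

# ReLU of the projection of each day onto the latent direction.
proj = np.maximum(days @ latent_dir, 0.0)

print(np.round(proj, 3))
```

The latent fires at full strength on its own day and at cos(2π/7), roughly 0.62, on the two adjacent days, so its activation histogram would show two peaks even though no parent/child relationship or absorption is involved.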
Do you think your work can be extended to automatically classify whether an underlying feature is a circular or non-linear feature, or is in a parent/child relationship, and whether the underlying feature doesn’t basically fire with a set magnitude but instead uses magnitude as uncertainty? It would be great to have a sense of what portion of features in a model are of which sorts (set magnitude vs variable magnitude, mostly orthogonal direction vs forming a geometric shape with related features, parent/child, etc.). For the method we present here, it would be helpful to know if an activation density peak is an unwanted parent or child feature component that should be projected out of the latent, vs something that’s intrinsically part of the latent (e.g. just the same feature with a lower magnitude, or a circular geometric relationship with related features).
I agree that comparing tied and untied SAEs might be a good way to separate cases where the underlying features are inherently co-occurring. I have wondered if this might lead to a way to better understand the structure of how the model makes decisions, similar to the work of Adam Shai (https://arxiv.org/abs/2405.15943). It may be that cases where the tied SAE has to just not represent a feature are a good way of detecting inherently hierarchical features (to work out if something is an apple, you first decide if it is a fruit, for example), if LLMs learn to think that way.
I think what you say about clustering of activation densities makes sense, though in the case of Gemma I think the JumpReLU might need to be corrected for to ‘align’ them.
In terms of classifying ‘uncertainty’ vs ‘compositional’ cases of co-occurrence, I believe there is a signal in the graph structure of which features co-occurred with one another, but I have not yet nailed down how much structure implies function and vice versa.
Compositionality seemed to correlate with a ‘hub and spoke’ type of structure (see here, top left panel: https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=gemma-scope-2b-pt-res-canonical&sae_id=layer_12_width_16k_canonical&size=4&subgraph=4740 and https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=gemma-scope-2b-pt-res-canonical&sae_id=layer_12_width_16k_canonical&size=4&subgraph=4740 ).
We also found a cluster in layer 18 that mirrors the first example above in layer 12 of Gemma-2-2b. It has worse compositional encoding, but also a slightly less hub-like structure: https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=res-jb&sae_id=layer_18_width_16k_canonical&size=5&subgraph=201.
For ambiguity, we normally see a close to fully connected graph e.g. https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=res-jb&sae_id=layer_18_width_16k_canonical&size=5&subgraph=201
This is clearly not perfect, as https://feature-cooccurrence.streamlit.app/?model=gpt2-small&subgraph=125&sae_release=res-jb-feature-splitting&sae_id=blocks_8_hook_resid_pre_24576&size=5&point_x=-31.171305&point_y=-6.12931 does not fit this pattern: it looks like there is a compositional encoding of the position of a token in the URL, but the graph is not in the hub/spoke pattern.
Nevertheless, I think this points to a way that could likely quantify composition vs ambiguity.
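One possible way to put numbers on this, sketched under the assumption that each co-occurrence subgraph is available as a symmetric 0/1 adjacency matrix, is to compare edge density with Freeman degree centralization. A pure hub-and-spoke graph has centralization 1 and low density, while a fully connected graph has centralization 0 and density 1. The function names and the five-node example graphs are illustrative only, not the analysis we actually ran.

```python
import numpy as np

def density(adj):
    """Fraction of possible edges present (symmetric 0/1 adjacency, no self-loops)."""
    n = adj.shape[0]
    return adj.sum() / (n * (n - 1))  # adj.sum() counts each edge twice

def degree_centralization(adj):
    """Freeman degree centralization: 1 for a star, 0 for a complete graph (n >= 3)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    return (deg.max() - deg).sum() / ((n - 1) * (n - 2))

n = 5
# Hub and spoke: node 0 connected to every other node.
star = np.zeros((n, n))
star[0, 1:] = 1
star[1:, 0] = 1
# Fully connected: every pair of nodes linked.
complete = np.ones((n, n)) - np.eye(n)

print(density(star), degree_centralization(star))
print(density(complete), degree_centralization(complete))
```

Real subgraphs would fall somewhere between these two extremes, and cases like the URL example above would show up as compositional clusters with low centralization.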
Regarding non-linear / circular projections, my intuition is that this goes hand in hand with compositionality, but I would not say this for certain.
But trying to nail down the relation between the co-occurrence graph structure and the type of co-occurrence is certainly something I would like to look into further.