I agree that comparing tied and untied SAEs might be a good way to separate out cases where the underlying features are inherently co-occurring. I have wondered whether this might lead to a way to better understand the structure of how the model makes decisions, similar to the work of Adam Shai (https://arxiv.org/abs/2405.15943). It may be that the cases where the tied SAE simply has to not represent a feature are a good way of detecting inherently hierarchical features (to work out whether something is an apple, you first decide whether it is a fruit, for example), if LLMs learn to think that way.
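To make concrete what I mean by the tied vs. untied comparison, here is a minimal sketch (not any of the SAEs discussed here; the class, names, and dimensions are all made up for illustration). In the tied case the decoder reuses the transposed encoder weights, so two features that always co-occur cannot be given independent decoder directions, whereas the untied case can split them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSAE(nn.Module):
    """Toy sparse autoencoder. `tied=True` reuses the encoder weights
    (transposed) as the decoder; `tied=False` learns a separate decoder.
    Purely illustrative, not the trained SAEs referenced above."""

    def __init__(self, d_model: int, d_sae: int, tied: bool = False):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.tied = tied
        if not tied:
            self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)

    def forward(self, x: torch.Tensor):
        # Encode: subtract decoder bias, project up, apply ReLU sparsity.
        acts = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: tied case shares W_enc, untied case has its own matrix.
        W_dec = self.W_enc.T if self.tied else self.W_dec
        recon = acts @ W_dec + self.b_dec
        return recon, acts
```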
I think what you say about clustering of activation densities makes sense, though in the case of Gemma I think the JumpReLU thresholds might need to be corrected for in order to 'align' them.
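For what I mean by 'correcting for' the JumpReLU: since each feature only fires above its own learned threshold, the raw post-activation values start at different offsets per feature. One crude way to put the densities on a common footing might be to subtract each feature's threshold from its non-zero activations before histogramming. This is only a sketch under that assumption; `acts` and `thresholds` are stand-ins for whatever the real SAE exposes:

```python
import numpy as np

def aligned_activation_density(acts: np.ndarray, thresholds: np.ndarray, bins: int = 50):
    """acts: (n_tokens, n_features) post-JumpReLU activations.
    thresholds: (n_features,) per-feature JumpReLU thresholds.
    Returns per-feature histograms of (activation - threshold) over the
    tokens where the feature fired, so every distribution starts at zero
    rather than at that feature's own threshold."""
    hists = []
    for j in range(acts.shape[1]):
        firing = acts[:, j][acts[:, j] > 0]
        if firing.size == 0:
            hists.append(None)  # feature never fired on this batch
            continue
        shifted = firing - thresholds[j]  # crude threshold alignment
        hist, edges = np.histogram(shifted, bins=bins, density=True)
        hists.append((hist, edges))
    return hists
```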
In terms of classifying 'uncertainty' vs. 'compositional' cases of co-occurrence, I believe there is a difference in the graph structure of which features co-occurred with one another, but I have not yet nailed down how much structure implies function and vice versa.
Compositionality seemed to correlate with a 'hub and spoke' type of structure (see the top left panel here: https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=gemma-scope-2b-pt-res-canonical&sae_id=layer_12_width_16k_canonical&size=4&subgraph=4740).
We also found a cluster in layer 18 that mirrors the first example above in layer 12 of Gemma-2-2b. It has a worse compositional encoding, but also a slightly less hub-like structure: https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=res-jb&sae_id=layer_18_width_16k_canonical&size=5&subgraph=201.
For ambiguity, we normally see a graph that is close to fully connected, e.g. https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=res-jb&sae_id=layer_18_width_16k_canonical&size=5&subgraph=201
This is clearly not perfect: https://feature-cooccurrence.streamlit.app/?model=gpt2-small&subgraph=125&sae_release=res-jb-feature-splitting&sae_id=blocks_8_hook_resid_pre_24576&size=5&point_x=-31.171305&point_y=-6.12931 does not fit this pattern. It looks like a compositional encoding of a token's position within the URL, yet the graph does not have the hub-and-spoke structure.
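One crude way to put numbers on the hub-and-spoke vs. near-clique distinction is to compare graph density with degree centralization, which is maximal for a star and zero for a clique. This is just my own heuristic sketch (using networkx; building the co-occurrence subgraph itself is assumed, not shown), not something we have validated:

```python
import networkx as nx

def cooccurrence_shape(G: nx.Graph) -> dict:
    """Rough shape statistics for a feature co-occurrence subgraph.
    High centralization + low density suggests hub-and-spoke
    (compositional?); low centralization + high density suggests a
    near-clique (ambiguity?). Heuristic only."""
    n = G.number_of_nodes()
    degrees = [d for _, d in G.degree()]
    max_deg = max(degrees)
    # Freeman degree centralization: 1.0 for a star, 0.0 for a clique.
    centralization = (
        sum(max_deg - d for d in degrees) / ((n - 1) * (n - 2)) if n > 2 else 0.0
    )
    return {
        "density": nx.density(G),
        "degree_centralization": centralization,
    }

# Example: a 5-node star vs. a 5-node clique.
print(cooccurrence_shape(nx.star_graph(4)))      # high centralization, low density
print(cooccurrence_shape(nx.complete_graph(5)))  # zero centralization, density 1.0
```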
Nevertheless, I think this points to a way we could plausibly quantify composition vs. ambiguity.
Regarding non-linear / circular projections, my intuition is that these go hand in hand with compositionality, but I would not say so for certain.
But trying to nail down the relation between the co-occurrence graph structure and the type of co-occurrence is certainly something I would like to look into further.