Tree Methodology
To generate the trees at sparselatents.com/tree_view, I use a variant of Masked Cosine Similarity (MCS), a metric introduced in Towards Monosemanticity. The original MCS is calculated like this: for any two latents A and B, first compute the cosine similarity between their activations, considering only tokens where latent A is active. Then compute the same similarity, considering only tokens where latent B is active. The final MCS value is the larger of these two similarities.
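As a rough illustration of the original MCS described above, here is a minimal sketch in plain Python. It assumes each latent's activations are given as a list of floats over the same tokens, and that "active" means an activation strictly greater than zero. The function names `masked_cosine` and `mcs` are made up for this example.

```python
import math


def masked_cosine(x, y, mask):
    """Cosine similarity between x and y, using only positions where mask is True."""
    dot = sum(a * b for a, b, m in zip(x, y, mask) if m)
    norm_x = math.sqrt(sum(a * a for a, m in zip(x, mask) if m))
    norm_y = math.sqrt(sum(b * b for b, m in zip(y, mask) if m))
    if norm_x == 0.0 or norm_y == 0.0:
        # Convention chosen for this sketch: undefined similarity counts as 0.
        return 0.0
    return dot / (norm_x * norm_y)


def mcs(acts_a, acts_b):
    """Original (symmetric) MCS: the larger of the two masked similarities."""
    # Assumption: a latent is "active" on a token when its activation is > 0.
    active_a = [a > 0 for a in acts_a]
    active_b = [b > 0 for b in acts_b]
    return max(
        masked_cosine(acts_a, acts_b, active_a),
        masked_cosine(acts_a, acts_b, active_b),
    )
```

For example, if A fires on two tokens and B fires on only one of them, the similarity masked on B's tokens is 1 while the one masked on A's tokens is lower, so the max reports 1.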
Instead of taking the max, I use a directed MCS: the cosine similarity between A’s and B’s activations, computed only on tokens where B is active. I then multiply this directed MCS score by max(B activations)/max(A activations), which down-weights latents that don’t fire very much. I’m not sure that this multiplication step is necessary.
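The directed variant can be sketched the same way. This is a sketch under assumptions, not the exact code behind the site: "active" is taken to mean activation > 0, the two maxima are taken over all tokens in the sample, and `directed_mcs` and its `rescale` flag are hypothetical names.

```python
import math


def directed_mcs(acts_a, acts_b, rescale=True):
    """Cosine similarity of A and B on tokens where B is active,
    optionally multiplied by max(B activations) / max(A activations)."""
    # Assumption: B is "active" on a token when its activation is > 0.
    pairs = [(a, b) for a, b in zip(acts_a, acts_b) if b > 0]
    if not pairs:
        return 0.0  # B never fires on this sample

    dot = sum(a * b for a, b in pairs)
    norm_a = math.sqrt(sum(a * a for a, _ in pairs))
    norm_b = math.sqrt(sum(b * b for _, b in pairs))  # > 0 because every b > 0
    if norm_a == 0.0:
        return 0.0  # A is silent wherever B fires

    score = dot / (norm_a * norm_b)
    if rescale:
        # Assumption: both maxima are taken over every token, not just
        # the tokens where B is active.
        max_a = max(acts_a)
        if max_a <= 0:
            return 0.0
        score *= max(acts_b) / max_a
    return score
```

Passing `rescale=False` gives the plain directed score, which makes it easy to check how much the multiplication step changes which latents pass the threshold.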
I also use a higher threshold of 0.6.
Starting from a parent latent, say S/1/12, I find all latents in a larger-width SAE (say S/2) that pass the directed MCS threshold. Then, I re-apply the method to those S/2 latents to find children in S/3.
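The level-by-level search might look roughly like the sketch below. It assumes each wider SAE is available as a dict mapping a latent ID to its activations on a shared set of tokens, that `score_fn(parent_acts, child_acts)` computes the directed score, and that "passing" the threshold means a score of at least 0.6. The names `find_children` and `expand` are hypothetical.

```python
def find_children(parent_acts, candidates, score_fn, threshold=0.6):
    """Return {child_id: score} for candidate latents that pass the threshold.

    candidates: dict mapping latent ID -> activations in the next-wider SAE.
    """
    children = {}
    for child_id, child_acts in candidates.items():
        score = score_fn(parent_acts, child_acts)
        # Assumption: "passing" means score >= threshold.
        if score >= threshold:
            children[child_id] = score
    return children


def expand(root_id, root_acts, wider_saes, score_fn, threshold=0.6):
    """Re-apply find_children across SAEs ordered by increasing width.

    Returns a list of (parent_id, child_id, score) edges. A child can
    appear under several parents, so the result is a DAG in general.
    """
    edges = []
    frontier = {root_id: root_acts}
    for sae in wider_saes:
        next_frontier = {}
        for parent_id, parent_acts in frontier.items():
            found = find_children(parent_acts, sae, score_fn, threshold)
            for child_id, score in found.items():
                edges.append((parent_id, child_id, score))
                next_frontier[child_id] = sae[child_id]
        frontier = next_frontier
    return edges
```

Returning a flat edge list keeps the multi-parent information intact, so the decision about how to render it can be made separately.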
The result is often a DAG rather than a tree, since some of the identified S/3 latents have more than one parent in S/2. To simplify rendering, I assign each such latent to the parent with which it has the highest score. This obscures the true structure, but I wasn’t sure of a clean way to render these DAGs automatically.
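Collapsing the DAG into a tree by keeping each child's highest-scoring parent can be sketched as follows. The edge format `(parent_id, child_id, score)` and the name `dag_to_tree` are assumptions for illustration. Ties go to the first edge seen, which is an arbitrary choice.

```python
def dag_to_tree(edges):
    """Keep, for each child, only the edge from its highest-scoring parent.

    edges: iterable of (parent_id, child_id, score) tuples.
    Returns a dict mapping child_id -> chosen parent_id.
    """
    best = {}  # child_id -> (parent_id, score)
    for parent_id, child_id, score in edges:
        # Strict '>' means the first edge seen wins when scores tie.
        if child_id not in best or score > best[child_id][1]:
            best[child_id] = (parent_id, score)
    return {child_id: parent_id for child_id, (parent_id, _) in best.items()}
```

The discarded edges are exactly the structure that the rendered tree hides, so keeping the original edge list around makes it possible to inspect multi-parent latents later.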
The trees should be thought of not as fully displaying the structure of the model, but as surfacing small sets of latents that I expect to demonstrate feature absorption when viewed together.