It was quite the bummer that the e2e SAEs have different feature directions across seeds, but then I thought: “are the encoder directions different, though?”
Intuitively, the encoder directions determine which datapoints each feature activates on, and the decoder is the causal downstream effect. For e2e, we would expect widely different decoder directions because there are many free parameters (some other work showed that the SVD of gradients had many zero singular values, meaning that moving in most directions doesn’t affect the downstream loss), but not necessarily different encoder directions.
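To make the encoder/decoder split concrete, here is a minimal sketch of a generic ReLU SAE forward pass. The toy sizes, the random weights, and the `sae_forward` helper are all illustrative assumptions, not the architecture or API of e2e_sae. It shows that swapping the decoder changes the reconstruction but leaves the set of datapoints each feature fires on untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 16  # toy sizes, not the paper's


def sae_forward(x, w_enc, b_enc, w_dec):
    """Return (feature activations, reconstruction) for a batch of inputs."""
    acts = np.maximum(x @ w_enc + b_enc, 0.0)  # encoder decides which inputs fire
    recon = acts @ w_dec                       # decoder writes the downstream effect
    return acts, recon


# Hypothetical random weights; a real SAE would be loaded from a checkpoint.
w_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
w_dec = rng.normal(size=(n_features, d_model))
w_dec_other = rng.normal(size=(n_features, d_model))  # a different decoder

x = rng.normal(size=(4, d_model))
acts, recon = sae_forward(x, w_enc, b_enc, w_dec)
acts_other, recon_other = sae_forward(x, w_enc, b_enc, w_dec_other)
```

With the encoder held fixed, `acts` and `acts_other` are identical while the reconstructions differ, which is the sense in which similar encoders across seeds would still be informative even if the decoders disagree.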
If the encoder directions are similar across seeds, I’d trust them to inform relevant features for the model output (in cases where we don’t care about connections w/ downstream layers).
However, I was not able to find the SAEs for various seeds.
Trying to replicate Cos-sim Plots
I downloaded the similar-CE SAEs at layer 6 for all three SAE types & took their cos-sim (last column in Figure 3).
I think your cos-sim metric gives different results if you take the max over the first or second dimension (or, equivalently, swap the order in which the decoders are multiplied by each other). If so, I think this is because you might double-count or something? Regardless, I ended up using the Hungarian algorithm to take the overall max (without double-counting), but it runs on CPU, so I only did the first 10k of the 40k features. Below are the results for both encoder & decoder, which do replicate the directional results.
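The one-to-one matching described above can be sketched with SciPy's `linear_sum_assignment`, which solves the assignment problem that the Hungarian algorithm addresses. This is a minimal illustration on random toy dictionaries, not the commenter's actual code; the `unit_rows` and `matched_cos_sim` helpers and the one-direction-per-row layout are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def unit_rows(m):
    """Normalize each row (one dictionary direction per row) to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def matched_cos_sim(dict_a, dict_b):
    """Cos-sim of each row of dict_a with its one-to-one partner in dict_b."""
    sims = unit_rows(dict_a) @ unit_rows(dict_b).T
    # Pick the pairing that maximizes total similarity, using each row
    # and each column at most once.
    rows, cols = linear_sum_assignment(sims, maximize=True)
    return sims[rows, cols]


rng = np.random.default_rng(0)
dict_a = rng.normal(size=(50, 16))  # toy stand-ins for two dictionaries
dict_b = rng.normal(size=(50, 16))

sims = unit_rows(dict_a) @ unit_rows(dict_b).T
greedy = sims.max(axis=1)                  # partners may be reused
matched = matched_cos_sim(dict_a, dict_b)  # partners used at most once
print("mean greedy max:", greedy.mean())
print("mean matched:   ", matched.mean())
```

Unlike the per-row max, the matched total does not depend on which dictionary is passed first, and its mean can never exceed the greedy mean.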
Nonzero Features
Additionally, I thought that some of the results might come from counting nonzero features, which for the encoder are some of the high-cos-sim features and for the decoder are the low-cos-sim features, weirdly enough.
Would appreciate if y’all upload any repeated seeds!
My code is temporarily hosted (for a few weeks maybe?) here.
Every SAE in the paper is hosted on wandb, but only some are hosted on huggingface, so I suggest loading them from wandb for now. We’ll upload more to huggingface if several people prefer that. Info for downloading from wandb can be found in the repo; the easiest way is probably:
```python
# pip install e2e_sae
# Save your wandb api key in .env
from e2e_sae import SAETransformer

model = SAETransformer.from_wandb("sparsify/gpt2/d8vgjnyc")
sae = list(model.saes.values())[0]  # Assumes only 1 sae in model, true for all saes in paper
encoder = sae.encoder[0]
dict_elements = sae.dict_elements  # Returns the normalized decoder elements
```
The wandb ids for different seeds can be found in the geometric analysis script here. That script, along with plot_performance.py, is a good place to see which wandb ids were used for each plot in the paper, as well as the exact code used to produce the plots in the paper (including the cosine sim plots you replicated above).
If you want to avoid the e2e_sae dependency, you can find the raw sae weights in the samples_400000.pt file in the respective wandb run. Just make sure to normalize the decoder weights after downloading (note that this was done before uploading to huggingface so people could load the SAEs into e.g. SAELens without having to worry about it).
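The normalization step mentioned above can be sketched as follows. The raw weight here is a random NumPy stand-in, and the layout (one dictionary element per column, shape `(d_model, n_features)`) is an assumption, so check the actual tensor in the downloaded file and adjust `axis` accordingly. The `normalize_dict_elements` helper is hypothetical; with PyTorch tensors, `torch.nn.functional.normalize` does the same job.

```python
import numpy as np


def normalize_dict_elements(decoder_weight, axis=0, eps=1e-12):
    """Scale each dictionary element to unit L2 norm along `axis`."""
    norms = np.linalg.norm(decoder_weight, axis=axis, keepdims=True)
    return decoder_weight / np.maximum(norms, eps)  # eps guards against zero vectors


# Stand-in for a raw decoder weight; assumed shape (d_model, n_features).
raw = np.random.default_rng(0).normal(size=(8, 16)) * 3.0
dict_elements = normalize_dict_elements(raw, axis=0)
```

After this, dot products between dictionary elements are cosine similarities directly.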
> If so, I think this is because you might double-count or something?
We do double count in the sense that, if, when comparing the similarity between A and B, element A_i has max cosine sim with B_j, we don’t remove B_j from being in the max cosine sim for other elements in A. It’s not obvious (to me at least) that we shouldn’t do this when summarising dictionary similarity in a single metric, though I agree there is a tonne of useful geometric comparison that isn’t covered by our single number. Really glad you’re digging deeper into this. I do think there is lots that can be learned here.
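One way to see how much reuse happens in a max-cos-sim summary is to count how often each B_j is picked as the best match. The sketch below uses random toy dictionaries with one direction per row; it is an illustration of the idea, not the paper's plotting code, and the helper names are assumptions.

```python
import numpy as np


def unit_rows(m):
    """Normalize each row (one dictionary direction per row) to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def max_cos_sim_with_reuse(dict_a, dict_b):
    """For each A_i, its best match in B, plus how often each B_j is chosen."""
    sims = unit_rows(dict_a) @ unit_rows(dict_b).T  # shape (n_a, n_b)
    best_j = sims.argmax(axis=1)
    best_sim = sims[np.arange(len(dict_a)), best_j]
    times_chosen = np.bincount(best_j, minlength=len(dict_b))
    return best_sim, times_chosen


rng = np.random.default_rng(0)
dict_a = rng.normal(size=(100, 8))
dict_b = rng.normal(size=(100, 8))
best_sim, times_chosen = max_cos_sim_with_reuse(dict_a, dict_b)
print("mean max cos-sim:", best_sim.mean())
print("B elements chosen more than once:", int((times_chosen > 1).sum()))
print("B elements never chosen:", int((times_chosen == 0).sum()))
```

A histogram of `times_chosen` alongside the single summary number would show whether a few elements of B are absorbing most of the matches.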
Btw it’s not intuitive to me that the encoder directions might be similar even though the decoder directions are not. Curious if you could share your intuitions here.
I finally checked!
Here is the Jaccard similarity (i.e. the similarity of input-token activations) across seeds:
The e2e ones do indeed have a much lower Jaccard sim (there is normally a spike at 1.0, but it goes away when you remove features that activate fewer than 10 times).
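A Jaccard comparison of this kind can be sketched as below: treat each feature as the set of tokens it fires on, drop features with fewer than 10 activations, and compute intersection over union for every cross-seed pair. The `jaccard_matrix` helper, the toy activations, and the choice to take each feature's best match with a per-row max are all assumptions, not the commenter's code.

```python
import numpy as np


def jaccard_matrix(acts_a, acts_b, min_activations=10):
    """Pairwise Jaccard similarity between the firing sets of two SAEs' features.

    acts_a, acts_b: (n_tokens, n_features) activations on the same tokens.
    Features firing on fewer than `min_activations` tokens are dropped
    (keep min_activations >= 1 so that no union is empty).
    """
    fires_a = acts_a > 0
    fires_b = acts_b > 0
    fires_a = fires_a[:, fires_a.sum(axis=0) >= min_activations]
    fires_b = fires_b[:, fires_b.sum(axis=0) >= min_activations]
    a = fires_a.astype(np.int64)
    b = fires_b.astype(np.int64)
    intersection = a.T @ b  # co-firing counts, shape (n_a, n_b)
    union = a.sum(axis=0)[:, None] + b.sum(axis=0)[None, :] - intersection
    return intersection / union


# Toy data: "seed B" is just a feature permutation of "seed A".
rng = np.random.default_rng(0)
acts_a = np.maximum(rng.normal(size=(1000, 20)) - 1.0, 0.0)
acts_b = acts_a[:, rng.permutation(20)]
jac = jaccard_matrix(acts_a, acts_b)
best = jac.max(axis=1)  # best cross-seed match per feature (with reuse)
```

In this toy case every feature has an exact partner, so `best` is 1.0 everywhere; real seeds would give a distribution between 0 and 1.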
I also (mostly) replicated the decoder similarity chart:
And calculated the encoder sim:
[I, again, needed to remove dead features (< 10 activations) to get the graphs here.]
So yes, I believe the original paper’s claim that e2e SAEs learn quite different features across seeds is substantiated.
Thanks so much! All the links and info will save me time:)
Regarding cos-sim, after thinking a bit, I think it’s more sinister. For cross-cos-sim comparison, you get different results if you take the max over the 0th or 1st dimension (equivalent to doing cos(local, e2e) vs cos(e2e, local)). As an example, you could have 2 features each, where 3 point in the same direction and 1 points opposite. Making up numbers:
feature-directions (1D) = [[1], [1]] & [[1], [-1]]
cos-sim = [[1, 1], [-1, -1]]
Taking the max over dim 0 gives [1, 1], while taking it over dim 1 gives [1, -1], so the mean max cos-sim is 1 in one direction and 0 in the other.
For more intuition, suppose 4 local features surround 1 e2e feature (and the other features point elsewhere). Then the 4 local features will all have a high max cos-sim, but only 1 e2e feature will. So it’s not just double-counting, but quadruple-counting. You can see this for yourself if you swap dim=1 to dim=0 in your code.
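The 4-around-1 picture can be checked with made-up 3D directions. The vectors below are purely illustrative (they are not taken from any trained SAE): four "local" directions sit close to the first "e2e" direction, and the remaining e2e directions point elsewhere.

```python
import numpy as np


def unit_rows(m):
    """Normalize each row (one direction per row) to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


# Four local directions clustered around the x-axis.
local = unit_rows(np.array([
    [1.0, 0.1, 0.0],
    [1.0, -0.1, 0.0],
    [1.0, 0.0, 0.1],
    [1.0, 0.0, -0.1],
]))
# One e2e direction on the x-axis, the rest pointing elsewhere.
e2e = unit_rows(np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, -1.0, 0.0],
]))

sims = local @ e2e.T           # rows: local features, columns: e2e features
per_local = sims.max(axis=1)   # best e2e match for each local feature
per_e2e = sims.max(axis=0)     # best local match for each e2e feature
chosen = sims.argmax(axis=1)   # which e2e feature each local feature picks

print("mean max cos-sim, local -> e2e:", per_local.mean())
print("mean max cos-sim, e2e -> local:", per_e2e.mean())
print("e2e index picked by each local feature:", chosen)
```

All four local features pick the same e2e feature, so the local-to-e2e mean is close to 1 while the e2e-to-local mean is much lower, even though the underlying similarity matrix is the same.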
But my original comment showed your results are still directionally correct when doing [global max w/ replacement] (if I coded it correctly).
> Btw it’s not intuitive to me that the encoder directions might be similar even though the decoder directions are not. Curious if you could share your intuitions here.
The decoder directions have degrees of freedom, but the encoder directions... might have similar degrees of freedom and I’m wrong, lol. BUT! They might be functionally equivalent, so they activate on similar datapoints across seeds. That is more laborious to check though, waaaah.
I can check both (encoder directions first), because previous literature is really only on the SVD of the gradient (i.e. the output), but an SAE might be more constrained when separating out inputs into sparse features. Thanks for prompting for my intuition!