For each SAE feature (i.e. each column of W_dec), we can find the other feature with which it has the highest cosine similarity. Here is a histogram of these maximum cosine similarities for Joseph Bloom’s SAE trained at resid_pre, layer 10 in gpt2-small. The corresponding plot for random features is shown for comparison:
The SAE features are much less orthogonal than the random ones. This effect persists if, instead of the maximum cosine similarity, we look at the 10th largest, or the 100th largest:
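The computation above can be sketched as follows. This is a minimal illustration, not the code used for the plots: the matrix sizes are placeholders, and the random Gaussian `W_dec` stands in for a real decoder matrix loaded from the trained SAE.

```python
import numpy as np

def kth_largest_offdiag_cos_sims(W, k=1):
    """For each column of W (shape (d_model, n_features)), return the k-th
    largest cosine similarity with any *other* column."""
    # Normalise columns to unit norm so dot products are cosine similarities
    U = W / np.linalg.norm(W, axis=0, keepdims=True)
    sims = U.T @ U                   # (n_features, n_features) similarity matrix
    np.fill_diagonal(sims, -np.inf)  # exclude each feature's self-similarity
    # Sort each row and take the k-th largest entry
    return np.sort(sims, axis=1)[:, -k]

rng = np.random.default_rng(0)
d_model, n_features = 768, 2048  # illustrative sizes, not the actual SAE's
W_dec = rng.standard_normal((d_model, n_features))  # stand-in for a real W_dec

max_sims = kth_largest_offdiag_cos_sims(W_dec, k=1)    # the histogram above
sims_10th = kth_largest_offdiag_cos_sims(W_dec, k=10)  # 10th-largest variant
```

For random features the maximum cosine similarities concentrate near zero (roughly on the scale of 1/sqrt(d_model)); the claim in the post is that a trained SAE's decoder columns have a much heavier right tail.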
I think it’s a good idea to include a loss term to incentivise feature orthogonality.
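One simple way such a term could look (a hypothetical sketch, not a loss from any published SAE): penalise the mean squared off-diagonal cosine similarity between decoder columns.

```python
import torch

def orthogonality_penalty(W_dec: torch.Tensor) -> torch.Tensor:
    """Mean squared off-diagonal cosine similarity between columns of
    W_dec (shape (d_model, n_features)). Zero iff all features are
    mutually orthogonal; a hypothetical auxiliary loss term."""
    U = W_dec / W_dec.norm(dim=0, keepdim=True)          # unit-norm columns
    gram = U.T @ U                                       # cosine similarities
    off_diag = gram - torch.eye(gram.shape[0], device=gram.device)
    return off_diag.pow(2).mean()

# Illustrative use during training (coefficient names are made up):
# loss = recon_loss + l1_coeff * sparsity_loss + ortho_coeff * orthogonality_penalty(sae.W_dec)
```

Note that for an overcomplete dictionary (n_features > d_model) exact orthogonality is impossible, so this term can only push similarities down, trading off against reconstruction and sparsity.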
Thanks, that’s very interesting!