For each SAE feature (i.e. each column of W_dec), we can find the other feature with which it has the highest cosine similarity. Here is a histogram of these maximum cosine similarities for Joseph Bloom’s SAE trained at resid_pre, layer 10 in gpt2-small. The corresponding plot for random features is shown for comparison:
The SAE features are much less orthogonal than the random ones. This effect persists if, instead of the maximum cosine similarity, we look at the 10th largest, or the 100th largest:
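The computation above can be sketched as follows. This is a minimal illustration, not the code used for the plots: the matrix sizes are placeholders, and the random Gaussian `W_dec` stands in for a real decoder matrix loaded from the trained SAE.

```python
import numpy as np

def kth_largest_offdiag_cos_sims(W, k=1):
    """For each column of W (shape (d_model, n_features)), return the k-th
    largest cosine similarity with any *other* column."""
    # Normalise columns to unit norm so dot products are cosine similarities
    U = W / np.linalg.norm(W, axis=0, keepdims=True)
    sims = U.T @ U                   # (n_features, n_features) similarity matrix
    np.fill_diagonal(sims, -np.inf)  # exclude each feature's self-similarity
    # Sort each row and take the k-th largest entry
    return np.sort(sims, axis=1)[:, -k]

rng = np.random.default_rng(0)
d_model, n_features = 768, 2048  # illustrative sizes, not the actual SAE's
W_dec = rng.standard_normal((d_model, n_features))  # stand-in for a real W_dec

max_sims = kth_largest_offdiag_cos_sims(W_dec, k=1)    # the histogram above
sims_10th = kth_largest_offdiag_cos_sims(W_dec, k=10)  # 10th-largest variant
```

For random features the maximum cosine similarities concentrate near zero (roughly on the scale of 1/sqrt(d_model)); the claim in the post is that a trained SAE's decoder columns have a much heavier right tail.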
I think it’s a good idea to include a loss term to incentivise feature orthogonality.
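One simple way such a term could look (a hypothetical sketch, not a loss from any published SAE): penalise the mean squared off-diagonal cosine similarity between decoder columns.

```python
import torch

def orthogonality_penalty(W_dec: torch.Tensor) -> torch.Tensor:
    """Mean squared off-diagonal cosine similarity between columns of
    W_dec (shape (d_model, n_features)). Zero iff all features are
    mutually orthogonal; a hypothetical auxiliary loss term."""
    U = W_dec / W_dec.norm(dim=0, keepdim=True)          # unit-norm columns
    gram = U.T @ U                                       # cosine similarities
    off_diag = gram - torch.eye(gram.shape[0], device=gram.device)
    return off_diag.pow(2).mean()

# Illustrative use during training (coefficient names are made up):
# loss = recon_loss + l1_coeff * sparsity_loss + ortho_coeff * orthogonality_penalty(sae.W_dec)
```

Note that for an overcomplete dictionary (n_features > d_model) exact orthogonality is impossible, so this term can only push similarities down, trading off against reconstruction and sparsity.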
Thanks, that’s very interesting!