jacob_drori comments on Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

jacob_drori 26 Oct 2023 22:58 UTC
1 point
Define the “frequent neurons” of the hidden layer to be those that fire with frequency > 1e-4. The image of this set of neurons under W_dec forms a set of vectors living in R^d_mlp, which I’ll call “frequent features”.
These frequent features are less orthogonal than I’d naively expect.
If we choose two vectors uniformly at random on the unit sphere in R^d_mlp, their cosine sim has mean 0 and variance 1/d_mlp ≈ 0.0005. But in your SAE, the mean cosine sim between distinct frequent features is roughly 0.0026, and the variance is 0.002, about four times the random baseline.
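A quick Monte Carlo check of that random baseline, as a minimal sketch. The value d_mlp = 2048 is an assumption, picked only because it gives 1/d_mlp ≈ 0.0005 as quoted above; the function name and sample count are illustrative too.

```python
import numpy as np

def random_cosine_stats(d, n_pairs=2000, seed=0):
    """Mean and variance of the cosine sim between random unit vectors in R^d."""
    rng = np.random.default_rng(seed)
    # Normalised Gaussian samples are uniformly distributed on the unit sphere.
    u = rng.standard_normal((n_pairs, d))
    v = rng.standard_normal((n_pairs, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = np.sum(u * v, axis=1)  # one cosine sim per independent pair
    return cos.mean(), cos.var()

d_mlp = 2048  # assumption: chosen so that 1/d_mlp is roughly 0.0005
mean, var = random_cosine_stats(d_mlp)
print(f"mean={mean:.5f}  var={var:.6f}  1/d_mlp={1 / d_mlp:.6f}")
```

With only a couple of thousand pairs the estimates are noisy, but the variance should land close to 1/d_mlp and the mean close to 0.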
So the frequent features have more cosine similarity than you’d get by just choosing a bunch of directions at random on the unit sphere in R^d_mlp. This effect persists even when you throw out the neuron-sparse features (as per your top10 definition).
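For concreteness, here is roughly how the statistics above could be computed. This is a sketch under assumptions, not the replication's actual code: I assume `W_dec` has shape (d_hidden, d_mlp) with one decoder direction per row, and `freqs` is a length-d_hidden array of firing frequencies. The function name is hypothetical, and the data below is a random stand-in, not the real SAE.

```python
import numpy as np

def frequent_feature_cosine_stats(W_dec, freqs, threshold=1e-4):
    """Mean/variance of cosine sim over distinct pairs of frequent features."""
    frequent = W_dec[freqs > threshold]  # rows for neurons firing often enough
    unit = frequent / np.linalg.norm(frequent, axis=1, keepdims=True)
    sims = unit @ unit.T  # all pairwise cosine sims
    # Keep only the strict upper triangle: each distinct pair once, no diagonal.
    rows, cols = np.triu_indices(len(unit), k=1)
    pair_sims = sims[rows, cols]
    return pair_sims.mean(), pair_sims.var()

# Random stand-in data (assumption): Gaussian decoder, log-uniform frequencies.
rng = np.random.default_rng(0)
W_dec = rng.standard_normal((512, 256))
freqs = 10.0 ** rng.uniform(-6, -2, size=512)
mean_sim, var_sim = frequent_feature_cosine_stats(W_dec, freqs)
print(f"mean={mean_sim:.5f}  var={var_sim:.5f}  1/d={1 / 256:.5f}")
```

On the random stand-in this should just reproduce the baseline (mean near 0, variance near 1/d); the interesting case is swapping in the trained decoder and seeing the numbers move away from it.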
Any idea why this might be the case? My previous intuition had been that transformers try to pack in their features as orthogonally as possible, but it looks like I might’ve been wrong about this. I’d also be interested to know if a similar effect is found in the residual stream, or if it’s entirely due to some weirdness with relu picking out a preferred basis for the MLP hidden layer.
- Neel Nanda 27 Oct 2023 18:09 UTC
  2 points
  Interesting! My guess is that the numbers are small enough that there’s not much to it? But I share your prior that it should be basically orthogonal. The MLP basis is weird and privileged and I don’t feel well equipped to reason about it.