A quick-and-dirty first experiment with adding an orthogonality regularizer indicates that this can work without too much penalty on the reconstruction loss. I trained an SAE on the MLP output of a 1-layer model with dictionary size 8192 (16 times the MLP output size of 512).
I trained this without the regularizer and got a reconstruction score of 0.846 at an L0 of ~17. With the regularizer, I got a reconstruction score of 0.828 at an L0 of ~18.
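For concreteness, here is a rough sketch of how those two metrics could be computed. The exact definition of "reconstruction score" isn't pinned down here, so this assumes fraction of variance explained; `sae` and `acts` are stand-in names for the autoencoder and a batch of MLP-output activations.

```python
import torch

@torch.no_grad()
def sae_metrics(sae, acts):
    """Estimate L0 and a variance-explained reconstruction score.

    Assumes `sae(acts)` returns (reconstruction, feature_activations)
    for a batch of MLP-output activations `acts` of shape (batch, 512).
    """
    recon, feats = sae(acts)
    # L0: average number of nonzero dictionary features per input.
    l0 = (feats != 0).float().sum(dim=-1).mean().item()
    # Reconstruction score: 1 - residual variance / total variance.
    resid = (acts - recon).pow(2).sum()
    total = (acts - acts.mean(dim=0)).pow(2).sum()
    score = (1.0 - resid / total).item()
    return l0, score
```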
Looking at the cosine similarities between the learned dictionary features:
Interesting peaks around cosine similarities of 0.3 and 0.05 there! Maybe (very speculative) that tells us something about the way the model encodes features in superposition?
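In case it's useful, here's roughly how such a plot can be produced from the decoder weights. This assumes the histogram is over each feature's most similar other feature (which is how I read the nearest-neighbour discussion below); `W_dec` of shape (8192, 512), with rows as dictionary directions, is an assumption about the layout.

```python
import torch
import matplotlib.pyplot as plt

def nn_cosine_hist(W_dec):
    """Histogram of each feature's max cosine similarity to any other feature."""
    d = torch.nn.functional.normalize(W_dec.detach(), dim=-1)
    cos = d @ d.T                    # (8192, 8192) pairwise cosine similarities
    cos.fill_diagonal_(-1.0)         # exclude each feature's similarity to itself
    nn_sim = cos.max(dim=-1).values  # nearest-neighbour similarity per feature
    plt.hist(nn_sim.cpu().numpy(), bins=100)
    plt.xlabel("max cosine similarity to another feature")
    plt.ylabel("count")
    plt.show()
    return nn_sim
```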
The peaks at 0.05 and 0.3 are strange. What regulariser did you use? Also, could you check whether all features whose nearest neighbour has cosine similarity 0.3 have the same nearest neighbour (and likewise for 0.05)?
I suspect the 0.05 peak might be close to the minimum achievable cosine similarity when distributing 8192 vectors as uniformly as possible over a 512-dimensional space (for what it's worth, the Welch bound for 8192 unit vectors in 512 dimensions is √((8192−512)/(512·8191)) ≈ 0.043, which is in the right ballpark). I used a bit of a weird regularizer where I penalized:
mean cosine similarity + mean max cosine similarity + max max cosine similarity
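In code, the penalty looks roughly like the sketch below. Two assumptions that the description above doesn't settle: the similarities are taken between decoder (dictionary) directions, and in absolute value.

```python
import torch

def orthogonality_penalty(W_dec):
    """mean + mean-max + max-max absolute cosine similarity between features."""
    d = torch.nn.functional.normalize(W_dec, dim=-1)
    cos = (d @ d.T).abs()
    # Mask the diagonal so features aren't compared with themselves.
    eye = torch.eye(cos.shape[0], dtype=torch.bool, device=cos.device)
    mean_sim = cos[~eye].mean()
    max_per_feature = cos.masked_fill(eye, 0.0).max(dim=-1).values
    return mean_sim + max_per_feature.mean() + max_per_feature.max()
```

This gets added to the usual reconstruction + sparsity loss with some coefficient. Note that the full 8192 × 8192 similarity matrix is recomputed every step, so for much larger dictionaries one might penalize a random subsample of feature pairs instead.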
I will check later whether the features in the 0.3 peak all share the same nearest neighbour; something like the sketch below should do it.
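Here the peak boundaries `lo`/`hi` are eyeballed from the histogram and are assumptions:

```python
import torch

def check_shared_neighbour(W_dec, lo=0.28, hi=0.32):
    """Do features whose nearest neighbour sits in the ~0.3 peak share one neighbour?"""
    d = torch.nn.functional.normalize(W_dec.detach(), dim=-1)
    cos = d @ d.T
    cos.fill_diagonal_(-1.0)
    nn_sim, nn_idx = cos.max(dim=-1)  # similarity and index of nearest neighbour
    in_peak = (nn_sim >= lo) & (nn_sim <= hi)
    neighbours = nn_idx[in_peak]
    print(f"{int(in_peak.sum())} features in the peak, "
          f"{neighbours.unique().numel()} distinct nearest neighbours")
```

If nearly all of them point at one or a few shared neighbours, that would suggest a small cluster of near-duplicate features rather than a property of the geometry as a whole.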
Nice, that’s promising! It would also be interesting to see how those peaks are affected when you retrain the SAE both on the same target model and on different target models.