There are a few things to note. Later layers have:

1. Worse CE-diff & variance explained (e.g. the layer 0 CE-diff seems great!)
2. Larger L2 norms in the original LLM activations
3. A worse ratio of reconstruction-L2 to original-L2 (meaning the reconstructions are under-normed)*
4. Fewer dead features (maybe they need more features?)
For (3), we might expect under-normed reconstructions because there's a trade-off between the L1 and MSE terms. After training, however, we can freeze the encoder, which locks in the L0, and then train only the decoder (or a scalar multiple of the hidden layer) on MSE alone (h/t to Ben Wright for first figuring this out).
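
A minimal sketch of that post-hoc fix, assuming a standard one-hidden-layer ReLU SAE with separate encoder/decoder linear layers. The `SparseAutoencoder` class, batch size, step counts, and learning rates below are illustrative placeholders, not the actual training setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative one-hidden-layer ReLU SAE (not the exact architecture used here)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(h)

def finetune_decoder(sae: SparseAutoencoder, acts: torch.Tensor,
                     lr: float = 1e-4, steps: int = 1000,
                     batch_size: int = 4096) -> None:
    """Freeze the encoder (fixing which features fire, hence L0) and
    fine-tune only the decoder on pure MSE, with no L1 term to shrink norms."""
    for p in sae.encoder.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(sae.decoder.parameters(), lr=lr)
    for _ in range(steps):
        batch = acts[torch.randint(0, acts.shape[0], (batch_size,))]
        loss = (sae(batch) - batch).pow(2).mean()  # MSE only
        opt.zero_grad()
        loss.backward()
        opt.step()

def finetune_scale(sae: SparseAutoencoder, acts: torch.Tensor,
                   lr: float = 1e-3, steps: int = 1000,
                   batch_size: int = 4096) -> torch.Tensor:
    """Cheaper variant: freeze the whole SAE and learn a single scalar on the
    hidden layer that rescales the reconstruction."""
    for p in sae.parameters():
        p.requires_grad = False
    scale = torch.ones((), requires_grad=True)
    opt = torch.optim.Adam([scale], lr=lr)
    for _ in range(steps):
        batch = acts[torch.randint(0, acts.shape[0], (batch_size,))]
        recon = sae.decoder(scale * torch.relu(sae.encoder(batch)))
        loss = (recon - batch).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return scale.detach()
```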
For (4), it seems like a pretty easy experiment to vary the number of features and see whether this explains part of the gap.
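
A hedged sketch of that sweep; `train_sae`, `layer_acts`, and `heldout_acts` are placeholders for whatever training loop and activation dumps you already have, and the widths are arbitrary:

```python
import torch

@torch.no_grad()
def sae_diagnostics(sae, acts: torch.Tensor) -> dict:
    """Fraction of features that never fire on this batch, and the
    mean reconstruction-L2 / original-L2 ratio."""
    h = torch.relu(sae.encoder(acts))
    recon = sae.decoder(h)
    dead_frac = (h.sum(dim=0) == 0).float().mean().item()
    norm_ratio = (recon.norm(dim=-1) / acts.norm(dim=-1)).mean().item()
    return {"dead_frac": dead_frac, "norm_ratio": norm_ratio}

# Placeholder sweep over dictionary sizes on one layer's activations.
for d_hidden in [4096, 8192, 16384, 32768]:
    sae = train_sae(d_hidden=d_hidden, acts=layer_acts)
    print(d_hidden, sae_diagnostics(sae, heldout_acts))
```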
*