There are a few things to note. Later layers have:

1. Worse CE-diff & variance explained (e.g. the layer 0 CE-diff seems great!)
2. Larger L2 norms in the original LLM activations
3. A worse ratio of reconstruction-L2 to original-L2 (meaning the reconstructions are under-normed)*
4. Fewer dead features (maybe they need more features?)
For (3), we might expect under-normed reconstructions because there's a trade-off between the L1 and MSE terms. After training, however, we can freeze the encoder, which locks in the L0, and then train only the decoder (or a scalar multiple of the hidden layer) on MSE alone (h/t to Ben Wright for first figuring this out).
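
A minimal sketch of that post-hoc fix, assuming a standard one-hidden-layer ReLU SAE with separate encoder/decoder linear layers. The `SparseAutoencoder` class, batch size, step counts, and learning rates below are illustrative placeholders, not the actual training setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative one-hidden-layer ReLU SAE (not the exact architecture used here)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(h)

def finetune_decoder(sae: SparseAutoencoder, acts: torch.Tensor,
                     lr: float = 1e-4, steps: int = 1000,
                     batch_size: int = 4096) -> None:
    """Freeze the encoder (fixing which features fire, hence L0) and
    fine-tune only the decoder on pure MSE, with no L1 term to shrink norms."""
    for p in sae.encoder.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(sae.decoder.parameters(), lr=lr)
    for _ in range(steps):
        batch = acts[torch.randint(0, acts.shape[0], (batch_size,))]
        loss = (sae(batch) - batch).pow(2).mean()  # MSE only
        opt.zero_grad()
        loss.backward()
        opt.step()

def finetune_scale(sae: SparseAutoencoder, acts: torch.Tensor,
                   lr: float = 1e-3, steps: int = 1000,
                   batch_size: int = 4096) -> torch.Tensor:
    """Cheaper variant: freeze the whole SAE and learn a single scalar on the
    hidden layer that rescales the reconstruction."""
    for p in sae.parameters():
        p.requires_grad = False
    scale = torch.ones((), requires_grad=True)
    opt = torch.optim.Adam([scale], lr=lr)
    for _ in range(steps):
        batch = acts[torch.randint(0, acts.shape[0], (batch_size,))]
        recon = sae.decoder(scale * torch.relu(sae.encoder(batch)))
        loss = (recon - batch).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return scale.detach()
```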
For (4), it seems like a pretty easy experiment to vary the number of features and see whether this explains part of the gap.
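
A hedged sketch of that sweep; `train_sae`, `layer_acts`, and `heldout_acts` are placeholders for whatever training loop and activation dumps you already have, and the widths are arbitrary:

```python
import torch

@torch.no_grad()
def sae_diagnostics(sae, acts: torch.Tensor) -> dict:
    """Fraction of features that never fire on this batch, and the
    mean reconstruction-L2 / original-L2 ratio."""
    h = torch.relu(sae.encoder(acts))
    recon = sae.decoder(h)
    dead_frac = (h.sum(dim=0) == 0).float().mean().item()
    norm_ratio = (recon.norm(dim=-1) / acts.norm(dim=-1)).mean().item()
    return {"dead_frac": dead_frac, "norm_ratio": norm_ratio}

# Placeholder sweep over dictionary sizes on one layer's activations.
for d_hidden in [4096, 8192, 16384, 32768]:
    sae = train_sae(d_hidden=d_hidden, acts=layer_acts)
    print(d_hidden, sae_diagnostics(sae, heldout_acts))
```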
*