Rohin Shah comments on Improving Dictionary Learning with Gated Sparse Autoencoders

Rohin Shah 25 Apr 2024 22:54 UTC
Thinking on this a bit more, this might actually reflect a general issue with the way we think about feature shrinkage; namely, that whenever there is a nonzero angle between two vectors of the same length, the best way to make either vector close to the other will be by shrinking it.
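As a short worked illustration of that geometric claim (the symbols $u$, $v$, $r$, $θ$ and $α$ are introduced here for illustration only and are not from the original discussion): take two vectors $u$ and $v$ with $\|u\| = \|v\| = r$ and angle $θ$ between them, and ask which rescaling $α u$ lands closest to $v$.

$$
\|α u - v\|^2 = α^2 \|u\|^2 - 2 α \, u \cdot v + \|v\|^2 = r^2 \left(α^2 - 2 α \cos θ + 1\right)
$$

$$
\frac{d}{dα}\, r^2 \left(α^2 - 2 α \cos θ + 1\right) = 0 \;\Longrightarrow\; α^* = \cos θ, \qquad \|α^* u - v\|^2 = r^2 \sin^2 θ
$$

At $α = 1$ the squared distance is $r^2 (2 - 2\cos θ)$, which is strictly larger than $r^2 \sin^2 θ = r^2 (1 - \cos θ)(1 + \cos θ)$ whenever $θ \neq 0$. So under these assumptions the optimal scale satisfies $|α^*| < 1$ for any angle other than $0$ or $π$, which is the sense in which shrinking is "inherent" here.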
This was actually the key motivation for building this metric in the first place, instead of just looking at the ratio $\frac{E[\|\hat{x}\|^2]}{E[\|x\|^2]}$. Looking at the $γ$ that would optimize the reconstruction loss ensures that we’re capturing only bias from the L1 regularization, and not capturing the “inherent” need to shrink the vector given these nonzero angles. (In particular, if we computed $\frac{E[\|\hat{x}\|^2]}{E[\|x\|^2]}$ for Gated SAEs, I expect that would be below 1.)
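The difference between the two quantities can be seen in a small synthetic sketch. It rests on assumptions that are not stated in this comment: that $γ$ is the scalar minimizing $E[\|\hat{x}/γ - x\|^2]$, which has the closed form $E[\|\hat{x}\|^2] / E[\hat{x} \cdot x]$, and that an orthogonal projection can stand in for a reconstruction with no L1-induced bias. The helper names `norm_ratio` and `rescaling_gamma` are hypothetical, and no real SAE is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10_000, 16, 8  # samples, input dimension, kept dimensions (all illustrative)

# Synthetic "activations".
x = rng.normal(size=(n, d))

# Stand-in for an unbiased reconstruction: orthogonal projection onto the
# first k coordinates. It loses norm, but not because of any L1 penalty.
x_hat = x.copy()
x_hat[:, k:] = 0.0


def norm_ratio(x_hat, x):
    """Naive ratio E[||x_hat||^2] / E[||x||^2]."""
    return np.mean(np.sum(x_hat**2, axis=1)) / np.mean(np.sum(x**2, axis=1))


def rescaling_gamma(x_hat, x):
    """Assumed definition: the gamma minimizing E[||x_hat / gamma - x||^2],
    whose closed form is E[||x_hat||^2] / E[x_hat . x]."""
    return np.mean(np.sum(x_hat**2, axis=1)) / np.mean(np.sum(x_hat * x, axis=1))


ratio_unbiased = norm_ratio(x_hat, x)        # well below 1 (about k / d)
gamma_unbiased = rescaling_gamma(x_hat, x)   # 1 up to floating point error

# Now add artificial shrinkage on top, mimicking an L1-style bias.
x_hat_shrunk = 0.8 * x_hat
gamma_shrunk = rescaling_gamma(x_hat_shrunk, x)  # recovers the 0.8 factor

print(ratio_unbiased, gamma_unbiased, gamma_shrunk)
```

In this toy setup the naive norm ratio sits well below 1 even though nothing has been shrunk beyond what the geometry requires, while the rescaling factor stays at 1 and only moves once the extra 0.8 scaling is applied.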
I think the main thing we got wrong is that we accidentally treated $E[\|\hat{x} - x\|^2]$ as though it were $E[\|\hat{x} - γ x\|^2]$. To the extent that was the main mistake, I think it explains why our results still look how we expected them to: usually $γ$ is going to be close to 1 (and should be almost exactly 1 if shrinkage is solved), so in practice the error introduced from this mistake is going to be extremely small.
We’re going to take a closer look at this tomorrow, check everything more carefully, and post an update after doing that. I think it’s probably worth waiting for that—I expect we’ll provide much more detailed derivations that make everything a lot clearer.