Nice work! I’m not sure I fully understand what the “gated-ness” is adding, i.e. what role the Heaviside step function is playing. What would happen if we did away with it? Namely, consider this setup:
Let $f$ and $\hat{x}$ be the encoder and decoder functions, as in your paper, and let $x$ be the model activation that is fed into the SAE.
The usual SAE reconstruction is $\hat{x}(f(x))$, which suffers from the shrinkage problem.
Now, introduce a new learned parameter $t \in \mathbb{R}^{n_\text{features}}$, and define an “expanded” reconstruction $y_\text{expanded} = \hat{x}(t \odot f(x))$, where $\odot$ denotes elementwise multiplication.
Finally, take the loss to be:
$$L = \left\lVert \hat{x}_\text{copy}(f(x)) - x \right\rVert_2^2 + \left\lVert y_\text{expanded} - x \right\rVert_2^2 + \lambda \left\lVert f(x) \right\rVert_1,$$
where $\hat{x}_\text{copy}$ ensures the decoder gets no gradients from the first term. As I understand it, this is exactly the loss appearing in your paper. The only difference in the setup is the lack of the Heaviside step function.
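In case it helps, here is a minimal PyTorch sketch of what I have in mind (the names `RescaledSAE`, `W_enc`, `W_dec`, and `t` are my own placeholders, not from the paper); the `detach()` on the decoder parameters plays the role of $\hat{x}_\text{copy}$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RescaledSAE(nn.Module):
    """Ordinary SAE plus a learned per-feature rescale t, with no Heaviside gate."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.t = nn.Parameter(torch.ones(n_features))  # the learned rescale t

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(x @ self.W_enc + self.b_enc)  # f(x)

    def decode(self, f: torch.Tensor, freeze_decoder: bool = False) -> torch.Tensor:
        W_dec = self.W_dec.detach() if freeze_decoder else self.W_dec
        b_dec = self.b_dec.detach() if freeze_decoder else self.b_dec
        return f @ W_dec + b_dec

    def loss(self, x: torch.Tensor, l1_coeff: float) -> torch.Tensor:
        f = self.encode(x)
        # Term 1: reconstruct from the raw features f(x) with a frozen decoder
        # (the x_copy term), so the decoder gets no gradients from it.
        aux = (self.decode(f, freeze_decoder=True) - x).pow(2).sum(-1).mean()
        # Term 2: the "expanded" reconstruction x_hat(t ⊙ f(x)), trained end to end.
        recon = (self.decode(self.t * f) - x).pow(2).sum(-1).mean()
        # Term 3: L1 sparsity penalty on the raw feature activations f(x).
        sparsity = l1_coeff * f.abs().sum(-1).mean()
        return aux + recon + sparsity
```

(The exact reductions over batch and model dimensions are just one convention and not the point of the suggestion.)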
Did you try this setup? Or does it fail for an obvious reason I missed?
This suggestion seems less expressive than (but similar in spirit to) the “rescale & shift” baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn’t capture all the benefits of Gated SAEs.
The core point is that $L_1$ regularization introduces many biases, of which shrinkage is just one example, so you want to localize its effect as much as possible. In our setup the $L_1$ penalty applies to $\mathrm{ReLU}(\pi_\text{gate}(x))$, so you might think of $\pi_\text{gate}$ as “tainted” and want to rely on it as little as possible. The only thing you really need $L_1$ for is to deter the model from making too many features active, i.e. you need it to apply to one bit per feature (whether that feature is on or off). The Heaviside step function ensures we extract exactly that one bit from $\pi_\text{gate}$ and rely on $f_\text{mag}$ for everything else.
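To make the contrast concrete, here is a minimal sketch of the gated encoder’s forward pass (parameter names are illustrative, and the weight-tying between the two paths and the straight-through training details from the paper are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEncoderSketch(nn.Module):
    """Sketch of a gated encoder: the gate path contributes only one bit per
    feature (on/off), while the magnitude path supplies the actual activations."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(n_features))
        # In the paper the magnitude weights share directions with W_gate up to a
        # learned per-feature rescale; they are free parameters here for brevity.
        self.W_mag = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_mag = nn.Parameter(torch.zeros(n_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pi_gate = x @ self.W_gate + self.b_gate      # the L1 penalty applies to ReLU(pi_gate)
        gate = (pi_gate > 0).float()                 # Heaviside: extracts just the on/off bit
        f_mag = F.relu(x @ self.W_mag + self.b_mag)  # magnitudes, untouched by the L1 penalty
        return gate * f_mag                          # feature activations passed to the decoder
```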