chanind comments on Toy Models of Feature Absorption in SAEs

chanind 9 Oct 2024 10:34 UTC
1 point
0
Thank you for sharing this! I clearly didn’t read the original “Towards Monosemanticity” closely enough! It seems like the main argument is that when the weights are untied, the encoder and decoder learn different vectors, and that this is evidence that the encoder and decoder should be untied. But this is consistent with the feature absorption work: we see the encoder and decoder learning different things, not because the SAE is learning better representations, but because the SAE is finding degenerate solutions which increase sparsity.
Are there any known patterns of feature firings where untying the encoder and decoder results in the SAE finding the correct or better representations, but where tying the encoder and decoder does not?
- CallumMcDougall 12 Oct 2024 21:05 UTC
  3 points
  0
  Parent
  I don’t know of specific examples, but this is the image I have in my head when thinking about why untied weights are more free than tied weights:
  More generally, I think this is why studying SAEs in the TMS setup can be a bit challenging: there’s often too much symmetry and not enough complexity for untied weights to be useful, meaning just forcing your weights to be tied can fix a lot of problems! (We include it in ARENA mostly for illustration of key concepts, not because it gets you many super informative results.) But I’m keen for more work like this trying to understand feature absorption better in more tractable cases.
  - RogerDearnaley 20 Nov 2024 21:42 UTC
    2 points
    0
    Parent
    I think an approach I’d try would be to keep the encoder and decoder weights untied (or possibly add a loss term to mildly encourage them to be similar), but then analyze the patterns between them (both for an individual feature and between pairs of features) for evidence of absorption. Absorption is annoying, but it’s only really dangerous if you don’t know it’s happening and it causes you to think a feature is inactive when it’s instead non-obviously active via another feature it’s been absorbed into. If you can catch that consistently, then it turns from concerning to merely inconvenient.
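One way to start on the encoder/decoder comparison described above is a plain cosine-similarity matrix between encoder and decoder directions. This is only a minimal sketch: the weight layout (one row per latent), the helper names, and the 0.2 threshold are all assumptions made for illustration, and a large off-diagonal entry is a candidate to inspect, not proof of absorption.

```python
import numpy as np

def unit_rows(m):
    """Normalize each row to unit length, guarding against all-zero rows."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.maximum(norms, 1e-12)

def encoder_decoder_cosines(w_enc, w_dec):
    """Cosine similarity between every encoder and every decoder direction.

    Assumes w_enc and w_dec are both (n_latents, d_model), one row per latent.
    Entry [i, j] compares latent i's encoder with latent j's decoder, so the
    diagonal is the per-feature comparison and the rest is pairs of features.
    """
    return unit_rows(w_enc) @ unit_rows(w_dec).T

def flag_suspicious_pairs(cos, threshold=0.2):
    """Return (i, j, cosine) for off-diagonal entries with |cosine| > threshold."""
    n = cos.shape[0]
    return [
        (i, j, float(cos[i, j]))
        for i in range(n)
        for j in range(n)
        if i != j and abs(cos[i, j]) > threshold
    ]

# Hand-built toy weights (not from a trained SAE): 3 latents in 3 dimensions.
w_dec = np.eye(3)
w_enc = np.eye(3)
w_enc[0, 1] = -1.0  # latent 0's encoder is pushed away from latent 1's decoder

cos = encoder_decoder_cosines(w_enc, w_dec)
pairs = flag_suspicious_pairs(cos)
print(pairs)
```

In this toy case only the pair (0, 1) is flagged, because latent 0's encoder has a strong negative component along latent 1's decoder direction. On a real SAE the threshold would need tuning against the background distribution of off-diagonal similarities.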
    This is all closely related to the issue of compositional codes: absorption is just a code entry that’s compositional in the absorbed instances but not in other instances. The current standard approach to solving that is meta SAEs, which presumably should also help identify absorption. It would be nice to have a cleaner and simpler process than that: I’ve been wondering if it would be possible to modify top-k or JumpReLU SAEs so that the loss function cost for activating more common dictionary entries is lower, in a way that would encourage representing compositional codes directly in the SAE as two or more common activations rather than one rare one. Obviously you can’t overdo making common entries cheap, otherwise your dictionary will just converge on a basis for the embedding space you’re analyzing, all $d$ of which are active all the time. I suspect using something like a cost proportional to $\ln(\max(d, 1/f))$ might work, where $d$ is the dimensionality of the underlying embedding space and $f$ is the frequency of the dictionary entry being activated.
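To make the suggested cost concrete, here is a small sketch that just evaluates $\ln(\max(d, 1/f))$ for a few firing frequencies. The function name and the example values of $d$ and $f$ are hypothetical, and how such a cost would be wired into a top-k or JumpReLU training loss is left open in the comment, so nothing here should be read as a tested training recipe.

```python
import math

def activation_cost(freq, d):
    """Cost ln(max(d, 1/f)) for activating a dictionary entry.

    freq: fraction of inputs on which the entry fires, in (0, 1].
    d:    dimensionality of the embedding space being analyzed.
    """
    if not 0.0 < freq <= 1.0:
        raise ValueError("freq must be in (0, 1]")
    if d < 1:
        raise ValueError("d must be at least 1")
    return math.log(max(d, 1.0 / freq))

d = 512  # illustrative embedding dimension
freqs = [0.5, 1e-2, 1e-4, 1e-6]
costs = [activation_cost(f, d) for f in freqs]

for f, c in zip(freqs, costs):
    print(f"f = {f:g}: cost = {c:.3f}")
```

Read this way, every entry that fires more often than once in $d$ inputs pays the same floor cost $\ln d$, so there is no extra reward for becoming an always-on basis direction, while rarer entries pay $\ln(1/f)$ and get steadily more expensive.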