I’m not as familiar with the history of SAEs. Were tied weights used in the past, but then abandoned because they resulted in lower sparsity? If that sparsity is gained by creating feature absorption, then it’s not a good thing, since absorption leads to higher sparsity but worse interpretability. I’m uncomfortable with the idea that higher sparsity is always better, since the model might just be tracking some underlying features that are dense. IMO the goal should be to recover the model’s “true” features, if such a thing can be said to exist, rather than to maximize sparsity, which is just a proxy metric.
The thesis of this feature absorption work is that absorption causes latents that look interpretable but actually aren’t. We initially found this while trying to evaluate the interpretability of Gemma Scope SAEs: latents which seemed to be tracking an interpretable feature had holes in their recall that didn’t make sense. I’d be curious if tied weights were used in the past and, if so, why they were abandoned. Regardless, it seems like the next step for this work is to just try out variants of tied weights for real LLM SAEs and see if the results are more interpretable, regardless of the sparsity scores.
Originally they were tied (because it makes intuitive sense), but I believe Anthropic was the first to suggest untying them, and found that this helped it differentiate similar features:
“However, we find that in our trained models the learned encoder weights are not the transpose of the decoder weights and are cleverly offset to increase representational capacity. Specifically, we find that similar features which have closely related dictionary vectors have encoder weights that are offset so that they prevent crosstalk between the noisy feature inputs and confusion between the distinct features.”
That post also includes a summary of Neel Nanda’s replication of the experiments, and they provided an additional interpretation of this that I think is interesting.
“One question from this work is whether the encoder and decoder should be tied. I find that, empirically, the decoder and encoder weights for each feature are moderately different, with median cosine similarity of only 0.5, which is empirical evidence they’re doing different things and should not be tied. Conceptually, the encoder and decoder are doing different things: the encoder is detecting, finding the optimal direction to project onto to detect the feature, minimising interference with other similar features, while the decoder is trying to represent the feature, and tries to approximate the ‘true’ feature direction regardless of any interference.”
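For anyone wanting to reproduce the kind of number quoted above, here is a minimal sketch of a per-latent encoder/decoder cosine similarity and its median. It is pure Python so it runs anywhere; the row-per-latent layout and the toy weights are assumptions for illustration and are not taken from any real SAE.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def encoder_decoder_cosines(enc_rows, dec_rows):
    """Cosine similarity between each latent's encoder and decoder vector.

    Assumes enc_rows[i] and dec_rows[i] are the d_model-dimensional
    encoder and decoder directions for latent i (hypothetical layout).
    """
    return [cosine(e, d) for e, d in zip(enc_rows, dec_rows)]


# Toy weights, made up for illustration only.
enc = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
dec = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

sims = encoder_decoder_cosines(enc, dec)
median_sim = median(sims)
print(sims, median_sim)
```

A tied SAE would give a similarity of 1 for every latent by construction, so the statistic is only informative for untied weights.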
Thank you for sharing this! I clearly didn’t read the original “Towards Monosemanticity” closely enough! It seems like the main argument is that when the weights are untied, the encoder and decoder learn different vectors, and that this is evidence that they should be untied. But this is also consistent with the feature absorption work: we see the encoder and decoder learning different things, not because the SAE is learning better representations, but because it is finding degenerate solutions which increase sparsity.
Are there any known patterns of feature firings where untying the encoder and decoder results in the SAE finding the correct or better representations, but where tying the encoder and decoder does not?
I don’t know of specific examples, but this is the image I have in my head when thinking about why untied weights are more free than tied weights:
More generally, I think this is why studying SAEs in the TMS setup can be a bit challenging: there’s often too much symmetry and not enough complexity for untied weights to be useful, meaning just forcing your weights to be tied can fix a lot of problems! (We include it in ARENA mostly for illustration of key concepts, not because it gets you many super informative results.) But I’m keen for more work like this trying to understand feature absorption better in more tractable cases.
One approach I’d try would be to keep the encoder and decoder weights untied (or possibly add a loss term to mildly encourage them to be similar), but then analyze the patterns between them (both for an individual feature and between pairs of features) for evidence of absorption. Absorption is annoying, but it’s only really dangerous if you don’t know it’s happening and it causes you to think a feature is inactive when it’s instead non-obviously active via another feature it’s been absorbed into. If you can catch that consistently, then it turns from concerning to merely inconvenient.
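To make the pairwise part of that idea concrete, here is a rough sketch of one possible screen. It flags pairs of latents whose decoder directions overlap strongly while at least one latent’s encoder is nearly orthogonal to (or anti-aligned with) the other’s decoder. The function name, both thresholds, and the toy weights are hypothetical, and this is only a candidate filter: the same signature would also show up for genuinely similar features whose encoders are offset to avoid crosstalk, so flagged pairs would still need checking against actual activations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_pairs(enc_rows, dec_rows, dec_min=0.5, cross_max=0.1):
    """Return (i, j, dec_sim, cross_ij, cross_ji) for suspicious pairs.

    dec_sim  : cosine between the two decoder directions
    cross_ij : cosine between latent i's encoder and latent j's decoder
    cross_ji : cosine between latent j's encoder and latent i's decoder
    Thresholds are arbitrary illustrative defaults, not tuned values.
    """
    pairs = []
    n = len(dec_rows)
    for i in range(n):
        for j in range(i + 1, n):
            dec_sim = cosine(dec_rows[i], dec_rows[j])
            if dec_sim < dec_min:
                continue  # decoders do not overlap enough to be interesting
            cross_ij = cosine(enc_rows[i], dec_rows[j])
            cross_ji = cosine(enc_rows[j], dec_rows[i])
            if min(cross_ij, cross_ji) <= cross_max:
                pairs.append((i, j, dec_sim, cross_ij, cross_ji))
    return pairs


# Toy setup (made up): latent 0 tracks a general feature along axis 0,
# latent 1 writes that general direction plus its own axis 1, and latent 0's
# encoder is offset so it switches off when axis 1 is present.
# Latent 2 is an unrelated feature.
dec = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
enc = [[1.0, -1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

pairs = candidate_pairs(enc, dec)
print(pairs)
```

In this toy case only the pair (0, 1) is flagged; the unrelated latent 2 is skipped because its decoder does not overlap with the others.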
This is all closely related to the issue of compositional codes: absorption is just a code entry that’s compositional in the absorbed instances but not in other instances. The current standard approach to solving that is meta SAEs, which presumably should also help identify absorption. It would be nice to have a cleaner and simpler process than that: I’ve been wondering if it would be possible to modify top-k or JumpReLU SAEs so that the loss function cost for activating more common dictionary entries is lower, in a way that would encourage representing compositional codes directly in the SAE as two or more more-common activations rather than one rare one. Obviously you can’t overdo making common entries cheap, otherwise your dictionary will just converge on a basis for the embedding space you’re analyzing, all d of which are active all the time. I suspect using something like a cost proportional to ln(max(d,1/f)) might work, where d is the dimensionality of the underlying embedding space and f is the frequency of the dictionary entry being activated.
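As a quick sanity check on the proposed cost, here is a small sketch that just evaluates ln(max(d, 1/f)) for a few frequencies. The function names, the embedding width of 512, and the example frequencies are all made up for illustration, and how such a cost would actually be wired into a top-k or JumpReLU training loss is left open.

```python
import math


def activation_cost(freq, d_model):
    """Per-activation cost ln(max(d, 1/f)) for a latent firing with frequency f."""
    if not 0.0 < freq <= 1.0:
        raise ValueError("freq must be in (0, 1]")
    return math.log(max(d_model, 1.0 / freq))


def total_cost(active_freqs, d_model):
    """Sum of per-activation costs over the latents active on one input."""
    return sum(activation_cost(f, d_model) for f in active_freqs)


d_model = 512  # hypothetical embedding width

# Any latent firing more often than 1/d pays the same floor cost ln(d).
floor = activation_cost(0.5, d_model)

# Composition as two common latents versus a single rare latent.
two_common = total_cost([0.01, 0.01], d_model)
one_rare = total_cost([1e-4], d_model)
one_very_rare = total_cost([1e-6], d_model)

print(floor, two_common, one_rare, one_very_rare)
```

Two floor-priced entries cost 2 ln d = ln d², so under this cost a single entry is more expensive than a two-entry composition only when 1/f > d². With d = 512 that break-even sits at f of about 3.8e-6, which is why the 1e-4 latent above is still cheaper on its own while the 1e-6 latent is not.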