Thank you for sharing this! I clearly didn’t read the original “Towards Monosemanticity” closely enough! It seems like the main argument is that when the weights are untied, the encoder and decoder learn different vectors, and this is taken as evidence that the encoder and decoder should be untied. But this is consistent with the feature absorption work: we see the encoder and decoder learning different things, not because the SAE is learning better representations, but because it is finding degenerate solutions which increase sparsity.
Are there any known patterns of feature firings where untying the encoder and decoder results in the SAE finding the correct or better representations, but where tying the encoder and decoder does not?
I don’t know of specific examples, but this is the image I have in my head when thinking about why untied weights are more free than tied weights:
More generally, I think this is why studying SAEs in the TMS setup can be a bit challenging: there’s often too much symmetry and not enough complexity for untied weights to be useful, meaning just forcing your weights to be tied can fix a lot of problems! (We include it in ARENA mostly to illustrate key concepts, not because it gets you many super informative results.) But I’m keen for more work like this trying to understand feature absorption better in more tractable cases.
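To make the tied vs. untied distinction concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the weights are hand-picked rather than trained, there are no biases, and the helper name `sae_forward` is hypothetical. It only shows the extra freedom an untied encoder has when two feature directions are not orthogonal; it says nothing about what a trained SAE actually converges to.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sae_forward(x, enc_rows, dec_rows):
    """Toy SAE forward pass without biases (illustrative assumption).

    latents = ReLU(W_enc x); recon = sum_i latents[i] * dec_rows[i].
    """
    latents = [max(0.0, dot(e, x)) for e in enc_rows]
    recon = [
        sum(l * d[k] for l, d in zip(latents, dec_rows))
        for k in range(len(x))
    ]
    return latents, recon


# Two unit-norm, non-orthogonal feature directions used as decoder rows.
decoder = [[1.0, 0.0], [0.6, 0.8]]

# Tied: the encoder reuses the decoder directions.
tied_encoder = decoder

# Untied: hand-picked encoder rows, each orthogonal to the *other*
# feature direction (the dual basis of the decoder rows).
untied_encoder = [[1.0, -0.75], [0.0, 1.25]]

x = [1.0, 0.0]  # input where only feature 0 is active

tied_latents, tied_recon = sae_forward(x, tied_encoder, decoder)
untied_latents, untied_recon = sae_forward(x, untied_encoder, decoder)

print("tied:  ", tied_latents, tied_recon)
print("untied:", untied_latents, untied_recon)
```

In this toy case the tied encoder fires latent 1 spuriously, because feature 1's direction overlaps with the input, while the untied encoder reads off feature 0 alone and reconstructs the input exactly. A tied SAE with a negative encoder bias could suppress some of that interference, so treat this as an intuition pump rather than an answer to the question above.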