A hacky solution might be to look at the top activations using encoder directions AND decoder directions. We can think of the encoder as giving a “specific” meaning and the decoder a “broad” meaning that potentially overlaps with other latents. Discrepancies between the two sets of top activations would indicate absorption.
Untied encoders give sparser activations by effectively removing activations that can be better attributed to other latents. So an encoder direction’s top activations can only be understood in the context of all the other latents.
Top activations using the decoder direction would be less sparse but give a fuller picture that is not dependent on what other latents are learned. The activations may be less monosemantic though, especially as you move towards weaker activations.
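To make that concrete, here’s a rough sketch of the comparison (all names and shapes are placeholders, not from any particular codebase; it assumes a standard ReLU SAE with encoder weights W_enc, encoder bias b_enc, and decoder weights W_dec):

```python
import torch

# Rough sketch: compare a latent's top-activating tokens under its
# encoder direction vs. its decoder direction. Assumes W_enc has shape
# (d_model, d_sae), b_enc shape (d_sae,), W_dec shape (d_sae, d_model),
# and `acts` is a batch of model activations of shape (n_tokens, d_model).

def top_tokens_for_latent(acts, W_enc, b_enc, W_dec, latent_idx, k=20):
    # Encoder view: the latent's actual SAE activation, which is shaped
    # by competition with every other latent the SAE has learned.
    enc_scores = torch.relu(acts @ W_enc[:, latent_idx] + b_enc[latent_idx])

    # Decoder view: raw projection onto the decoder direction,
    # independent of what the other latents are doing.
    dec_scores = acts @ W_dec[latent_idx]

    top_enc = torch.topk(enc_scores, k).indices
    top_dec = torch.topk(dec_scores, k).indices
    return top_enc, top_dec
```

Tokens that score highly under the decoder direction but are missing from the encoder’s top-k would be the absorption candidates.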
That’s an interesting idea! That might help if training a new SAE with tied encoder/decoder (or some loss which encourages the same thing) isn’t an option. It seems like with absorption you’re still going to get mixes of multiple features in the decoder, and a mix of the correct feature and the negative of excluded features in the encoder, which isn’t ideal. Still, it’s a good question whether it’s possible to take a trained SAE with absorption and somehow identify the absorption and remove or mitigate it rather than training from scratch. It would also be really interesting if we could find a way to detect absorption and use that as a way to quantify the underlying feature co-occurrences.
I think you’re correct that tying the encoder and decoder will mean that the SAE won’t be as sparse. But then, maybe the underlying features we’re trying to reconstruct aren’t necessarily all sparse themselves, so that could be OK. E.g. things like “noun”, “verb”, “is alphanumeric”, etc. are all things the model certainly knows, but they would be dense if tracked in an SAE. The true test will be to try training some real tied SAEs and seeing how interpretable the results look.
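For what it’s worth, a tied SAE is simple to sketch. Something like the following (again just an illustrative sketch under assumed shapes and initialization, not a tested implementation):

```python
import torch
import torch.nn as nn

# Illustrative tied-weight SAE: the decoder is the transpose of the
# encoder, so each latent has a single direction serving both roles.

class TiedSAE(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        # Encode and decode with the same weight matrix W.
        f = torch.relu((x - self.b_dec) @ self.W + self.b_enc)
        x_hat = f @ self.W.T + self.b_dec
        return x_hat, f

# Training would use the usual reconstruction + sparsity objective, e.g.
# loss = (x_hat - x).pow(2).mean() + l1_coeff * f.abs().sum(-1).mean()
```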