My take is that I’d expect to see absorption any time a dense feature co-occurs with sparser features. Parts of speech are one example: you could have a “noun” latent, and latents for specific nouns (e.g. “dogs”, “cats”) would probably show this as well. Wherever there’s co-occurrence, the SAE can increase sparsity by folding some of the dense feature into the sparse features. This would need to be validated experimentally, though.
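As a rough sketch of why folding pays off under an L1 sparsity penalty (the feature directions and names here are hypothetical, not from any real SAE):

```python
import numpy as np

# Hypothetical toy setup: a dense "noun" feature that always co-occurs
# with a sparser token feature like "dog".
rng = np.random.default_rng(0)
d = 16
noun = rng.standard_normal(d)
noun /= np.linalg.norm(noun)
dog = rng.standard_normal(d)
dog /= np.linalg.norm(dog)

# A "dog" activation contains both the noun feature and the dog feature.
x = noun + dog

# Faithful dictionary: separate unit-norm decoder directions for "noun"
# and "dog". Reconstructing x needs two latents active at magnitude 1,
# so the L1 cost is 2.
faithful_l1 = 2.0

# Absorbed dictionary: a single "dog" latent whose decoder direction is
# the (normalized) sum noun + dog, i.e. the noun feature is folded in.
# One latent fires, at magnitude ||noun + dog||, which is below 2
# whenever the two directions aren't perfectly aligned.
absorbed_l1 = float(np.linalg.norm(noun + dog))

print(f"faithful L1: {faithful_l1:.3f}, absorbed L1: {absorbed_l1:.3f}")
```

For near-orthogonal directions the absorbed latent costs about √2 instead of 2, so the L1 term rewards absorption on every “dog” input; the price is that the “noun” latent then mysteriously fails to fire on “dog”.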
It’s also a problem that it’s hard to predict where this will happen, especially for features where the ground-truth labels are less obvious. E.g. if we want to understand whether a model is acting deceptively, we don’t have strong ground truth for whether a given latent should or shouldn’t fire.
Still, it’s promising that this should be something that’s easily testable with toy models, so hopefully we can test out solutions to absorption in an environment where we can control every feature’s frequency and co-occurrence patterns.
Determining ground truth definitely seems like the tough aspect there. Picking ‘starts with _’ as a case where that issue is tractable was a very good idea, and tackling it with toy models where you can control the ground truth up front is another. Thanks!