What a great discovery, that’s extremely cool. Intuitively, I would worry a bit that the ‘spelling miracle’ is such an odd edge case that it may not be representative of typical behavior, although just the fact that ‘starts with _’ shows up as an SAE feature assuages that worry somewhat. I can see why you’d choose it, though, since it’s so easy to mechanically confirm what tokens ought to trigger the feature. Do you have some ideas for non-spelling-related features that would make good next tests?
My take is that I’d expect to see absorption happen any time there’s a dense feature that co-occurs with more sparse features. So, for example, parts of speech: you could have a “noun” latent, and specific nouns (e.g. “dogs”, “cats”, etc.) would probably show this as well. If there’s co-occurrence, then the SAE can maximize sparsity by folding some of the dense feature into the sparse features. This would need to be validated experimentally, though.
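To make that intuition concrete, here’s a minimal toy sketch (my own construction, not from the post; all names and directions are made up): two dictionaries reconstruct the same activations perfectly, but the one that “absorbs” the dense noun direction into the dog latent achieves a lower L0, so a sparsity penalty prefers it.

```python
# Hypothetical toy illustration of why sparsity pressure favors absorption.
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
noun_dir = rng.normal(size=d_model); noun_dir /= np.linalg.norm(noun_dir)
dog_dir = rng.normal(size=d_model); dog_dir /= np.linalg.norm(dog_dir)

# 500 "dog" tokens carry both the dense noun feature and the sparse dog
# feature; 500 other noun tokens carry only the noun feature.
X = np.vstack([np.tile(noun_dir + dog_dir, (500, 1)),
               np.tile(noun_dir, (500, 1))])

def score(decoder, codes):
    recon = codes @ decoder
    return (codes != 0).sum(axis=1).mean(), ((X - recon) ** 2).mean()

# Faithful dictionary: one latent per true feature.
D_faithful = np.stack([noun_dir, dog_dir])
C_faithful = np.zeros((1000, 2))
C_faithful[:500] = [1, 1]  # dog tokens: noun latent AND dog latent fire
C_faithful[500:] = [1, 0]  # other nouns: noun latent only

# Absorbing dictionary: the dog latent decodes to dog_dir + noun_dir,
# and the noun latent stops firing on dog tokens.
D_absorb = np.stack([noun_dir, noun_dir + dog_dir])
C_absorb = np.zeros((1000, 2))
C_absorb[:500] = [0, 1]    # dog tokens: only the absorbed latent fires
C_absorb[500:] = [1, 0]

for name, D, C in [("faithful", D_faithful, C_faithful),
                   ("absorbing", D_absorb, C_absorb)]:
    l0, mse = score(D, C)
    print(f"{name}: mean L0 = {l0:.2f}, MSE = {mse:.2e}")
# -> faithful: L0 = 1.50; absorbing: L0 = 1.00, both with MSE ~ 0.
```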
It’s also a problem that it’s hard to know in advance where this will happen, especially for features where the ground-truth labels are less obvious. E.g. if we want to understand whether a model is acting deceptively, we don’t have strong ground truth for whether a latent should or shouldn’t fire.
Still, it’s promising that this should be something that’s easily testable with toy models, so hopefully we can test out solutions to absorption in an environment where we can control every feature’s frequency and co-occurrence patterns.
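As a starting point for that kind of experiment, here’s a rough sketch (an assumed setup, not from the thread) of a toy data generator where every feature’s marginal frequency and its co-occurrence with other features are set explicitly, so absorption can be induced and measured on purpose:

```python
# Hypothetical toy data generator with controlled feature statistics.
import numpy as np

rng = np.random.default_rng(0)

def make_toy_activations(n_samples, d_model, feature_probs, cooccur):
    """feature_probs[i]: marginal firing rate of feature i.
    cooccur[(i, j)] = p: whenever feature i fires, feature j is forced
    on with probability p."""
    n_feats = len(feature_probs)
    dirs = rng.normal(size=(n_feats, d_model))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    active = rng.random((n_samples, n_feats)) < np.array(feature_probs)
    for (i, j), p in cooccur.items():
        active[:, j] |= active[:, i] & (rng.random(n_samples) < p)
    return active.astype(float) @ dirs, active, dirs

# Example: sparse "dog" and "cat" features (indices 1, 2) that always
# co-occur with a dense "noun" feature (index 0).
X, active, dirs = make_toy_activations(
    n_samples=10_000, d_model=32,
    feature_probs=[0.30, 0.02, 0.02],    # noun, dog, cat
    cooccur={(1, 0): 1.0, (2, 0): 1.0},  # dog => noun, cat => noun
)
print(X.shape, active.mean(axis=0))  # check the realized firing rates
```

An SAE trained on `X` could then be checked directly for whether the latent tracking “dog” has absorbed the “noun” direction.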
Determining ground-truth definitely seems like the tough aspect there. Very good idea to come up with ‘starts with _’ as a case where that issue is tractable, and another good idea to tackle it with toy models where you can control that up front. Thanks!
Thanks Egg! Really good question. Short answer: look at MetaSAEs for inspiration.
Long answer:
There are a few reasons to believe that feature absorption won’t just be a thing for graphemic information:

1. People have noticed SAE latent false negatives in general, beyond just spelling features. For example, this quote from the Anthropic August update (I think they also make a comment about feature coordination being important in the July update as well): “If a feature is active for one prompt but not another, the feature should capture something about the difference between those prompts, in an interpretable way. Empirically, however, we often find this not to be the case – often a feature fires for one prompt but not another, even when our interpretation of the feature would suggest it should apply equally well to both prompts.”
2. MetaSAEs are highly suggestive of lots of absorption. “Starts with letter” features are found by MetaSAEs along with lots of others (my personal favorite is a “ Jerry” feature on which a Jewish meta-feature fires. I wonder what that’s about!?) 🤔
3. Conceptually, being token- or character-specific doesn’t play a big role. As Neel mentioned in his tweet here, once you understand the concept, it’s clear that this is a strategy for generating sparsity in general when you have this kind of relationship between concepts. Here’s a latent in the MetaSAE app that’s a bit less token-aligned but can still be decomposed into meta-latents.
In terms of what I really want to see people look at: what wasn’t clear from MetaSAEs (and which I think is clearer here) is that absorption matters for interpretable causal mediation. That is, for the spelling task, absorbing features look like a kind of mechanistic anomaly (but are actually an artefact of the method) where the spelling information is absorbed. But if we found absorption in a case where we didn’t know the model knew a property of some concept (or we didn’t know it was a property), but saw it in the meta-SAE, that would be very cool. Imagine seeing attribution to a latent tracking something about a person, but then the meta-latents tell you that the model was actually leveraging some very specific fact about that person. This might be really important for understanding things like sycophancy…
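(Aside for readers who haven’t seen them: as I understand it, a meta-SAE is a second, small SAE trained on the decoder directions of a base SAE, so that each base latent decomposes into shared meta-latents. Here’s a rough, hypothetical sketch of that setup; random weights stand in for a real trained decoder.)

```python
# Rough sketch of the meta-SAE idea: decompose base-SAE decoder rows.
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in, bias=False)

    def forward(self, x):
        z = torch.relu(self.enc(x))
        return self.dec(z), z

# Stand-in for the decoder rows of a trained base SAE (n_latents x d_model);
# in practice you would load the real weights.
base_decoder = torch.randn(1024, 64)
base_decoder = base_decoder / base_decoder.norm(dim=1, keepdim=True)

meta_sae = TinySAE(d_in=64, d_hidden=256)
opt = torch.optim.Adam(meta_sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

for step in range(2000):
    recon, z = meta_sae(base_decoder)
    loss = (recon - base_decoder).pow(2).mean() + l1_coeff * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each row of z is now a sparse decomposition of one base latent into
# meta-latents; a "dog" latent loading on a "noun"-like meta-latent
# would be a hint of absorption.
```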
That all makes sense, thanks. I’m really looking forward to seeing where this line of research goes from here!