Joseph Bloom comments on [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Joseph Bloom 25 Sep 2024 15:31 UTC
6 points
Thanks Egg! Really good question. Short answer: look at MetaSAEs for inspiration.
Long answer:
There are a few reasons to believe that feature absorption won’t just be a thing for graphemic information:
- People have noticed SAE latent false negatives in general, beyond just spelling features. For example, take this quote from the Anthropic August update (I think they also comment on feature coordination being important in the July update):
  “If a feature is active for one prompt but not another, the feature should capture something about the difference between those prompts, in an interpretable way. Empirically, however, we often find this not to be the case – often a feature fires for one prompt but not another, even when our interpretation of the feature would suggest it should apply equally well to both prompts.”
- MetaSAEs are highly suggestive of lots of absorption. “Starts with letter” features are found by MetaSAEs along with lots of others (my personal favorite is a “ Jerry” feature on which a Jewish meta-feature fires. I wonder what that’s about!?) 🤔
- Conceptually, being token- or character-specific doesn’t play a big role. As Neel mentioned in his tweet here, once you understand the concept, it’s clear that absorption is a general strategy for generating sparsity whenever you have this kind of relationship between concepts. Here’s a latent in the MetaSAE app that’s a bit less token-aligned but can still be decomposed into meta-latents.
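To make the sparsity argument concrete, here is a minimal toy sketch. It is not taken from the paper or from the MetaSAE work: the three-dimensional "activations", the two hand-written dictionaries, and all names are hypothetical. It only illustrates that when one concept implies another (a " snake" token always starts with "s"), folding the general direction into the specific latent gives the same reconstruction with fewer active latents, at the cost of a false negative on the general latent.

```python
import numpy as np

# Hypothetical orthogonal directions in a toy 3-d activation space.
starts_with_s = np.array([1.0, 0.0, 0.0])  # general property
snake = np.array([0.0, 1.0, 0.0])          # token-specific concept
sun = np.array([0.0, 0.0, 1.0])            # another token-specific concept

# Toy model activations: each token carries its own concept plus the general property.
activations = {
    "snake": starts_with_s + snake,
    "sun": starts_with_s + sun,
}

# Dictionary without absorption: one decoder row per concept.
dict_plain = np.stack([starts_with_s, snake, sun])
# Dictionary with absorption: the "snake" latent's decoder row also carries
# the starts-with-s direction.
dict_absorbed = np.stack([starts_with_s, starts_with_s + snake, sun])

# Hand-written latent codes (assumed, not learned) for each dictionary.
codes_plain = {"snake": np.array([1.0, 1.0, 0.0]), "sun": np.array([1.0, 0.0, 1.0])}
codes_absorbed = {"snake": np.array([0.0, 1.0, 0.0]), "sun": np.array([1.0, 0.0, 1.0])}


def l0(code):
    """Number of active latents."""
    return int(np.count_nonzero(code))


def reconstruct(code, dictionary):
    """Linear decoder: weighted sum of dictionary rows."""
    return code @ dictionary


for token, target in activations.items():
    plain_ok = np.allclose(reconstruct(codes_plain[token], dict_plain), target)
    absorbed_ok = np.allclose(reconstruct(codes_absorbed[token], dict_absorbed), target)
    print(token, plain_ok, absorbed_ok, l0(codes_plain[token]), l0(codes_absorbed[token]))
```

In this toy setup both dictionaries reconstruct every token exactly, but the absorbed code for "snake" uses one latent instead of two, and the starts-with-s latent is silent on a token where its interpretation says it should fire.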
In terms of what I really want to see people look at: what wasn’t clear from Meta-SAEs (and which I think is clearer here) is that absorption is important for interpretable causal mediation. That is, for the spelling task, absorbing features look like a kind of mechanistic anomaly (but are actually an artefact of the method) where the spelling information is absorbed. But if we found absorption in a case where we didn’t know the model knew a property of some concept (or we didn’t know it was a property), but saw it in the meta-SAE, that would be very cool. Imagine seeing attribution to a latent tracking something about a person, but then the meta-latents tell you that the model was actually leveraging some very specific fact about that person. This might be really important for understanding things like sycophancy…
- eggsyntax 25 Sep 2024 19:48 UTC
  2 points
  That all makes sense, thanks. I’m really looking forward to seeing where this line of research goes from here!