Great work! Using spelling is very clear example of how information gets absorbed in the SAE latent, and indeed in Meta-SAEswe found many spelling/sound related meta-latents.
I have been thinking a bit on how to solve this problem and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.
Potentially, this would remove the “Starts-with-L”-component from the “lion”-token direction and activate the “Starts-with-L” latent instead. Although this would come at the cost of worse sparsity/reconstruction.
Great work! Using spelling is very clear example of how information gets absorbed in the SAE latent, and indeed in Meta-SAEswe found many spelling/sound related meta-latents.
Thanks! We were sad not to have time to try out Meta-SAEs but want to in the future.
I have been thinking a bit on how to solve this problem and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.
Potentially, this would remove the “Starts-with-L”-component from the “lion”-token direction and activate the “Starts-with-L” latent instead. Although this would come at the cost of worse sparsity/reconstruction.
I think this is the wrong way to go to be honest. I see it as doubling down on sparsity and a single decomposition, both of which I think may just not reflect the underlying data generating process. Heavily inspired by some of John Wentworth’s ideas here.
Rather than doubling down on a single single-layered decomposition for all activations, why not go with a multi-layered decomposition (ie: some combination of SAE and metaSAE, preferably as unsupervised as possible). Or alternatively, maybe the decomposition that is most useful in each case changes and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context.
Rather than doubling down on a single single-layered decomposition for all activations, why not go with a multi-layered decomposition (ie: some combination of SAE and metaSAE, preferably as unsupervised as possible). Or alternatively, maybe the decomposition that is most useful in each case changes and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context.
Definitely seems like multiple ways to interpret this work, as also described in SAE feature geometry is outside the superposition hypothesis. Either we need to find other methods and theory that somehow finds more atomic features, or we need to get a more complete picture of what the SAEs are learning at different levels of abstraction and composition.
Both seem important and interesting lines of work to me!
I would argue that the starting point is to look in variation in exogenous factors. Like let’s say you have a text describing a scene. You could remove individual sentences describing individual objects in the scene to get peturbed texts describing scenes without those objects. Then the first goal for interpretability can be to map out how those changes flow through the network.
This is probably more relevant for interpreting e.g. a vision model than for interpreting a language model. Part of the challenge for language models is that we don’t have a good idea of their final use-case, so it’s hard to come up with an equally-representative task to interpret them on. But maybe with some work one could find one.
I think that’s exactly what we did? Though to be fair we de-emphasized this version of the narrative in the paper: We asked whether Gemma-2-2b could spell / do the first letter identification task. We then asked which latents causally mediated spelling performance, comparing SAE latents to probes. We found that we couldn’t find a set of 26 SAE latents that causally mediated spelling because the relationship between the latents and the character information, “exogenous factors”, if I understand your meaning, wasn’t as clear as it should have been. As I emphasized in a different comment, this work is not about mechanistic anomalies or how the model spells, it’s about measurement error in the SAE method.
As I emphasized in a different comment, this work is not about mechanistic anomalies or how the model spells, it’s about measurement error in the SAE method.
I understand that, I more meant my suggestion as an idea for if you want to go beyond poking holes in SAE to instead solve interpretability.
We asked whether Gemma-2-2b could spell / do the first letter identification task. We then asked which latents causally mediated spelling performance, comparing SAE latents to probes.
One downside to this is that spelling is a fairly simple task for LLMs.
We found that we couldn’t find a set of 26 SAE latents that causally mediated spelling because the relationship between the latents and the character information, “exogenous factors”, if I understand your meaning, wasn’t as clear as it should have been.
I expect that:
Objects in real-world tasks will be spread over many tokens, so they will not be identifiable within individual tokens.
Objects in real-world tasks will be massively heterogenous, so they will not be identifiable with a small number of dimensions.
Implications:
SAE latents will not be relevant at all, because they are limited to individual tokens.
The value of interpretability will less be about finding a small fixed set of mediators and more about developing a taxonomy of root causes and tools that can be used to identify those root causes.
SAEs would be an example of such a tool, except I don’t expect they will end up working.
A half-baked thought on a practical use-case would be, LLMs are often used for making chatbot assistants. If one had a taxonomy for different kinds of users of chatbots, and how they influence the chatbots, one could maybe create a tool for debugging cases where the language model does something weird, by looking at chat logs and extracting the LLM’s model for what kind of user it is dealing with.
But I guess part of the long-term goal of mechanistic interpretability is people are worried about x-risk from learned optimization, and they want to identify fragments of that ahead of time so they can ring the fire alarm. I guess upon reflection I’m especially bearish about this strategy because I think x-risk will occur at a higher level than individual LLMs and that whatever happens when we’re diminished all the way down to a forward propagation is going to look indistinguishable for safe and unsafe AIs.
Great work! Using spelling is very clear example of how information gets absorbed in the SAE latent, and indeed in Meta-SAEs we found many spelling/sound related meta-latents.
I have been thinking a bit on how to solve this problem and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.
Potentially, this would remove the “Starts-with-L”-component from the “lion”-token direction and activate the “Starts-with-L” latent instead. Although this would come at the cost of worse sparsity/reconstruction.
Thanks! We were sad not to have time to try out Meta-SAEs but want to in the future.
I think this is the wrong way to go to be honest. I see it as doubling down on sparsity and a single decomposition, both of which I think may just not reflect the underlying data generating process. Heavily inspired by some of John Wentworth’s ideas here.
Rather than doubling down on a single single-layered decomposition for all activations, why not go with a multi-layered decomposition (ie: some combination of SAE and metaSAE, preferably as unsupervised as possible). Or alternatively, maybe the decomposition that is most useful in each case changes and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context.
Definitely seems like multiple ways to interpret this work, as also described in SAE feature geometry is outside the superposition hypothesis. Either we need to find other methods and theory that somehow finds more atomic features, or we need to get a more complete picture of what the SAEs are learning at different levels of abstraction and composition.
Both seem important and interesting lines of work to me!
I would argue that the starting point is to look in variation in exogenous factors. Like let’s say you have a text describing a scene. You could remove individual sentences describing individual objects in the scene to get peturbed texts describing scenes without those objects. Then the first goal for interpretability can be to map out how those changes flow through the network.
This is probably more relevant for interpreting e.g. a vision model than for interpreting a language model. Part of the challenge for language models is that we don’t have a good idea of their final use-case, so it’s hard to come up with an equally-representative task to interpret them on. But maybe with some work one could find one.
I think that’s exactly what we did? Though to be fair we de-emphasized this version of the narrative in the paper: We asked whether Gemma-2-2b could spell / do the first letter identification task. We then asked which latents causally mediated spelling performance, comparing SAE latents to probes. We found that we couldn’t find a set of 26 SAE latents that causally mediated spelling because the relationship between the latents and the character information, “exogenous factors”, if I understand your meaning, wasn’t as clear as it should have been. As I emphasized in a different comment, this work is not about mechanistic anomalies or how the model spells, it’s about measurement error in the SAE method.
Ah, I didn’t read the paper, only the LW post.
I understand that, I more meant my suggestion as an idea for if you want to go beyond poking holes in SAE to instead solve interpretability.
One downside to this is that spelling is a fairly simple task for LLMs.
I expect that:
Objects in real-world tasks will be spread over many tokens, so they will not be identifiable within individual tokens.
Objects in real-world tasks will be massively heterogenous, so they will not be identifiable with a small number of dimensions.
Implications:
SAE latents will not be relevant at all, because they are limited to individual tokens.
The value of interpretability will less be about finding a small fixed set of mediators and more about developing a taxonomy of root causes and tools that can be used to identify those root causes.
SAEs would be an example of such a tool, except I don’t expect they will end up working.
A half-baked thought on a practical use-case would be, LLMs are often used for making chatbot assistants. If one had a taxonomy for different kinds of users of chatbots, and how they influence the chatbots, one could maybe create a tool for debugging cases where the language model does something weird, by looking at chat logs and extracting the LLM’s model for what kind of user it is dealing with.
But I guess part of the long-term goal of mechanistic interpretability is people are worried about x-risk from learned optimization, and they want to identify fragments of that ahead of time so they can ring the fire alarm. I guess upon reflection I’m especially bearish about this strategy because I think x-risk will occur at a higher level than individual LLMs and that whatever happens when we’re diminished all the way down to a forward propagation is going to look indistinguishable for safe and unsafe AIs.
That’s just my opinion though.