I think that’s exactly what we did? Though to be fair we de-emphasized this version of the narrative in the paper: We asked whether Gemma-2-2b could spell / do the first letter identification task. We then asked which latents causally mediated spelling performance, comparing SAE latents to probes. We found that we couldn’t find a set of 26 SAE latents that causally mediated spelling because the relationship between the latents and the character information, “exogenous factors”, if I understand your meaning, wasn’t as clear as it should have been. As I emphasized in a different comment, this work is not about mechanistic anomalies or how the model spells, it’s about measurement error in the SAE method.
As I emphasized in a different comment, this work is not about mechanistic anomalies or how the model spells, it’s about measurement error in the SAE method.
I understand that, I more meant my suggestion as an idea for if you want to go beyond poking holes in SAE to instead solve interpretability.
We asked whether Gemma-2-2b could spell / do the first letter identification task. We then asked which latents causally mediated spelling performance, comparing SAE latents to probes.
One downside to this is that spelling is a fairly simple task for LLMs.
We found that we couldn’t find a set of 26 SAE latents that causally mediated spelling because the relationship between the latents and the character information, “exogenous factors”, if I understand your meaning, wasn’t as clear as it should have been.
I expect that:
Objects in real-world tasks will be spread over many tokens, so they will not be identifiable within individual tokens.
Objects in real-world tasks will be massively heterogenous, so they will not be identifiable with a small number of dimensions.
Implications:
SAE latents will not be relevant at all, because they are limited to individual tokens.
The value of interpretability will less be about finding a small fixed set of mediators and more about developing a taxonomy of root causes and tools that can be used to identify those root causes.
SAEs would be an example of such a tool, except I don’t expect they will end up working.
A half-baked thought on a practical use-case would be, LLMs are often used for making chatbot assistants. If one had a taxonomy for different kinds of users of chatbots, and how they influence the chatbots, one could maybe create a tool for debugging cases where the language model does something weird, by looking at chat logs and extracting the LLM’s model for what kind of user it is dealing with.
But I guess part of the long-term goal of mechanistic interpretability is people are worried about x-risk from learned optimization, and they want to identify fragments of that ahead of time so they can ring the fire alarm. I guess upon reflection I’m especially bearish about this strategy because I think x-risk will occur at a higher level than individual LLMs and that whatever happens when we’re diminished all the way down to a forward propagation is going to look indistinguishable for safe and unsafe AIs.
I think that’s exactly what we did? Though to be fair we de-emphasized this version of the narrative in the paper: We asked whether Gemma-2-2b could spell / do the first letter identification task. We then asked which latents causally mediated spelling performance, comparing SAE latents to probes. We found that we couldn’t find a set of 26 SAE latents that causally mediated spelling because the relationship between the latents and the character information, “exogenous factors”, if I understand your meaning, wasn’t as clear as it should have been. As I emphasized in a different comment, this work is not about mechanistic anomalies or how the model spells, it’s about measurement error in the SAE method.
Ah, I didn’t read the paper, only the LW post.
I understand that, I more meant my suggestion as an idea for if you want to go beyond poking holes in SAE to instead solve interpretability.
One downside to this is that spelling is a fairly simple task for LLMs.
I expect that:
Objects in real-world tasks will be spread over many tokens, so they will not be identifiable within individual tokens.
Objects in real-world tasks will be massively heterogenous, so they will not be identifiable with a small number of dimensions.
Implications:
SAE latents will not be relevant at all, because they are limited to individual tokens.
The value of interpretability will less be about finding a small fixed set of mediators and more about developing a taxonomy of root causes and tools that can be used to identify those root causes.
SAEs would be an example of such a tool, except I don’t expect they will end up working.
A half-baked thought on a practical use-case would be, LLMs are often used for making chatbot assistants. If one had a taxonomy for different kinds of users of chatbots, and how they influence the chatbots, one could maybe create a tool for debugging cases where the language model does something weird, by looking at chat logs and extracting the LLM’s model for what kind of user it is dealing with.
But I guess part of the long-term goal of mechanistic interpretability is people are worried about x-risk from learned optimization, and they want to identify fragments of that ahead of time so they can ring the fire alarm. I guess upon reflection I’m especially bearish about this strategy because I think x-risk will occur at a higher level than individual LLMs and that whatever happens when we’re diminished all the way down to a forward propagation is going to look indistinguishable for safe and unsafe AIs.
That’s just my opinion though.