I would argue that the starting point is to look at variation in exogenous factors. Like, say you have a text describing a scene. You could remove individual sentences describing individual objects in the scene to get perturbed texts describing scenes without those objects. Then the first goal for interpretability can be to map out how those changes flow through the network.
This is probably more relevant for interpreting e.g. a vision model than for interpreting a language model. Part of the challenge for language models is that we don’t have a good idea of their final use-case, so it’s hard to come up with an equally-representative task to interpret them on. But maybe with some work one could find one.
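To make the perturbation idea concrete, here is a minimal sketch of what I mean. GPT-2 via HuggingFace transformers is used purely as a stand-in model, and mean-pooling over tokens is a crude simplification so that texts of different lengths can be compared layer by layer; none of this is meant as the definitive way to do it.

```python
# Minimal sketch of the perturbation idea above. GPT-2 is only a stand-in
# model; mean-pooling over tokens is a crude simplification so that texts
# of different lengths can be compared layer by layer.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def pooled_hidden_states(text: str):
    """One mean-pooled hidden-state vector per layer for the given text."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

scene = (
    "A red car is parked by the curb. "
    "A dog sleeps on the porch. "
    "Rain clouds gather overhead."
)
sentences = [s.strip() + "." for s in scene.split(".") if s.strip()]
baseline = pooled_hidden_states(scene)

# Remove one object-describing sentence at a time and measure how far the
# perturbed representation drifts from the baseline at each layer.
for i, removed in enumerate(sentences):
    perturbed = " ".join(sentences[:i] + sentences[i + 1:])
    drifts = [
        torch.dist(a, b).item()
        for a, b in zip(pooled_hidden_states(perturbed), baseline)
    ]
    print(f"removed {removed!r}: per-layer drift = "
          f"{[round(d, 2) for d in drifts]}")
```

Whether distance between mean-pooled activations is the right notion of "how the change flows through the network" is itself an open choice; per-token or per-feature comparisons would be the natural refinements.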
I think that’s exactly what we did? Though to be fair we de-emphasized this version of the narrative in the paper: We asked whether Gemma-2-2b could spell / do the first letter identification task. We then asked which latents causally mediated spelling performance, comparing SAE latents to probes. We found that we couldn’t find a set of 26 SAE latents that causally mediated spelling because the relationship between the latents and the character information, “exogenous factors”, if I understand your meaning, wasn’t as clear as it should have been. As I emphasized in a different comment, this work is not about mechanistic anomalies or how the model spells, it’s about measurement error in the SAE method.
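To unpack what "causally mediated" means operationally, here is a rough sketch of the general kind of ablation check involved; this is not the paper's code. GPT-2 and TransformerLens stand in for the actual Gemma-2-2b setup, and the candidate directions are random placeholders for what would really be 26 SAE decoder directions or a first-letter probe's weight rows.

```python
# Rough sketch of a subspace-ablation check for causal mediation, not the
# paper's code. GPT-2 + TransformerLens stand in for the real Gemma-2-2b
# setup, and `candidate_dirs` is a random placeholder for what would really
# be 26 SAE decoder directions (or a first-letter probe's weight rows).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER = 6
HOOK = f"blocks.{LAYER}.hook_resid_post"

candidate_dirs = torch.randn(26, model.cfg.d_model)  # placeholder directions
Q, _ = torch.linalg.qr(candidate_dirs.T)  # orthonormal basis for the subspace

def ablate_subspace(resid, hook):
    # Remove the residual-stream component lying in the candidate subspace.
    return resid - (resid @ Q) @ Q.T

prompt = "The word 'cat' starts with the letter '"
letter = model.to_single_token("c")

clean = model(prompt)
ablated = model.run_with_hooks(prompt, fwd_hooks=[(HOOK, ablate_subspace)])

# If the directions causally mediate first-letter identification, the logit
# for the correct letter should drop substantially under ablation.
print("clean logit:  ", clean[0, -1, letter].item())
print("ablated logit:", ablated[0, -1, letter].item())
```

The real comparison is between running this kind of check with a set of SAE latents versus with probe directions, and measuring task performance over many words rather than a single prompt.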
As I emphasized in a different comment, this work is not about mechanistic anomalies or how the model spells, it’s about measurement error in the SAE method.
Ah, I didn’t read the paper, only the LW post. I understand that; I meant my suggestion more as an idea for if you want to go beyond poking holes in SAEs and instead solve interpretability.
We asked whether Gemma-2-2b could spell / do the first letter identification task. We then asked which latents causally mediated spelling performance, comparing SAE latents to probes.
One downside to this is that spelling is a fairly simple task for LLMs.
We found that we couldn’t find a set of 26 SAE latents that causally mediated spelling because the relationship between the latents and the character information, “exogenous factors”, if I understand your meaning, wasn’t as clear as it should have been.
I expect that:
Objects in real-world tasks will be spread over many tokens, so they will not be identifiable within individual tokens.
Objects in real-world tasks will be massively heterogeneous, so they will not be identifiable with a small number of dimensions.
Implications:
SAE latents will not be relevant at all, because they are limited to individual tokens.
The value of interpretability will be less about finding a small fixed set of mediators and more about developing a taxonomy of root causes and tools that can be used to identify those root causes.
SAEs would be an example of such a tool, except I don’t expect they will end up working.
A half-baked thought on a practical use-case: LLMs are often used to build chatbot assistants. If one had a taxonomy of the different kinds of chatbot users and how they influence the chatbot, one could maybe create a tool for debugging cases where the language model does something weird, by looking at chat logs and extracting the LLM’s model of what kind of user it is dealing with.
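A toy sketch of the most minimal version of that tool, in the same half-baked spirit: `ask_model` is a placeholder for whatever chat-completion call is available, and the taxonomy is purely illustrative.

```python
# Toy sketch of the chat-log debugging idea. `ask_model` is a placeholder
# for any chat-completion function; the taxonomy below is purely illustrative.
from typing import Callable

USER_TAXONOMY = [
    "confused novice asking for basic help",
    "expert probing edge cases",
    "adversarial user trying to jailbreak the assistant",
    "user role-playing a fictional scenario",
]

def infer_user_model(chat_log: str, ask_model: Callable[[str], str]) -> str:
    """Ask an LLM which user type the assistant appears to be modelling."""
    prompt = (
        "Here is a chat log between an assistant and a user:\n\n"
        f"{chat_log}\n\n"
        "Which of these user types does the assistant appear to be treating "
        "the user as? Answer with the single closest type.\n"
        + "\n".join(f"- {t}" for t in USER_TAXONOMY)
    )
    return ask_model(prompt)

# Debugging loop: run this over logs where the assistant behaved weirdly and
# look for user types that are over-represented among the failures.
```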
But I guess part of the long-term goal of mechanistic interpretability is that people are worried about x-risk from learned optimization, and they want to identify fragments of that ahead of time so they can ring the fire alarm. I guess upon reflection I’m especially bearish about this strategy, because I think x-risk will occur at a higher level than individual LLMs, and that whatever happens once we’ve zoomed all the way down to a single forward pass is going to look indistinguishable between safe and unsafe AIs.
That’s just my opinion though.