[Summary] Progress Update #1 from the GDM Mech Interp Team
Introduction
This is a progress update from the Google DeepMind mechanistic interpretability team, inspired by the Anthropic team’s excellent monthly updates! Our goal was to write up a series of snippets covering a range of things that we thought would be interesting to the broader community, but didn’t yet meet our bar for a paper. This is a mix of promising initial steps on larger investigations, write-ups of small investigations, replications, and negative results.
Our team’s two main current goals are to scale sparse autoencoders to larger models, and to do further basic science on SAEs. We expect these snippets to mostly be of interest to other mech interp practitioners, especially those working with SAEs. One exception is our infrastructure snippet, which we think could be useful to mechanistic interpretability researchers more broadly. We present preliminary results in a range of areas to do with SAEs, from improving and interpreting steering vectors, to improving ghost grads, to replacing SAE encoders with an inference-time sparse approximation algorithm.
Where possible, we’ve tried to clearly state our level of confidence in our results, and the evidence that led us to these conclusions so you can evaluate for yourself. We expect to be wrong about at least some of the things in here! Please take this in the spirit of an interesting idea shared by a colleague at a lab meeting, rather than as polished pieces of research we’re willing to stake our reputation on. We hope to turn some of the more promising snippets into more fleshed out and rigorous papers at a later date.
We also have a forthcoming paper on an updated SAE architecture that seems to be a moderate Pareto improvement; stay tuned!
How to read this post: This is a short summary post, accompanying the much longer post with all the snippets. We recommend reading the summaries of each snippet below, and then zooming in to whichever snippets seem most interesting to you. They can be read in any order.
Summaries
Activation Steering with SAEs
We analyse the steering vectors used in Turner et al., 2023 using SAEs. We find that they are highly interpretable, and that in some cases we can get better performance by constructing interpretable steering vectors from SAE features, though in other cases we struggle to. We hope to better disentangle what’s going on in future work.
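To make the second idea concrete, here is a minimal sketch of decomposing an existing steering vector into SAE features and rebuilding a sparser steering vector from its top few features. The parameter names (`W_enc`, `b_enc`, `W_dec`) and the top-k selection are our illustrative choices, not the exact procedure used in the snippet.

```python
import jax
import jax.numpy as jnp

def sae_encode(x, W_enc, b_enc):
    # Standard SAE encoder: linear map followed by a ReLU.
    return jax.nn.relu(x @ W_enc + b_enc)

def interpretable_steering_vector(steering_vec, W_enc, b_enc, W_dec, k=5):
    # Encode the original steering vector and keep only its top-k SAE features.
    acts = sae_encode(steering_vec, W_enc, b_enc)
    top_idx = jnp.argsort(acts)[-k:]
    sparse_acts = jnp.zeros_like(acts).at[top_idx].set(acts[top_idx])
    # Reconstruct a steering vector as a sparse combination of decoder rows,
    # each of which can be inspected and interpreted individually.
    return sparse_acts @ W_dec
```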
Replacing SAE Encoders with Inference-Time Optimisation
There are two sub-problems in dictionary learning: learning the dictionary of feature vectors (an SAE’s decoder, $W_{dec}$) and computing the sparse coefficient vector on a given input (an SAE’s encoder). The SAE’s encoder is a linear map followed by a ReLU, which is a weak function with a range of issues. We explore disentangling these problems by taking a trained SAE, throwing away the encoder, keeping the decoder, and learning the sparse coefficients at inference-time. This lets us study how well the SAE encoder is working while holding the quality of the dictionary constant, and better evaluate the quality of different dictionaries.
One notable finding is that high L0 SAEs have higher quality dictionaries than low L0 SAEs, even if we learn coefficients with low L0 at inference time.
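As a rough illustration of the setup, the sketch below fits sparse coefficients for a single activation against a frozen decoder by gradient descent with an L1 penalty. The names (`W_dec`, `b_dec`, `fit_coefficients`) and hyperparameters are ours, and the snippet may use a different sparse approximation algorithm; this is just one simple way to do inference-time optimisation.

```python
import jax
import jax.numpy as jnp
import optax

def fit_coefficients(x, W_dec, b_dec, l1_coeff=1e-3, steps=200, lr=1e-2):
    # One coefficient per dictionary element; the decoder stays frozen.
    coeffs = jnp.zeros(W_dec.shape[0])
    opt = optax.adam(lr)
    opt_state = opt.init(coeffs)

    def loss_fn(c):
        recon = c @ W_dec + b_dec
        # Reconstruction error plus an L1 sparsity penalty on the coefficients.
        return jnp.sum((recon - x) ** 2) + l1_coeff * jnp.sum(jnp.abs(c))

    @jax.jit
    def step(c, state):
        loss, grads = jax.value_and_grad(loss_fn)(c)
        updates, state = opt.update(grads, state)
        return optax.apply_updates(c, updates), state, loss

    for _ in range(steps):
        coeffs, opt_state, _ = step(coeffs, opt_state)
    return coeffs
```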
Improving Ghost Grads
In their January update, the Anthropic team introduced a new auxiliary loss, “ghost grads”, as a potential improvement on resampling for minimising the number of dead features in an SAE. We replicate their work, and find that it under-performs resampling. We present an improvement, multiplying the ghost grads loss by the proportion of dead features, which makes ghost grads competitive.
We don’t yet see a compelling reason to move away from resampling to ghost grads as our default method for training SAEs, but we think it’s possible ghost grads could be further improved.
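Concretely, the tweak is just a rescaling of the auxiliary loss. The sketch below assumes a `ghost_loss` computed as in Anthropic’s January update and a boolean `is_dead` mask over the SAE’s features (e.g. features that have not fired for some number of training steps); both names are hypothetical.

```python
import jax.numpy as jnp

def scaled_ghost_grads_loss(ghost_loss, is_dead):
    # Multiply the ghost grads loss by the fraction of features that are dead,
    # so the auxiliary term shrinks as dead features are revived.
    dead_fraction = jnp.mean(is_dead.astype(jnp.float32))
    return dead_fraction * ghost_loss
```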
SAEs on Tracr and Toy Models
Iterating on the science of SAEs is hard in language models, as things are slow and we lack a ground truth, so a natural goal is training SAEs on simpler toy models. We tried training SAEs on compressed Tracr models, but ran into a range of difficulties, and now think that compression may be very difficult to achieve in Tracr models without changing the underlying algorithm.
We also try training SAEs on the ReLU output model of Toy Models of Superposition, but find that it’s too toy to be an interesting proxy for language models.
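For reference, the ReLU output model is the small model studied in Toy Models of Superposition: sparse features are compressed into a lower-dimensional hidden space and reconstructed through a tied weight matrix and a ReLU. A minimal version, with illustrative shapes:

```python
import jax
import jax.numpy as jnp

def relu_output_model(params, x):
    W, b = params["W"], params["b"]   # W: (n_features, n_hidden), b: (n_features,)
    h = x @ W                         # compress features into the hidden space
    return jax.nn.relu(h @ W.T + b)   # reconstruct features with tied weights
```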
Replicating “Improvements to Dictionary Learning”
We have tried replicating some of the ideas listed in the “Improvements to Dictionary Learning” section of the Anthropic interpretability team’s February update. In this snippet we briefly share our findings. We now set Adam’s beta1 to 0 by default in our SAE training runs, which sometimes helps and is sometimes neutral, but haven’t adopted any of the other recommendations.
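For example, in optax this is a one-line change (the learning rate shown is illustrative, not our actual hyperparameter):

```python
import optax

# Adam with beta1 set to 0, i.e. no first-moment momentum.
optimizer = optax.adam(learning_rate=3e-4, b1=0.0)
```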
Interpreting SAE Features with Gemini Ultra
In line with prior work, we’ve explored measuring SAE interpretability automatically by using LLMs to detect patterns in activations. We write up our thoughts on the strengths and weaknesses of this approach, some tentative observations, and present a case study where Gemini interpreted a feature we’d initially thought uninterpretable. Overall, we consider auto-interp a useful technique that provides some signal on top of cheap metrics like L0 and loss recovered, but it may also introduce systematic biases and should be used with caution.
Instrumenting LLM model internals in JAX
Good tooling is essential for doing mechanistic interpretability research, in particular for intervening on and saving intermediate activations. We work in JAX, which introduces unique opportunities and challenges. We write up some desiderata, and the solutions we’ve found for meeting them, which may be useful for other engineers and researchers doing mechanistic interpretability, especially in JAX. This is not specific to the SAE project.
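As a flavour of the kind of pattern discussed there, here is a minimal, purely functional sketch of exposing named activation sites for capture and intervention. The model structure and parameter names are hypothetical, and this is illustrative rather than our actual implementation.

```python
import jax
import jax.numpy as jnp

def forward(params, x, interventions=None, capture=None):
    # `interventions` maps site names to functions applied at that activation;
    # `capture` is an optional set of site names whose activations to return.
    interventions = interventions or {}
    captured = {}

    def site(name, value):
        if name in interventions:
            value = interventions[name](value)
        if capture is None or name in capture:
            captured[name] = value
        return value

    h = site("embed", x @ params["W_embed"])
    h = h + site("mlp_out", jax.nn.relu(h @ params["W_in"]) @ params["W_out"])
    logits = site("logits", h @ params["W_unembed"])
    return logits, captured
```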