Enabling New Applications with Today’s Mechanistic Interpretability Toolkit

Working on the more applied side of Mechanistic Interpretability (MI) research, I wanted to share some more evidence that building on top of MI tools both improves model performance on existing tasks and enables new applications right now.

Over the past few months, I’ve been working with sparse autoencoders (SAEs) on topic-steering applications (fine-tuning) [#5 on @scasper’s recent post under New Predictions]. In short, they perform well on different steering targets depending on how representative the underlying SAE is (you can check out my code/methods on this topic from this summer: https://github.com/IBM/sae-steering). This is pretty cool, and it is the kind of incremental improvement we have historically seen in progress towards better fine-tuning approaches.

What I find far more exciting, given my background in the health domain where interpretability and model control are essential, are the new, practical approaches that MI now makes possible in critical domains like health.

Takeaways from Applied MI Research:

1. SAEs are already practical for downstream fine-tuning tasks

Many existing approaches (e.g. Templeton et al. 2024) show how altering SAE neurons can impact generated output. In practice, these are limited by the need to find the right set of neurons among millions of candidates and to choose the weights to set them to (you can observe both limitations in Gemma Steer). Concretely, for SAEs to be used in practice for fine-tuning to a text, we need to be able to:

  • automatically identify neurons that are semantically similar to a target text (vs. searching millions of neurons by hand)

  • propose a manipulation scheme that is parameter-independent (avoiding multiple generation runs to tune weights)

We address these two limitations in our published methods, which take a pretty simple and efficient approach: a) create SAE neuron similarity scores using prompt distances and SAE activations, which are then b) manipulated based on activations on the generated text. You can check out the code/demo for more information. Our results show that this approach generates linguistically correct text that is similar to the fine-tuning targets, provided the underlying representation of those concepts is present in the SAE (naturally limited by effects like polysemanticity, ghost neurons, etc.).
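To make the two steps concrete, here is a minimal sketch in Python. It is not the exact implementation from the repo above: the neuron profiles, sizes, and the fixed `strength` multiplier are placeholder assumptions (the published method derives the manipulation from activations on generated text rather than a hand-set constant).

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, d = 16_384, 768  # toy sizes; real SAEs are much larger

# Assumed one-time precomputation: an embedding per SAE neuron summarizing
# the text that activates it, plus an embedding of the fine-tuning target.
neuron_embs = rng.standard_normal((n_neurons, d))  # placeholder profiles
target_emb = rng.standard_normal(d)                # placeholder target text

def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

# (a) score every neuron against the target and keep the most similar ones,
# instead of searching millions of neurons by hand
scores = cosine_sim(neuron_embs, target_emb)
top_idx = np.argsort(scores)[-32:]  # hypothetical cutoff

# (b) manipulate the selected neurons during generation; the fixed multiplier
# is a stand-in for the activation-derived scheme in the actual method
def steer(sae_acts, top_idx, strength=2.0):
    steered = sae_acts.copy()
    steered[..., top_idx] *= strength  # amplify target-aligned neurons
    return steered
```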

Correctness is a standard metric, but it’s not the whole story. Because SAE manipulation happens at inference time rather than training time, when it is performed across multiple topics there’s a computational sweet spot where this approach is orders of magnitude faster than fine-tuning. Our results show that the cost is < +0.001 s/token with our unoptimized code.
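To see where that sweet spot comes from, here is a toy back-of-the-envelope comparison. The per-token overhead is the bound from above; the fine-tuning cost and workload numbers are made-up assumptions purely for illustration.

```python
# Illustrative only: the fine-tuning cost and workload sizes are assumptions.
finetune_cost_s = 3_600.0      # hypothetical: one GPU-hour to fine-tune a topic
overhead_per_token_s = 0.001   # upper bound on steering overhead from the post
tokens_per_request = 500
requests_per_topic = 1_000

steering_overhead_s = overhead_per_token_s * tokens_per_request * requests_per_topic
print(steering_overhead_s)     # 500.0 s of total overhead vs 3600 s per fine-tune

# Fine-tuning pays its cost (and stores a model) once per topic, while steering
# reuses a single base model, so the gap widens as the number of topics grows.
```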

2. Unexpected benefits of an MI approach

Accuracy and efficiency are pretty standard ways to evaluate new fine-tuning methods, but what I’m really impressed by are the additional benefits that this level of control and precision can provide in critical domains like healthcare.

SAE manipulation (using my proposed mechanism) provides a way to quantify uncertainty given the underlying SAE’s representational power. This enables additional control over the generation process, and it can quantify when the fine-tuning objective falls outside the representational power of the SAE.

Let’s talk about what that means for critical domains. We know which concepts an SAE neuron represents by processing millions of tokens and seeing which ones make it fire. Some SAE neurons remain polysemantic (a quick browse through Neuronpedia shows this), so being able to quantify which neurons are altered, and by how much, gives a good heuristic for when generated text shouldn’t be shown to a user because the SAE lacked the ability to manipulate it reliably. Additionally, this approach can prevent information leakage and support fine-tuning when little data is available. This is because our method only accesses the target text to identify which SAE neurons are semantically similar; it can then take advantage of the fact that the concepts underlying those neurons were identified via a one-time pass over millions of tokens rather than from the content of the target text itself.
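As a concrete illustration of the heuristic, here is a hedged sketch: if the neurons we would manipulate are only weakly similar to the target concept, treat the steered output as low-confidence and hold it back. The threshold and top-k values are illustrative assumptions, not values from our method.

```python
import numpy as np

def steering_confidence(scores: np.ndarray, top_k: int = 32) -> float:
    """Mean similarity of the neurons we would manipulate: a rough proxy
    for how well the SAE represents the fine-tuning target."""
    return float(np.sort(scores)[-top_k:].mean())

def safe_to_show(scores: np.ndarray, threshold: float = 0.5) -> bool:
    # If the best matches are weak (e.g. only polysemantic neurons fire for
    # the target), flag the generation for review instead of displaying it.
    return steering_confidence(scores) >= threshold
```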

3. Whole-model control is possible today

It may not be perfect, but it is possible. We can already steer models by making SAE modifications at different layers. We can also apply this approach at every layer, thus constraining the set of features the model can represent. Providing this level of control can go a long way in addressing some of @scasper’s other points, like identifying adversarial attacks, or even more general research questions, like what exactly is happening when we observe behavior like the Hydra effect in LLMs.
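Here is what the multi-layer version can look like with PyTorch forward hooks. This is a sketch under assumptions: `sae.encode`/`sae.decode` is a generic SAE API, `top_idx` comes from a similarity search like the one above, `sae_by_layer` is a hypothetical mapping of layer index to trained SAE, and the module path in the usage comment varies by model (it also ignores SAE reconstruction error for brevity).

```python
import torch

def make_hook(sae, top_idx, strength=2.0):
    """Replace a layer's output with its steered SAE reconstruction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        acts = sae.encode(hidden)        # assumed SAE API: encode/decode
        acts[..., top_idx] *= strength   # amplify target-aligned features
        new_hidden = sae.decode(acts)    # back to the residual stream
        if isinstance(output, tuple):
            return (new_hidden,) + output[1:]
        return new_hidden
    return hook

def steer_every_layer(model_layers, sae_by_layer, top_idx):
    """Attach one hook per layer that has a trained SAE; returns the handles."""
    return [
        layer.register_forward_hook(make_hook(sae_by_layer[i], top_idx))
        for i, layer in enumerate(model_layers)
        if i in sae_by_layer
    ]

# Usage (hypothetical module path; varies by model):
# handles = steer_every_layer(model.model.layers, sae_by_layer, top_idx)
# ... generate as usual, then: for h in handles: h.remove()
```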

Given the premise of LessWrong, I also believe this behavior has strong implications for framing how LLMs are “reasoning” about generated text and what that means about generative models more generally (a very divisive take, I know; I’m ready for the comments).

I’m actively looking at these questions if folks are interested in collaborating!


Some final thoughts about SAEs vs. other approaches inside and outside of MI: I find SAEs incredibly promising as tools due to their construction. They are trained to represent many concepts jointly and, in the end, are very similar to additive steering-vector approaches, as Nanda et al. 2024 show. While both kinds of approaches produce additive vectors applied to layer-level output, SAEs are more dynamic by construction: they can actually modify the additive vector based on the ‘context’ of past tokens (we use a swapping mechanism in the code).
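To illustrate the contrast, here is a hedged sketch of a fixed steering vector next to an SAE-based one, using the same assumed `encode` API as above plus an assumed `decoder_weight` attribute (rows as feature directions). The SAE’s additive contribution is recomputed from the current hidden state each step, so it varies with context.

```python
import torch

def static_steer(hidden: torch.Tensor, v: torch.Tensor, alpha: float = 4.0):
    # Classic steering vector: the same direction added at every position.
    return hidden + alpha * v

def sae_steer(hidden: torch.Tensor, sae, top_idx, strength: float = 2.0):
    acts = sae.encode(hidden)                      # depends on current context
    boost = (strength - 1.0) * acts[..., top_idx]  # extra mass per feature
    # Each feature's decoder row is its direction in the residual stream,
    # so the added vector changes as the features' activations change.
    delta = boost @ sae.decoder_weight[top_idx]    # assumed attribute name
    return hidden + delta
```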

Here’s the main point again: SAEs and other MI tools are currently available across multiple models and layers (thanks to the substantial efforts of folks in this community!). We can reuse these SAEs for downstream tasks right now. But in doing so, let’s not ignore that they enable many more applications, especially in domains that haven’t been able to benefit from the latest models.

I’m very grateful to this community and excited to move (hopefully) from a lurker to a contributor. Looking forward to continuing the discussion!
