I run the White Box Evaluations Team at the UK AI Security Institute. This is primarily a mechanistic interpretability team focussed on estimating and addressing risks associated with deceptive alignment. I’m a MATS 5.0 and ARENA 1.0 alumnus. Previously, I cofounded the AI safety research infrastructure org Decode Research and conducted independent research into the mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.
Joseph Bloom
Oh interesting! Will make a note to look into this more.
Jan shared with me! We’re excited about this direction :)
Eliciting bad contexts
Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces
Cool work! I’d be excited to see whether latents found via this method are higher quality linear classifiers when they appear to track concepts (eg: first letters) and also if they enable us to train better classifiers over model internals than other SAE architectures or linear probes (https://transformer-circuits.pub/2024/features-as-classifiers/index.html).
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
Toy Models of Feature Absorption in SAEs
Cool work!
Have you tried to generate autointerp of the SAE features? I’d be quite excited about a loop that does the following:
Take an SAE feature and get its max activating examples.
Use a multi-modal model, maybe Claude, to do autointerp via images of each of the chess positions (might be hard but with the right prompt seems doable).
Based on a codebase that implements chess logic and can be abstracted over (eg: has functions that take a board state and return whether or not statements like “is the king in check?” are true), get a model to implement a function that matches its interpretation of the feature.
Use this to generate a labelled dataset on which you then train a linear probe.
Compare the probe activations to the feature activations. In particular, see whether you can generate a better automatic interpretation of the feature if you prompt with examples of where it differs from the probe.
I suspect this is nicer than language modelling in that you can programmatically generate your data labels from explanations rather than relying on LMs. Of course, you could just decide a priori what probes to train, but the loop between autointerp and the next probe seems cool to me. I predict that current SAE training methods will result in long description lengths or low recall, and that the tradeoff will be poor.
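A minimal sketch of the probe-training and comparison steps of this loop, with synthetic stand-ins throughout (the activations, the programmatic "chess logic" label function, and the SAE latent are all simulated here; none of the names refer to a real codebase):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for model activations on 500 chess positions.
d_model, n_pos = 32, 500
acts = rng.normal(size=(n_pos, d_model))

# Assume the model linearly represents the concept along some direction;
# `labels` stands in for programmatic labels from the chess-logic codebase
# (e.g. the output of an "is the king in check?" function).
concept_dir = rng.normal(size=d_model)
labels = (acts @ concept_dir > 0).astype(float)

# Train a linear probe on the programmatic labels (plain least squares on
# centered labels for simplicity; logistic regression is the usual choice).
w, *_ = np.linalg.lstsq(acts, labels - labels.mean(), rcond=None)
probe_scores = acts @ w

# A noisy "SAE latent" that tracks the same concept imperfectly.
noisy_dir = concept_dir + rng.normal(scale=0.5, size=d_model)
latent = np.maximum(acts @ noisy_dir, 0)

# Compare probe and latent: the disagreement examples are the ones you
# would feed back into the autointerp prompt to refine the explanation.
probe_pred = probe_scores > 0
latent_pred = latent > 0
disagree = np.flatnonzero(probe_pred != latent_pred)
print(f"probe accuracy: {(probe_pred == labels).mean():.2f}")
print(f"disagreement examples: {len(disagree)}")
```

The interesting quantity is the disagreement set: prompting the autointerp model with those positions is what would let it generate a sharper explanation, which in turn defines the next probe.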
Great work! I think this is a good outcome for a week at the end of ARENA (getting some results, publishing them, connecting with existing literature) and would be excited to see more done here. Specifically, even without using an SAE, you could search for max activating examples for each steering vector you found by using it as an encoder vector (just take the dot product with activations).
In terms of more serious followup, I’d like to much better understand what vectors are being found (eg by comparing to SAEs or searching in the SAE basis with a sparsity penalty), how much we could get out of seriously scaling this up and whether we can find useful applications (eg: is this a faster / cheaper way to elicit model capabilities such as in the context of sandbagging).
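The encoder-vector suggestion above is just a dot product and a sort; a sketch, with random arrays standing in for real residual-stream activations and a real steering vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: activations for 1000 tokens and a steering vector
# found by some search method (shapes and names are illustrative).
n_tokens, d_model = 1000, 64
acts = rng.normal(size=(n_tokens, d_model))
steering_vec = rng.normal(size=d_model)

# Treat the steering vector as an encoder direction: its dot product with
# each activation scores how strongly that token "activates" the vector.
scores = acts @ steering_vec

# Indices of the max activating tokens, highest score first.
top_k = np.argsort(scores)[::-1][:10]
print(top_k)
```

In practice you would map `top_k` back to the tokens and contexts in your dataset and read those off, exactly as you would for an SAE latent's max activating examples.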
I think that’s exactly what we did? Though to be fair we de-emphasized this version of the narrative in the paper: We asked whether Gemma-2-2b could spell / do the first letter identification task. We then asked which latents causally mediated spelling performance, comparing SAE latents to probes. We found that we couldn’t find a set of 26 SAE latents that causally mediated spelling because the relationship between the latents and the character information, “exogenous factors”, if I understand your meaning, wasn’t as clear as it should have been. As I emphasized in a different comment, this work is not about mechanistic anomalies or how the model spells, it’s about measurement error in the SAE method.
This thread reminds me that comparing feature absorption in SAEs with tied encoder / decoder weights and in end-to-end SAEs seems like valuable follow up.
Thanks Egg! Really good question. Short answer: Look at MetaSAEs for inspiration.
Long answer:
There are a few reasons to believe that feature absorption won’t just be a thing for graphemic information:
People have noticed SAE latent false negatives in general, beyond just spelling features. For example, this quote from the Anthropic August update (I think they also make a comment about feature coordination being important in the July update as well):
> If a feature is active for one prompt but not another, the feature should capture something about the difference between those prompts, in an interpretable way. Empirically, however, we often find this not to be the case – often a feature fires for one prompt but not another, even when our interpretation of the feature would suggest it should apply equally well to both prompts.
MetaSAEs are highly suggestive of lots of absorption. Starts-with-letter features are found by MetaSAEs along with lots of others (my personal favorite is a ” Jerry” feature on which a Jewish meta-feature fires. I wonder what that’s about!?) 🤔
Conceptually, being token- or character-specific doesn’t play a big role. As Neel mentioned in his tweet here, once you understand the concept, it’s clear that this is a strategy for generating sparsity in general whenever you have this kind of relationship between concepts. Here’s a latent in the MetaSAE app that’s a bit less token-aligned but can still be decomposed into meta-latents.
In terms of what I really want to see people look at: what wasn’t clear from Meta-SAEs (and which I think is clearer here) is that absorption matters for interpretable causal mediation. That is, for the spelling task, absorbing features look like a kind of mechanistic anomaly (but are actually an artefact of the method) in which the spelling information is absorbed. But if we found absorption in a case where we didn’t know the model knew a property of some concept (or we didn’t know it was a property), but saw it in the meta-SAE, that would be very cool. Imagine seeing attribution to a latent tracking something about a person, but then the meta-latents tell you that the model was actually leveraging some very specific fact about that person. This might be really important for understanding things like sycophancy…
Great work! Using spelling is a very clear example of how information gets absorbed into an SAE latent, and indeed in Meta-SAEs we found many spelling/sound-related meta-latents.
Thanks! We were sad not to have time to try out Meta-SAEs but want to in the future.
I have been thinking a bit on how to solve this problem and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.
Potentially, this would remove the “Starts-with-L”-component from the “lion”-token direction and activate the “Starts-with-L” latent instead. Although this would come at the cost of worse sparsity/reconstruction.
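A toy numpy sketch of what that combined objective might look like (all shapes, names, and coefficients here are illustrative assumptions, and real training would alternate gradient updates between the two models rather than evaluate fixed random weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: x stands in for activations; W_enc/W_dec for the
# SAE; M_enc/M_dec for a meta-SAE over the SAE's decoder directions.
d_model, n_latents, n_meta = 16, 64, 8
x = rng.normal(size=(100, d_model))
W_enc = rng.normal(size=(d_model, n_latents)) * 0.1
W_dec = rng.normal(size=(n_latents, d_model)) * 0.1
M_enc = rng.normal(size=(d_model, n_meta)) * 0.1
M_dec = rng.normal(size=(n_meta, d_model)) * 0.1

# Standard SAE terms: reconstruction error plus an L1 sparsity penalty.
f = np.maximum(x @ W_enc, 0)                # SAE latent activations
recon_loss = ((f @ W_dec - x) ** 2).mean()
sparsity_loss = np.abs(f).mean()

# The meta-SAE tries to reconstruct the SAE's decoder directions; the SAE
# is penalized when its directions ARE well reconstructed (decomposable).
g = np.maximum(W_dec @ M_enc, 0)            # meta-latent activations
meta_recon_err = ((g @ M_dec - W_dec) ** 2).mean()

# GAN-style split: the SAE maximizes the meta-SAE's error (note the minus
# sign), while the meta-SAE minimizes it.
sae_loss = recon_loss + 0.1 * sparsity_loss - 0.1 * meta_recon_err
meta_loss = meta_recon_err
print(f"sae_loss={sae_loss:.3f}, meta_loss={meta_loss:.3f}")
```

The minus sign on `meta_recon_err` in `sae_loss` is the adversarial part: it pushes the SAE away from decoder directions the meta-SAE can decompose, at the cost (as noted above) of worse sparsity/reconstruction.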
I think this is the wrong way to go to be honest. I see it as doubling down on sparsity and a single decomposition, both of which I think may just not reflect the underlying data generating process. Heavily inspired by some of John Wentworth’s ideas here.
Rather than doubling down on one single-layered decomposition for all activations, why not go with a multi-layered decomposition (ie: some combination of SAE and metaSAE, preferably as unsupervised as possible)? Or alternatively, maybe the decomposition that is most useful changes case by case, and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context.
[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
It seems that PIBBSS might be pivoting away from higher variance blue sky research to focus on more mainstream AI interpretability. While this might create more opportunities for funding, I think this would be a mistake. The AI safety ecosystem needs a home for “weird ideas” and PIBBSS seems the most reputable, competent, EA-aligned place for this! I encourage PIBBSS to “embrace the weird,” albeit while maintaining high academic standards for basic research, modelled off the best basic science institutions.
I was a recent PIBBSS mentor, and am a mech interp person who is likely to be considered mainstream by many people and for this reason I wanted to push back on this concern.
A few thoughts:
I don’t want to put words in your mouth, but I do want to clarify that we shouldn’t conflate having some mainstream mech interp with being only mech interp. Importantly, to my knowledge, there is very little chance of PIBBSS doing mech interp exclusively, so I think the salient question is whether they should have “a bit” (say 5-10% of scholars) do mech interp (which is likely more than they used to). I would advocate for a steady-state proportion of between 10-20% (see further points for detail).
In my opinion, the word “mainstream” suggests redundancy and brings to mind the idea that “well this could just be done at MATS”. There are two reasons I think this is inaccurate.
First, PIBBSS is likely to accept mentees who may not get into MATS or similar programs: mentees with diverse backgrounds and possibly different skillsets. In my opinion, this kind of diversity can be highly valuable and bring new perspectives to mech interp (which is a pre-paradigmatic field in need of new takes). I’m moderately confident that to the extent existing programs are highly selective, we should expect diversity to suffer in them (if you take the top 10% by metrics like competence, you’re less likely to get the top 10% by diversity of intellectual background).
Secondly, I think there’s a statistical term for this but I forget what it is. PIBBSS being a home for weird ideas in mech interp as much as weird ideas in other areas of AI safety seems entirely reasonable to me.
I also think that even some mainstream mech interp (and possible other areas like evals) should be a part of PIBBSS because it enriches the entire program:
My experience of the PIBBSS retreat and subsequent interactions suggests that a lot of value is created by having people who do empirical work interact with people who do more theoretical work. Empiricists gain ideas and perspective from theorists and theoretical researchers are exposed to more real world observations second hand.
I weakly suspect that some debates / discussions in AI safety may be lopsided / missing details via the absence of sub-fields. In my opinion it’s valuable to sometimes mix up who is in the room, but likely worse in expectation to always remove mech interp people (hence my advocacy for 10-20% empiricists, with half of them being interp).
Some final notes:
I’m happy to share details of the work my scholar and I did which we expect to publish in the next month.
I’ll be a bit forward and suggest that if you (Ryan) or any other funders find the arguments above convincing, then it’s possible you might want to further fund PIBBSS and ask Nora how PIBBSS can source a bit more “weird” mech interp, a bit of mainstream mech interp, and some other empirical sub-fields for the program.
I’ll share this in the PIBBSS slack to see if others want to comment :)
Good work! I’m sure you learned a lot while doing this and I’m a big fan of people publishing artifacts produced during upskilling. ARENA just updated its SAE content, so that might also be a good next step for you!
Showing SAE Latents Are Not Atomic Using Meta-SAEs
Stitching SAEs of different sizes
7B parameter PM
@Logan Riggs this link doesn’t work for me.
Good resource: https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J ← Neel Nanda’s glossary.
> What is a feature?
This often gets confused because the early literature doesn’t distinguish well between a property of the input represented by a model and the internal representation itself. These days we tend to refer to the former as a feature and the latter as a latent. Eg: “Not all Language Model Features are Linear” ⇒ not all of the representations are linear (it is not a statement about what gets represented).
> Are there different circuits that appear in a network based on your definition of what a relevant feature is?
This question seems potentially confusing. If you use different methods (eg: supervised vs unsupervised) you are likely to find different results. Eg: in a paper I supervised (https://arxiv.org/html/2409.14507v2) we looked at how SAEs compare to linear probes. This was a comparison of methods for finding representations. I don’t know of any work doing circuit finding with multiple feature-finding methods, though (but I’d be excited about it).
> How crisp are these circuits that appear, both in toy examples and in the wild?
Read ACDC (https://arxiv.org/abs/2304.14997). Generally, not crisp.
> What are the best examples of “circuits in the wild” that are actually robust?
The ARENA curriculum probably covers a few. There might be some papers comparing circuit-finding methods that use a standard set of circuits you could find.
> If I have a tiny network trained on an algorithmic task, is there an automated search method I can use to identify relevant subgraphs of the neural network doing meaningful computation in a way that the circuits are distinct from each other?
Interesting question. See Neel’s thoughts here: https://www.neelnanda.io/mechanistic-interpretability/othello#finding-modular-circuits
> Does this depend on training?
Probably yes. Probably also on how the different tasks relate to each other (whether they have shareable intermediate results).
> (Is there a way to classify all circuits in a network (or >10% of them) exhaustively in a potentially computationally intractable manner?)
I don’t know if circuits are a good enough description of reality for this to be feasible. But you might find this interesting https://arxiv.org/abs/2501.14926