I run the White Box Evaluations Team at the UK AI Security Institute. This is primarily a mechanistic interpretability team focussed on estimating and addressing risks associated with deceptive alignment. I'm a MATS 5.0 and ARENA 1.0 alumnus. Previously, I cofounded the AI safety research infrastructure org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.
Joseph Bloom
Oh interesting! Will make a note to look into this more.
Jan shared with me! We’re excited about this direction :)
Cool work! I’d be excited to see whether latents found via this method are higher quality linear classifiers when they appear to track concepts (eg: first letters) and also if they enable us to train better classifiers over model internals than other SAE architectures or linear probes (https://transformer-circuits.pub/2024/features-as-classifiers/index.html).
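To make the comparison concrete, here is a minimal sketch (my own, with assumed variable names, not from the post) of treating a single SAE latent as a binary classifier for a concept and comparing it against a logistic-regression probe trained on the same activations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def compare_latent_to_probe(acts: np.ndarray, latent_acts: np.ndarray, y: np.ndarray):
    """acts: [n, d_model] model activations; latent_acts: [n] SAE latent activations;
    y: [n] binary concept labels (eg: "token starts with L"). All assumed to be given."""
    probe = LogisticRegression(max_iter=1000).fit(acts, y)
    probe_auc = roc_auc_score(y, probe.decision_function(acts))
    latent_auc = roc_auc_score(y, latent_acts)  # latent activation used directly as a score
    return probe_auc, latent_auc
```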
Cool work!
Have you tried to generate autointerp of the SAE features? I’d be quite excited about a loop that does the following:
1. Take an SAE feature and get the max activating examples.
2. Use a multi-modal model, maybe Claude, to do autointerp via images of each of the chess positions (might be hard, but with the right prompt it seems doable).
3. Based on a codebase that implements chess logic which can be abstracted away (eg: has functions that take a board state and return whether statements like "is the king in check?" are true), get a model to implement a function that matches its interpretation of the feature.
4. Use this to generate a labelled dataset on which you then train a linear probe.
5. Compare the probe activations to the feature activations. In particular, see whether you can generate a better automatic interpretation of the feature if you prompt with examples of where it differs from the probe.
I suspect this is nicer than language modelling in that you can programmatically generate your data labels from explanations rather than relying on LMs. Of course, you could just decide a priori what probes to train, but the loop between autointerp and the next probe seems cool to me. I predict that current SAE training methods will result in long description lengths or low recall and that the tradeoff will be poor.
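To make steps 3-5 concrete, here's a minimal sketch (mine, not from the post; `fens` and `acts` are hypothetical inputs) of turning an autointerp hypothesis into a programmatic labelling function with python-chess and fitting a linear probe on those labels:

```python
import chess
import numpy as np
from sklearn.linear_model import LogisticRegression

def label_fn(fen: str) -> bool:
    """Programmatic version of an autointerp hypothesis, eg: 'the side to move is in check'."""
    return chess.Board(fen).is_check()

def train_probe(fens: list[str], acts: np.ndarray) -> LogisticRegression:
    """fens: board states as FEN strings; acts: [n_positions, d_model] model activations."""
    y = np.array([label_fn(fen) for fen in fens], dtype=int)
    return LogisticRegression(max_iter=1000).fit(acts, y)

# Positions where the probe's score and the SAE feature's activation disagree most are
# the ones to feed back into the next autointerp prompt.
```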
Great work! I think this is a good outcome for a week at the end of ARENA (getting some results, publishing them, connecting with existing literature) and would be excited to see more done here. Specifically, even without using an SAE, you could search for max activating examples for each steering vector you found by using it as an encoder vector (just take the dot product with activations).
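A minimal sketch of that search (my own, with assumed tensor names):

```python
import torch

def top_activating_positions(acts: torch.Tensor, steering_vec: torch.Tensor, k: int = 20):
    """acts: [n_tokens, d_model] activations; steering_vec: [d_model] steering vector.
    Scores every token position by its dot product with the vector and returns the top k."""
    scores = acts @ steering_vec
    top = torch.topk(scores, k)
    return top.indices, top.values  # map indices back to (prompt, token) to inspect examples
```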
In terms of more serious follow-up, I'd like to understand much better what vectors are being found (eg: by comparing to SAEs or searching in the SAE basis with a sparsity penalty), how much we could get out of seriously scaling this up, and whether we can find useful applications (eg: is this a faster / cheaper way to elicit model capabilities, such as in the context of sandbagging?).
I think that's exactly what we did? Though to be fair, we de-emphasized this version of the narrative in the paper: we asked whether Gemma-2-2b could spell / do the first-letter identification task. We then asked which latents causally mediated spelling performance, comparing SAE latents to probes. We found that we couldn't find a set of 26 SAE latents that causally mediated spelling, because the relationship between the latents and the character information ("exogenous factors", if I understand your meaning) wasn't as clear as it should have been. As I emphasized in a different comment, this work is not about mechanistic anomalies or how the model spells; it's about measurement error in the SAE method.
This thread reminds me that comparing feature absorption in SAEs with tied encoder / decoder weights and in end-to-end SAEs seems like valuable follow up.
Thanks Egg! Really good question. Short answer: look at MetaSAEs for inspiration.
Long answer:
There are a few reasons to believe that feature absorption won’t just be a thing for graphemic information:
People have noticed SAE latent false negatives in general, beyond just spelling features. For example, this quote from the Anthropic August update (I think they also make a comment about feature coordination being important in the July update):
> If a feature is active for one prompt but not another, the feature should capture something about the difference between those prompts, in an interpretable way. Empirically, however, we often find this not to be the case – often a feature fires for one prompt but not another, even when our interpretation of the feature would suggest it should apply equally well to both prompts.
MetaSAEs are highly suggestive of lots of absorption. "Starts-with-letter" features are found by MetaSAEs along with lots of others (my personal favorite is a " Jerry" feature on which a Jewish meta-feature fires. I wonder what that's about!?) 🤔
Conceptually, being token or character specific doesn’t play a big role. As Neel mentioned in his tweet here, once you understand the concept, it’s clear that this is a strategy for generating sparsity in general when you have this kind of relationship between concepts. Here’s a latent that’s a bit less token aligned in the MetaSAE app which can still be decomposed into meta-latents.
In terms of what I really want to see people look at: what wasn't clear from Meta-SAEs (which I think is clearer here) is that absorption is important for interpretable causal mediation. That is, for the spelling task, absorbing features look like a kind of mechanistic anomaly (but are actually an artefact of the method) where the spelling information is absorbed. But if we found absorption in a case where we didn't know the model knew a property of some concept (or we didn't know it was a property), but saw it in the meta-SAE, that would be very cool. Imagine seeing attribution to a latent tracking something about a person, but then the meta-latents tell you that the model was actually leveraging some very specific fact about that person. This might be really important for understanding things like sycophancy…
Great work! Using spelling is a very clear example of how information gets absorbed into SAE latents, and indeed in Meta-SAEs we found many spelling/sound-related meta-latents.
Thanks! We were sad not to have time to try out Meta-SAEs but want to in the future.
I have been thinking a bit about how to solve this problem, and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.
Potentially, this would remove the "Starts-with-L" component from the "lion" token direction and activate the "Starts-with-L" latent instead, although this would come at the cost of worse sparsity/reconstruction.
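A rough sketch of how that adversarial setup could look (my framing, not the commenter's code; `sae` and `meta_sae` are assumed to expose `encode`/`decode` and a decoder matrix `W_dec` of shape [n_latents, d_model], the usual sparsity penalty is omitted for brevity, and the two losses would be stepped with separate optimisers, GAN-style):

```python
import torch

def sae_step_loss(sae, meta_sae, x: torch.Tensor, adv_coeff: float = 0.1):
    """Loss for the SAE's optimiser step: reconstruct activations x, plus an adversarial
    term rewarding decoder directions the (frozen) meta-SAE reconstructs poorly."""
    recon = sae.decode(sae.encode(x))
    recon_loss = (recon - x).pow(2).mean()
    with torch.no_grad():  # meta-SAE is frozen on the SAE's step
        meta_recon = meta_sae.decode(meta_sae.encode(sae.W_dec))
    # Reward decoder directions the meta-SAE fails to reconstruct (hence the minus sign).
    adv_term = -adv_coeff * (meta_recon - sae.W_dec).pow(2).mean()
    return recon_loss + adv_term

def meta_sae_step_loss(sae, meta_sae):
    """Loss for the meta-SAE's optimiser step: reconstruct the SAE's decoder directions."""
    dirs = sae.W_dec.detach()  # SAE is frozen on the meta-SAE's step
    meta_recon = meta_sae.decode(meta_sae.encode(dirs))
    return (meta_recon - dirs).pow(2).mean()
```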
I think this is the wrong way to go to be honest. I see it as doubling down on sparsity and a single decomposition, both of which I think may just not reflect the underlying data generating process. Heavily inspired by some of John Wentworth’s ideas here.
Rather than doubling down on a single, single-layered decomposition for all activations, why not go with a multi-layered decomposition (ie: some combination of SAE and meta-SAE, preferably as unsupervised as possible)? Or alternatively, maybe the decomposition that is most useful in each case changes, and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context.
> It seems that PIBBSS might be pivoting away from higher variance blue sky research to focus on more mainstream AI interpretability. While this might create more opportunities for funding, I think this would be a mistake. The AI safety ecosystem needs a home for "weird ideas" and PIBBSS seems the most reputable, competent, EA-aligned place for this! I encourage PIBBSS to "embrace the weird," albeit while maintaining high academic standards for basic research, modelled off the best basic science institutions.
I was a recent PIBBSS mentor, and am a mech interp person who is likely to be considered mainstream by many people, so I wanted to push back on this concern.
A few thoughts:
I don't want to put words in your mouth, but I do want to clarify that we shouldn't conflate having some mainstream mech interp with being only mech interp. Importantly, to my knowledge, there is very little chance of PIBBSS doing entirely mech interp, so I think the salient question is whether they should have "a bit" (say 5-10% of scholars) do mech interp (which is likely more than they used to). I would advocate for a steady-state proportion of between 10-20% (see further points for detail).
In my opinion, the word “mainstream” suggests redundancy and brings to mind the idea that “well this could just be done at MATS”. There are two reasons I think this is inaccurate.
First, PIBBSS is likely to accept mentees who may not get into MATS / similar programs: mentees with diverse backgrounds and possibly different skillsets. In my opinion, this kind of diversity can be highly valuable and bring new perspectives to mech interp (which is a pre-paradigmatic field in need of new takes). I'm moderately confident that, to the extent existing programs are highly selective, we should expect diversity to suffer in them (if you take the top 10% by metrics like competence, you're less likely to get the top 10% by diversity of intellectual background).
Secondly (I think there's a statistical term for this, but I forget what it is), PIBBSS being a home for weird ideas in mech interp as much as for weird ideas in other areas of AI safety seems entirely reasonable to me.
I also think that even some mainstream mech interp (and possibly other areas like evals) should be part of PIBBSS, because it enriches the entire program:
My experience of the PIBBSS retreat and subsequent interactions suggests that a lot of value is created by having people who do empirical work interact with people who do more theoretical work. Empiricists gain ideas and perspective from theorists, and theoretical researchers are exposed to more real-world observations second hand.
I weakly suspect that some debates / discussions in AI safety may be lopsided / missing details due to the absence of sub-fields. In my opinion, it's valuable to sometimes mix up who is in the room, but it's likely worse in expectation to always remove mech interp people (hence my advocacy for 10-20% empiricists, with half of them being interp).
Some final notes:
I’m happy to share details of the work my scholar and I did which we expect to publish in the next month.
I'll be a bit forward and suggest that if you (Ryan) or any other funders find the arguments above convincing, then you might want to fund PIBBSS further and ask Nora how PIBBSS can source a bit more "weird" mech interp, a bit of mainstream mech interp, and some other empirical sub-fields for the program.
I'll share this in the PIBBSS slack to see if others want to comment :)
Good work! I'm sure you learned a lot while doing this and I'm a big fan of people publishing artifacts produced during upskilling. ARENA just updated its SAE content, so that might also be a good next step for you!
7B parameter PM
@Logan Riggs this link doesn’t work for me.
Thanks for writing this up. A few points:
- I generally agree with most of the things you’re saying and am excited about this kind of work. I like that you endorse empirical investigations here and think there are just far fewer people doing these experiments than anyone thinks.
- Structure between features seems like the underdog of research agendas in SAE research (which I feel I can reasonably claim to have been advocating for in many discussions over the preceding months). Mainly, I think it presents the most obvious candidate for reducing the description length issue with larger SAEs.
- I'm working on a project looking into this (and am aware of several others) but I don't think this should deter people who are interested from playing around. It's fairly easy to get going on these projects using my library and neuronpedia.
> For example, it seems hard to understand how a tree-like structure could explain circular features.
Tree structure between features is easy to find, with hierarchical clustering providing a degree of insight into the feature space that is not achieved by other methods like U-MAP. I would interpret this as a kind of “global structure” whereas day of the week geometry is probably more local. It seems totally plausible that a tree is a reasonable characterisation of structure at a high level without being a perfect characterisation.
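As a concrete illustration (a minimal sketch with assumed names, not anyone's published code), the tree view can be produced by hierarchically clustering the SAE decoder directions by cosine distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def feature_tree(W_dec: np.ndarray):
    """W_dec: [n_latents, d_model] SAE decoder matrix. Returns a linkage matrix that
    scipy.cluster.hierarchy.dendrogram can plot as a feature tree."""
    dists = pdist(W_dec, metric="cosine")    # pairwise cosine distances between decoder directions
    return linkage(dists, method="average")  # average-linkage hierarchical clustering
```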
> The days of the week/months of the year lie on a circle, in order. Let's be clear about what the interesting finding is from Engels et al.: it's not that all the days of the week have high cosine sim with each other, or even really that they live in a subspace, but that they are in order!
I think another part of the result here was that the PCA of the lower-dimensional space spanned by the day-of-the-week features was much clearer in showing the geometry than simply doing PCA over the decoder weights (see below). I double-checked this just now, and you can actually get the correct ordering just from the features, but it's much less obvious what's happening (imo). If you look at these features, they also tend to fire on multiple days of the week with different strengths. The lesson here is that co-occurrence of features may matter a lot in particular subspaces.
Layer 7, GPT-2 small. Decoder weight PCA on day of the week features. Feature labels come from max activating examples. See dashboards here.
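For anyone wanting to reproduce the comparison, here is a minimal sketch (hypothetical helper names, not the paper's code) of the two PCAs: one directly over the selected decoder rows, and one over activations projected into the subspace those rows span:

```python
import numpy as np
from sklearn.decomposition import PCA

def decoder_pca(W_dec: np.ndarray, day_idx: np.ndarray) -> np.ndarray:
    """PCA directly over the day-of-the-week decoder directions. W_dec: [n_latents, d_model]."""
    return PCA(n_components=2).fit_transform(W_dec[day_idx])

def subspace_pca(W_dec: np.ndarray, day_idx: np.ndarray, acts: np.ndarray) -> np.ndarray:
    """PCA over activations (acts: [n_tokens, d_model]) projected into the subspace
    spanned by the day-of-the-week decoder directions."""
    Q, _ = np.linalg.qr(W_dec[day_idx].T)  # orthonormal basis for the span, [d_model, n_days]
    projected = acts @ Q                   # activations expressed in that subspace
    return PCA(n_components=2).fit_transform(projected)
```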
Maybe we should make fake datasets for this? Neurons often aren’t that interpretable and we’re still confused about SAE features a lot of the time. It would be nice to distinguish “can do autointerp | interpretable generating function of complexity x” from “can do autointerp”.
SAEs are model specific. You need Pythia SAEs to investigate Pythia. I don’t have a comprehensive list but you can look at the sparse autoencoder tag on LW for relevant papers.
Thanks Joel. I appreciated this. Wish I had time to write my own version of this. Alas.
> Previously I've seen the rule of thumb "20-100 for most models". Anthropic says:
We were saying this, and I think it might be an area of debate in the community for a few reasons. It could be that the "true L0" is actually very high. It could be that low-activating features aren't contributing much to your reconstruction and so aren't actually an issue in practice. It's possible the right L1 or L0 is affected by model size, context length or other details which aren't being accounted for in these debates. A thorough study examining post-hoc removal of low-activating or low-norm features could help. FWIW, it's not obvious to me that L0 should be lower / higher, and I think we should be careful not to cargo-cult the stat. Probably we're not at too much risk here since we're discussing this out in the open already.
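A minimal sketch of the kind of post-hoc ablation I have in mind (assumed `sae` interface with `encode`/`decode`): zero out latent activations below a threshold and measure how much reconstruction degrades as the threshold is swept.

```python
import torch

def recon_loss_with_threshold(sae, x: torch.Tensor, threshold: float) -> float:
    """x: [batch, d_model] model activations. Zeroes latent activations below `threshold`
    before decoding, so sweeping the threshold shows whether low-activating latents
    actually matter for reconstruction."""
    latents = sae.encode(x)
    latents = torch.where(latents.abs() < threshold, torch.zeros_like(latents), latents)
    recon = sae.decode(latents)
    return (recon - x).pow(2).mean().item()
```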
> Having multiple different-sized SAEs for the same model seems useful. The dashboard shows feature splitting clearly. I hadn't ever thought of comparing features from different SAEs using cosine similarity and plotting them together with UMAP.
Different SAEs, same activations. Makes sense, since it's notionally the same vector space. Apollo did this recently when comparing e2e vs vanilla SAEs. I'd love someone to come up with better measures of U-MAP quality, as the primary issue with them is the risk of arbitrariness.
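A minimal sketch of that comparison (assumed decoder matrices `W_dec_a`, `W_dec_b` from two SAEs trained on the same activations; requires the umap-learn package):

```python
import numpy as np
import umap

def max_cosine_sims(W_dec_a: np.ndarray, W_dec_b: np.ndarray) -> np.ndarray:
    """For each latent in SAE A, the cosine similarity of its best match in SAE B."""
    A = W_dec_a / np.linalg.norm(W_dec_a, axis=1, keepdims=True)
    B = W_dec_b / np.linalg.norm(W_dec_b, axis=1, keepdims=True)
    return (A @ B.T).max(axis=1)

def joint_umap(W_dec_a: np.ndarray, W_dec_b: np.ndarray):
    """UMAP of both decoders stacked into one point cloud, plus labels for which SAE each point came from."""
    stacked = np.concatenate([W_dec_a, W_dec_b], axis=0)
    emb = umap.UMAP(metric="cosine").fit_transform(stacked)
    labels = np.array([0] * len(W_dec_a) + [1] * len(W_dec_b))
    return emb, labels
```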
> Neither of these plots seems great. They both suggest to me that these SAEs are "leaky" in some sense at lower activation levels, but in opposite ways:
This could be bad. Could also be that the underlying information is messy and there’s interference or other weird things going on. Not obvious that it’s bad measurement as opposed to messy phenomena imo. Trying to distinguish the two seems valuable.
> 4. On Scaling
Yup. Training simultaneously could be good. It's an engineering challenge. I would reimplement good proofs of concept that suggest this is feasible and show how to do it. I'd also like to point out that this isn't the first time a science has had this issue.
On some level, I think this challenge directly parallels bioinformatics / gene sequencing. They needed a Human Genome Project because it was expensive and ambitious and individual actors couldn't do it on their own. But collaborating is hard. Maybe EA in particular can get the ball rolling here faster than it might otherwise happen. The NDIF / Bau Lab might also be a good banner to line up behind.
> I didn't notice many innovations here—it was mostly scaling pre-existing techniques to a larger model than I had seen previously. The good news is that this worked well. The bad news is that none of the old challenges have gone away.
Agreed. I think the point was basically scale. Criticisms along the lines of "this isn't tackling the hard part of the problem or proving interp is actually useful" are unproductive if that wasn't the intention. Anthropic has 3 teams now and counting doing this stuff. They're definitely working on a bunch of harder / other stuff that maybe focuses on the real bottlenecks.
> All young people and other newcomers should be made aware that on-paradigm AI safety/alignment—while being more tractable, feedbacked, well-resourced, and populated compared to theory—is also inevitably streetlighting https://en.wikipedia.org/wiki/Streetlight_effect.
Half-agree. I think there's scope within fields like interp to focus on things that are closer to the hard part of the problem, or that at least touch on robust bottlenecks for alignment agendas (eg: ontology identification). I do think there is a lot of diversity in people working in these more legible areas, and that means there are now many people who haven't engaged with or understood the alignment problem well enough to realise where we might be suffering from the streetlight effect.
I think so, but expect others to object. I think many people interested in circuits are using attn and MLP SAEs and experimenting with transcoders and SAE variants for attn heads. It depends on how much you care about being able to say what an attn head or MLP is doing, or whether you're happy to just talk about features. Sam Marks at the Bau Lab is the person to ask.
Neuronpedia has an API (copying from a message Johnny recently wrote to someone else):
"Docs are coming soon but it's really simple to get JSON output of any feature. Just add "/api/feature/" right after "neuronpedia.org". For example, for this feature: https://neuronpedia.org/gpt2-small/0-res-jb/0
the JSON output of it is here: https://www.neuronpedia.org/api/feature/gpt2-small/0-res-jb/0
(Both are GET requests so you can do it in your browser.) Note the additional "/api/feature/". I would prefer you not do this 100,000 times in a loop though; if you'd like a data dump we'd rather give it to you directly."
Feel free to join the OSMI slack and post in the Neuronpedia or Sparse Autoencoder channels if you have similar questions in the future :) https://join.slack.com/t/opensourcemechanistic/shared_invite/zt-1qosyh8g3-9bF3gamhLNJiqCL_QqLFrA
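For completeness, fetching one feature programmatically is just a GET request (a sketch using the URL quoted above):

```python
import requests

resp = requests.get("https://www.neuronpedia.org/api/feature/gpt2-small/0-res-jb/0")
feature = resp.json()  # the same JSON you'd see by opening the URL in a browser
```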
Good resource: https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J ← Neel Nanda’s glossary.
> What is a feature?
This often gets confused because early literature doesn't distinguish well between a property of the input represented by a model and the model's internal representation of it. We tend to refer to the former as a feature and the latter as a latent these days. Eg: "Not all Language Model Features are Linear" ⇒ not all the representations are linear (it's not a statement about what gets represented).
> Are there different circuits that appear in a network based on your definition of what a relevant feature is?
This question seems potentially confusing. If you use different methods (eg: supervised vs unsupervised), you are likely to find different results. Eg: in a paper I supervised (https://arxiv.org/html/2409.14507v2) we looked at how SAEs compared to linear probes; this was a comparison of methods for finding representations. I don't know of any work doing circuit finding with multiple feature-finding methods though (but I'd be excited about it).
> How crisp are these circuits that appear, both in toy examples and in the wild?
Read ACDC: https://arxiv.org/abs/2304.14997. Generally, not crisp.
> What are the best examples of “circuits in the wild” that are actually robust?
The ARENA curriculum probably covers a few. There might be some papers comparing circuit-finding methods that use a standard set of circuits you could find.
> If I have a tiny network trained on an algorithmic task, is there an automated search method I can use to identify relevant subgraphs of the neural network doing meaningful computation in a way that the circuits are distinct from each other?
Interesting question. See Neel’s thoughts here: https://www.neelnanda.io/mechanistic-interpretability/othello#finding-modular-circuits
> Does this depend on training?
Probably yes. Probably also on how the different tasks relate to each other (whether they have shareable intermediate results).
> (Is there a way to classify all circuits in a network (or >10% of them) exhaustively in a potentially computationally intractable manner?)
I don’t know if circuits are a good enough description of reality for this to be feasible. But you might find this interesting https://arxiv.org/abs/2501.14926