Joseph Bloom

Karma: 1,195

I run the White Box Evaluations Team at the UK AI Security Institute. This is primarily a mechanistic interpretability team focussed on estimating and addressing risks associated with deceptive alignment. I’m a MATS 5.0 and ARENA 1.0 alumnus. Previously, I cofounded the AI Safety Research Infrastructure Org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.

Joseph Bloom May 9, 2025, 1:22 PM
4 points
0
on: Misrepresentation as a Barrier for Interp (Part I)
I think this is a valuable read for people who work in interp but feel like I want to add a few ideas:
- Distinguishing Misrepresentation from Mismeasurement: Interpretability researchers use techniques that find vectors which we say correspond to the representations of the model, but the methods we use to find those may be imperfect. For example, if your cat SAE feature also lights up on raccoons, then maybe this is a true property of the model’s cat detector (that it also lights up on raccoons) or maybe this is an artefact of the SAE loss function. Maybe the true cat detector doesn’t get fooled by raccoons, but your SAE latent is biased in some way. See this paper that I supervised for more concrete observations.
- What are the canonical units? It may be that there is a real sense in which the model has a cat detector but maybe at the layer at which you tried to detect it, the cat detector is imperfect. If the model doesn’t function as if it has an imperfect cat detector then maybe downstream of the cat detector is some circuitry for catching/correcting specific errors. This means that finding a local cat detector which might have misrepresentation issues isn’t in itself sufficient to argue that the model as a whole has those issues. Selection pressures apply to the network as a whole and not necessarily always to the components. The fact that we see so much modularity is probably not random (John’s written about this) but if I’m not mistaken, we don’t have strong reasons to believe that the thing that looks like a cat detector must be the model’s one true cat detector.
I’d be excited for some empirical work following up on this. One idea might be to train toy models which are incentivised to contain imperfect detectors (eg: there is a noisy signal but reward is optimised by having a bias toward recall or precision in some of the intermediate inferences). Identifying intermediate representations in such models could be interesting.
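To make the toy-model idea a bit more concrete, here is a minimal sketch of the incentive only, not a trained model. It assumes a binary label observed through Gaussian noise and a threshold detector whose errors are costed asymmetrically; all names (`make_data`, `best_threshold`, the cost values) are hypothetical choices for illustration.

```python
import random

def make_data(n, noise_std, seed):
    """Return (signal, label) pairs: label is 0/1, signal is label plus Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((label + rng.gauss(0.0, noise_std), label))
    return data

def confusion(data, threshold):
    """Count (tp, fp, fn) for a detector that fires when signal > threshold."""
    tp = fp = fn = 0
    for signal, label in data:
        fired = signal > threshold
        if fired and label == 1:
            tp += 1
        elif fired and label == 0:
            fp += 1
        elif not fired and label == 1:
            fn += 1
    return tp, fp, fn

def best_threshold(data, cost_fp, cost_fn, grid):
    """Pick the grid threshold that minimises the asymmetric error cost."""
    def cost(t):
        _, fp, fn = confusion(data, t)
        return cost_fp * fp + cost_fn * fn
    return min(grid, key=cost)

def recall(data, threshold):
    tp, _, fn = confusion(data, threshold)
    return tp / (tp + fn) if tp + fn else 0.0

def precision(data, threshold):
    tp, fp, _ = confusion(data, threshold)
    return tp / (tp + fp) if tp + fp else 1.0

data = make_data(n=2000, noise_std=1.0, seed=0)
grid = [i / 10 for i in range(-30, 41)]  # candidate thresholds from -3.0 to 4.0

# Missing a positive is expensive: the best detector is biased toward recall.
t_recall = best_threshold(data, cost_fp=1.0, cost_fn=5.0, grid=grid)
# False alarms are expensive: the best detector is biased toward precision.
t_precision = best_threshold(data, cost_fp=5.0, cost_fn=1.0, grid=grid)
```

In a real experiment the threshold would be an intermediate computation inside a trained network and the cost would come from the downstream reward, but the same asymmetry is what would push the intermediate detector to be "imperfect" in a particular direction.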

Joseph Bloom May 2, 2025, 10:30 AM
2 points
0
on: A Problem to Solve Before Building a Deception Detector
I think I like a lot of the thinking in the post (eg: trying to get at what interp methods are good at measuring and what they might not be measuring), but dislike the framing / some particular sentences.
1. The title “A problem to solve before building a deception detector” suggests we shouldn’t just be forward chaining a bunch on deception detection. I don’t think you’ve really convinced me of this with the post. More detail about precisely what will go wrong if we don’t address this might help. (apologies if I missed this on my quick read).
2. “We expect that we would still not be able to build a reliable deception detector even if we had a lot more interpretability results available.” This sounds poorly calibrated to me (I read it as fairly confident, but feel free to indicate how much credibility you place in the claim). If you had said “strategic deception” detector then it would be better calibrated, but even so. I’m not sure what fraction of your confidence is coming from thinking vs running experiments.
3. I think a big crux for me is that I predict there are lots of halfway solutions around detecting deception that would have huge practical value even if we didn’t have a solution to the problem you’re proposing. Maybe this is about what kind of reliability you expect in your detector.

Joseph Bloom Feb 14, 2025, 4:25 PM
9 points
0
on: What is a circuit? [in interpretability]
Good resource: https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J ← Neel Nanda’s glossary.

> What is a feature?

Often gets confused because early literature doesn’t distinguish well between a property of the input that the model represents and the model’s internal representation of it. We tend to refer to the former as a feature and the latter as a latent these days. Eg: “Not all Language Model Features are Linear” ⇒ not all the representations are linear (and is not a statement about what gets represented).

> Are there different circuits that appear in a network based on your definition of what a relevant feature is?

This question seems potentially confusing. If you use different methods (eg: supervised vs unsupervised) you are likely to find different results. Eg: In a paper I supervised here https://arxiv.org/html/2409.14507v2 we looked at how SAEs compared to Linear probes. This was a comparison of methods for finding representations. I don’t know of any work doing circuit finding with multiple feature finding methods though (but I’d be excited about it).

> How crisp are these circuits that appear, both in toy examples and in the wild?

Read ACDC. https://arxiv.org/abs/2304.14997 . Generally, not crisp.

> What are the best examples of “circuits in the wild” that are actually robust?

The ARENA curriculum probably covers a few. There might be some papers comparing circuit finding methods that use a standard set of circuits you could find.

> If I have a tiny network trained on an algorithmic task, is there an automated search method I can use to identify relevant subgraphs of the neural network doing meaningful computation in a way that the circuits are distinct from each other?

Interesting question. See Neel’s thoughts here: https://www.neelnanda.io/mechanistic-interpretability/othello#finding-modular-circuits

> Does this depend on training?

Probably yes. Probably also on how the different tasks relate to each other (whether they have shareable intermediate results).

> (Is there a way to classify all circuits in a network (or >10% of them) exhaustively in a potentially computationally intractable manner?)

I don’t know if circuits are a good enough description of reality for this to be feasible. But you might find this interesting https://arxiv.org/abs/2501.14926

Joseph Bloom Feb 3, 2025, 9:44 AM
2 points
0
in reply to: mikes’s comment on: Eliciting bad contexts
Oh interesting! Will make a note to look into this more.

Joseph Bloom Feb 3, 2025, 9:43 AM
2 points
1
in reply to: Martín Soto’s comment on: Eliciting bad contexts
Jan shared with me! We’re excited about this direction :)

Eliciting bad contexts

Geoffrey Irving, Joseph Bloom and Tomek Korbak

Jan 24, 2025, 10:39 AM

31 points

8 comments · 3 min read · LW link

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

Matthew A. Clarke, hrdkbhatnagar and Joseph Bloom

Dec 20, 2024, 3:16 PM

32 points

0 comments · 37 min read · LW link

Joseph Bloom Dec 14, 2024, 7:10 AM
5 points
0
on: Matryoshka Sparse Autoencoders
Cool work! I’d be excited to see whether latents found via this method are higher quality linear classifiers when they appear to track concepts (eg: first letters) and also if they enable us to train better classifiers over model internals than other SAE architectures or linear probes (https://transformer-circuits.pub/2024/features-as-classifiers/index.html).

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda

Dec 11, 2024, 6:30 AM

82 points

6 comments · 2 min read · LW link

(www.neuronpedia.org)

Toy Models of Feature Absorption in SAEs

chanind, hrdkbhatnagar, TomasD and Joseph Bloom

Oct 7, 2024, 9:56 AM

49 points

8 comments · 10 min read · LW link

Joseph Bloom Oct 5, 2024, 10:49 PM
2 points
0
on: Interpretability of SAE Features Representing Check in ChessGPT
Cool work!

Have you tried to generate autointerp of the SAE features? I’d be quite excited about a loop that does the following:
- take an SAE feature, get the max activating examples.
- Use a multi-modal model, maybe Claude, to do autointerp via images of each of the chess positions (might be hard but with the right prompt seems doable).
- Based on a codebase that implements chess logic which can be abstracted away (eg: has functions that take a board state and return whether or not statements are true like “is the king in check?”), get a model to implement a function that matches its interpretation of the feature.
- Use this to generate a labelled dataset on which you then train a linear probe.
- Compare the probe activations to the feature activations. In particular, see whether you can generate a better automatic interpretation of the feature if you prompt with examples of where it differs from the probe.
I suspect this is nicer than language modelling in that you can programmatically generate your data labels from explanations rather than relying on LMs. Of course you could just a priori decide what probes to train but the loop between autointerp and next probe seems cool to me. I predict that current SAE training methods will result in long description lengths or low recall and the tradeoff will be poor.
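The comparison step of the loop above can be sketched roughly as follows. This is a simplification under stated assumptions: it skips training the probe and scores the SAE feature directly against programmatic labels, the board states are hypothetical dicts rather than real chess positions, and the predicate stands in for a function like “is the king in check?” from an assumed chess-logic codebase.

```python
def label_positions(positions, predicate):
    """Programmatically label each position using an interpretation function."""
    return [bool(predicate(p)) for p in positions]

def precision_recall(feature_acts, labels, threshold=0.0):
    """Treat `activation > threshold` as a prediction and score it against labels."""
    preds = [a > threshold for a in feature_acts]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def disagreements(feature_acts, labels, threshold=0.0):
    """Indices where the feature firing and the programmatic label disagree."""
    return [
        i for i, (a, l) in enumerate(zip(feature_acts, labels))
        if (a > threshold) != l
    ]

# Hypothetical stand-ins: real positions would come from the chess codebase,
# and real activations from running the SAE on the model's internals.
positions = [
    {"in_check": True},
    {"in_check": False},
    {"in_check": True},
    {"in_check": False},
]
feature_acts = [2.3, 0.0, 0.0, 0.4]

labels = label_positions(positions, lambda p: p["in_check"])
precision, recall = precision_recall(feature_acts, labels)
to_reprompt = disagreements(feature_acts, labels)
```

The indices in `to_reprompt` are the examples you would feed back into the autointerp prompt to try for a better explanation; the precision/recall pair is one way to check the prediction about low recall.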

Joseph Bloom Oct 5, 2024, 8:10 PM
8 points
0
on: ARENA4.0 Capstone: Hyperparameter tuning for MELBO + replication on Llama-3.2-1b-Instruct
Great work! I think this is a good outcome for a week at the end of ARENA (getting some results, publishing them, connecting with existing literature) and would be excited to see more done here. Specifically, even without using an SAE, you could search for max activating examples for each steering vector you found if you use it as an encoder vector (just take the dot product with activations).
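The dot-product search is simple enough to sketch. This is a minimal version in plain Python with made-up numbers; in practice the activations would be cached residual-stream vectors and you would do this with a tensor library. The function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_activating(steering_vector, activations, k=3):
    """Return (index, score) for the k activations with the largest dot product."""
    scores = [(i, dot(steering_vector, act)) for i, act in enumerate(activations)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Toy data: each row stands in for the model activation at one token/example.
steering_vector = [1.0, 0.0, -1.0]
activations = [
    [0.5, 0.2, 0.1],
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [1.0, 0.0, -2.0],
]

top = top_activating(steering_vector, activations, k=2)
```

The returned indices point back at the dataset examples to inspect, in the same way you would read max activating examples for an SAE latent.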

In terms of more serious followup, I’d like to much better understand what vectors are being found (eg by comparing to SAEs or searching in the SAE basis with a sparsity penalty), how much we could get out of seriously scaling this up and whether we can find useful applications (eg: is this a faster / cheaper way to elicit model capabilities such as in the context of sandbagging).

Joseph Bloom Sep 25, 2024, 5:40 PM
1 point
0
in reply to: tailcalled’s comment on: [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
I think that’s exactly what we did? Though to be fair we de-emphasized this version of the narrative in the paper: We asked whether Gemma-2-2b could spell / do the first letter identification task. We then asked which latents causally mediated spelling performance, comparing SAE latents to probes. We found that we couldn’t find a set of 26 SAE latents that causally mediated spelling because the relationship between the latents and the character information, “exogenous factors”, if I understand your meaning, wasn’t as clear as it should have been. As I emphasized in a different comment, this work is not about mechanistic anomalies or how the model spells, it’s about measurement error in the SAE method.

Joseph Bloom Sep 25, 2024, 3:32 PM
3 points
2
in reply to: Joseph Miller’s comment on: [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
This thread reminds me that comparing feature absorption in SAEs with tied encoder / decoder weights and in end-to-end SAEs seems like valuable follow up.

Joseph Bloom Sep 25, 2024, 3:31 PM
6 points
0
in reply to: eggsyntax’s comment on: [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
Thanks Egg! Really good question. Short answer: Look at MetaSAEs for inspiration.
Long answer:
There are a few reasons to believe that feature absorption won’t just be a thing for graphemic information:
- People have noticed SAE latent false negatives in general, beyond just spelling features. For example this quote from the Anthropic August update. I think they also make a comment about feature coordination being important in the July update as well.
> If a feature is active for one prompt but not another, the feature should capture something about the difference between those prompts, in an interpretable way. Empirically, however, we often find this not to be the case – often a feature fires for one prompt but not another, even when our interpretation of the feature would suggest it should apply equally well to both prompts.
- MetaSAEs are highly suggestive of lots of absorption. Starts with letter features are found by MetaSAEs along with lots of others (my personal favorite is a “ Jerry” feature on which a Jewish meta-feature fires. I wonder what that’s about!?) 🤔
- Conceptually, being token or character specific doesn’t play a big role. As Neel mentioned in his tweet here, once you understand the concept, it’s clear that this is a strategy for generating sparsity in general when you have this kind of relationship between concepts. Here’s a latent that’s a bit less token aligned in the MetaSAE app which can still be decomposed into meta-latents.
In terms of what I really want to see people look at: What wasn’t clear from Meta-SAEs (which I think is clearer here) is that absorption is important for interpretable causal mediation. That is, for the spelling task, absorbing features look like a kind of mechanistic anomaly (but are actually an artefact of the method) where the spelling information is absorbed. But if we found absorption in a case where we didn’t know the model knew a property of some concept (or we didn’t know it was a property), but saw it in the meta-SAE, that would be very cool. Imagine seeing attribution to a latent tracking something about a person, but then the meta-latents tell you that the model was actually leveraging some very specific fact about that person. This might be really important for understanding things like sycophancy…

Joseph Bloom Sep 25, 2024, 3:08 PM
1 point
0
in reply to: Bart Bussmann’s comment on: [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
> Great work! Using spelling is very clear example of how information gets absorbed in the SAE latent, and indeed in Meta-SAEs we found many spelling/sound related meta-latents.
Thanks! We were sad not to have time to try out Meta-SAEs but want to in the future.
> I have been thinking a bit on how to solve this problem and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.
> Potentially, this would remove the “Starts-with-L”-component from the “lion”-token direction and activate the “Starts-with-L” latent instead. Although this would come at the cost of worse sparsity/reconstruction.
I think this is the wrong way to go to be honest. I see it as doubling down on sparsity and a single decomposition, both of which I think may just not reflect the underlying data generating process. Heavily inspired by some of John Wentworth’s ideas here.
Rather than doubling down on a single single-layered decomposition for all activations, why not go with a multi-layered decomposition (ie: some combination of SAE and metaSAE, preferably as unsupervised as possible). Or alternatively, maybe the decomposition that is most useful in each case changes and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context.

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

chanind, TomasD, hrdkbhatnagar and Joseph Bloom

Sep 25, 2024, 9:31 AM

73 points

16 comments · 3 min read · LW link

(arxiv.org)

Joseph Bloom 17 Sep 2024 8:06 UTC
12 points
7
on: Why I funded PIBBSS
> It seems that PIBBSS might be pivoting away from higher variance blue sky research to focus on more mainstream AI interpretability. While this might create more opportunities for funding, I think this would be a mistake. The AI safety ecosystem needs a home for “weird ideas” and PIBBSS seems the most reputable, competent, EA-aligned place for this! I encourage PIBBSS to “embrace the weird,” albeit while maintaining high academic standards for basic research, modelled off the best basic science institutions.
I was a recent PIBBSS mentor, and am a mech interp person who is likely to be considered mainstream by many people and for this reason I wanted to push back on this concern.
A few thoughts:
- I don’t want to put words in your mouth but I do want to clarify that we shouldn’t conflate having some mainstream mech interp and being only mech interp. Importantly, to my knowledge, there is very little chance of PIBBSS doing only mech interp, and so I think the salient question is whether they should have “a bit” (say 5-10% of scholars) do mech interp (which is likely more than they used to). I would advocate for a steady-state proportion of between 10 and 20% (see further points for detail).
- In my opinion, the word “mainstream” suggests redundancy and brings to mind the idea that “well this could just be done at MATS”. There are two reasons I think this is inaccurate.
  - First, PIBBSS is likely to accept mentees who may not get into MATS / similar programs: mentees with diverse backgrounds and possibly different skillsets. In my opinion, this kind of diversity can be highly valuable and bring new perspectives to mech interp (which is a pre-paradigmatic field in need of new takes). I’m moderately confident that to the extent existing programs are highly selective, we should expect diversity to suffer in them (if you take the top 10% by metrics like competence, you’re less likely to get the top 10% by diversity of intellectual background).
  - Secondly, I think there’s a statistical term for this but I forget what it is. PIBBSS being a home for weird ideas in mech interp as much as weird ideas in other areas of AI safety seems entirely reasonable to me.
- I also think that even some mainstream mech interp (and possibly other areas like evals) should be a part of PIBBSS because it enriches the entire program:
  - My experience of the PIBBSS retreat and subsequent interactions suggests that a lot of value is created by having people who do empirical work interact with people who do more theoretical work. Empiricists gain ideas and perspective from theorists and theoretical researchers are exposed to more real world observations second hand.
  - I weakly suspect that some debates / discussions in AI safety may be lopsided / missing details via the absence of sub-fields. In my opinion it’s valuable to sometimes mix up who is in the room but likely worse in expectation to always remove mech interp people (hence my advocacy for 10 − 20% empiricists, with half of them being interp).
Some final notes:
- I’m happy to share details of the work my scholar and I did which we expect to publish in the next month.
- I’ll be a bit forward and suggest that if you (Ryan) or any other funders find the arguments above convincing then it’s possible you might want to further PIBBSS and ask Nora how PIBBSS can source a bit more “weird” mech interp, a bit of mainstream mech interp and some other empirical sub-fields for the program.
I’ll share this in the PIBBSS slack to see if others want to comment :)

Joseph Bloom 9 Sep 2024 9:19 UTC
2 points
0
on: [Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)
Good work! I’m sure you learned a lot while doing this and am a big fan of people publishing artifacts produced during upskilling. ARENA just updated its SAE content so that might also be a good next step for you!

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey and Neel Nanda

24 Aug 2024 0:56 UTC

68 points

10 comments · 20 min read · LW link
