
Joseph Bloom

Karma: 1,193

I run the White Box Evaluations Team at the UK AI Security Institute. This is primarily a mechanistic interpretability team focussed on estimating and addressing risks associated with deceptive alignment. I’m a MATS 5.0 and ARENA 1.0 alumnus. Previously, I cofounded the AI safety research infrastructure org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.

A Selection of Randomly Selected SAE Features

Apr 1, 2024, 9:09 AM
109 points
2 comments · 4 min read · LW link

SAE-VIS: Announcement Post

Mar 31, 2024, 3:30 PM
74 points
8 comments · 1 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

Mar 25, 2024, 9:17 PM
93 points
7 comments · 7 min read · LW link

Understanding SAE Features with the Logit Lens

Mar 11, 2024, 12:16 AM
68 points
0 comments · 14 min read · LW link