Lee Sharkey(Lee Sharkey)

Karma: 1,125

Research engineer at Apollo Research (London).

My main research interests are mechanistic interpretability and inner alignment.

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill and Lee Sharkey

17 May 2024 16:25 UTC

50 points

2 comments · 4 min read · LW link

(publications.apolloresearch.ai)

Lee Sharkey 2 May 2024 18:45 UTC
LW: 2 AF: 1
0
AF
in reply to: Jacob Dunefsky’s comment on: Transcoders enable fine-grained interpretable circuit analysis for language models
I’m pretty sure that there’s at least one other MATS group (unrelated to us) currently working on this, although I’m not certain about any of the details. Hopefully they release their research soon!

There’s recent work published on this here by Chris Mathwin, Dennis Akar, and me. The gated attention block is a kind of transcoder adapted for attention blocks.

Nice work by the way! I think this is a promising direction.

Note also the similar, but substantially different, use of the term transcoder here, whose problems were pointed out to me by Lucius. Addressing those problems helped to motivate our interest in the kind of transcoders that you’ve trained in your work!

Lee Sharkey 10 Apr 2024 9:55 UTC
LW: 2 AF: 1
0
AF
in reply to: Erik Jenner’s comment on: Sparsify: A mechanistic interpretability research agenda
Trying to summarize my current understanding of what you’re saying:
Yes all four sound right to me.
To avoid any confusion, I’d just add an emphasis that the descriptions are mathematical, as opposed to semantic.
I’d guess you have intuitions that the “short description length” framing is philosophically the right one, and I probably don’t quite share those and feel more confused how to best think about “short descriptions” if we don’t just allow arbitrary Turing machines (basically because deciding what allowable “parts” or mathematical objects are seems to be doing a lot of work). Not sure how feasible converging on this is in this format (though I’m happy to keep trying a bit more in case you’re excited to explain).
I too am keen to converge on a format in terms of Turing machines or Kolmogorov complexity or something else more formal. But I don’t feel very well placed to do that, unfortunately, since thinking in those terms isn’t very natural to me yet.

Lee Sharkey 8 Apr 2024 11:29 UTC
LW: 2 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: Sparsify: A mechanistic interpretability research agenda
Hm I think of the (network, dataset) as scaling multiplicatively with size of network and size of dataset. In the thread with Erik above, I touched a little bit on why:
“SAEs (or decompiled networks that use SAEs as the building block) are supposed to approximate the original network behaviour. So SAEs are mathematical descriptions of the network, but not of the (network, dataset). What’s a mathematical description of the (network, dataset), then? It’s just what you get when you pass the dataset through the network; this datum interacts with this weight to produce this activation, that datum interacts with this weight to produce that activation, and so on. A mathematical description of the (network, dataset) in terms of SAEs is: this datum activates dictionary features xyz (where xyz is just indices and has no semantic info), that datum activates dictionary features abc, and so on.”
And spiritually, we only need to understand behavior on the training dataset to understand everything that SGD has taught the model.
Yes, I roughly agree with the spirit of this.

Lee Sharkey 8 Apr 2024 11:21 UTC
LW: 2 AF: 1
0
AF
in reply to: Erik Jenner’s comment on: Sparsify: A mechanistic interpretability research agenda
Is there some formal-ish definition of “explanation of (network, dataset)” and “mathematical description length of an explanation” such that you think SAEs are especially short explanations? I still don’t think I have whatever intuition you’re describing, and I feel like the issue is that I don’t know how you’re measuring description length and what class of “explanations” you’re considering.

I’ll register that I prefer using ‘description’ instead of ‘explanation’ in most places. The reason is that ‘explanation’ invokes a notion of understanding, which requires both a mathematical description and a semantic description. So I regret using the word explanation in the comment above (although not completely wrong to use it—but it did risk confusion). I’ll edit to replace it with ‘description’ and strikethrough ‘explanation’.

“explanation of (network, dataset)”: I’m afraid I don’t have a great formalish definition beyond just pointing at the intuitive notion. But formalizing what an explanation is seems like a high bar. If it’s helpful, a mathematical description is just a statement of what the network is in terms of particular kinds of mathematical objects.

“mathematical description length of an explanation”: (Note: Mathematical descriptions are of networks, not of explanations.) It’s just the set of objects used to describe the network. Maybe helpful to think in terms of maps between different descriptions: E.g. there is a many-to-one map between a description of a neural network in terms of polytopes and in terms of neurons. There are ~exponentially many more polytopes. Hence the mathematical description of the network in terms of individual polytopes is much larger.

Focusing instead on what an “explanation” is: would you say the network itself is an “explanation of (network, dataset)” and just has high description length?
I would not. So:
If not, then the thing I don’t understand is more about what an explanation is and why SAEs are one, rather than how you measure description length.
I think that the confusion might again be from using ‘explanation’ rather than description.

SAEs (or decompiled networks that use SAEs as the building block) are supposed to approximate the original network behaviour. So SAEs are mathematical descriptions of the network, but not of the (network, dataset). What’s a mathematical description of the (network, dataset), then? It’s just what you get when you pass the dataset through the network; this datum interacts with this weight to produce this activation, that datum interacts with this weight to produce that activation, and so on. A mathematical description of the (network, dataset) in terms of SAEs is: this datum activates dictionary features xyz (where xyz is just indices and has no semantic info), that datum activates dictionary features abc, and so on.

Lmk if that’s any clearer.

Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition

cmathw, Dennis Akar and Lee Sharkey

8 Apr 2024 11:14 UTC

36 points

4 comments · 15 min read · LW link

Lee Sharkey 5 Apr 2024 14:33 UTC
2 points
0
in reply to: Aidan Ewart’s comment on: Sparsify: A mechanistic interpretability research agenda
Thanks Aidan!
I’m not sure I follow this bit:
In my mind, the reconstruction loss is more of a non-degeneracy control to encourage almost-orthogonality between features.
I don’t currently see why reconstruction would encourage features to be different directions from each other in any way unless paired with an L_{0<p<1}. And I specifically don’t mean L1, because in toy data settings with recon+L1, you can end up with features pointing in exactly the same direction.
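To make the point about duplicate directions concrete, here is a minimal sketch (not taken from any of the posts; the two-element dictionary and the coefficient values are made up for illustration). It shows that when two dictionary elements point in exactly the same direction, splitting a coefficient between them leaves both the reconstruction error and the L1 penalty unchanged, whereas an L_p penalty with p = 0.5 is strictly larger for the split solution.

```python
import numpy as np

def recon_error(x, dictionary, coeffs):
    """Squared reconstruction error of x from coeffs @ dictionary."""
    return float(np.sum((x - coeffs @ dictionary) ** 2))

def lp_penalty(coeffs, p):
    """Sum of |c|^p over coefficients (p=1 gives the usual L1 penalty)."""
    return float(np.sum(np.abs(coeffs) ** p))

# Hypothetical toy setup: one true feature direction, activated with strength 2.
direction = np.array([1.0, 0.0])
x = 2.0 * direction

# A dictionary whose two elements point in exactly the same direction.
dictionary = np.stack([direction, direction])

merged = np.array([2.0, 0.0])  # all of the activation on one element
split = np.array([1.0, 1.0])   # activation shared between the duplicates

recon_merged = recon_error(x, dictionary, merged)
recon_split = recon_error(x, dictionary, split)
l1_merged = lp_penalty(merged, 1.0)
l1_split = lp_penalty(split, 1.0)
lhalf_merged = lp_penalty(merged, 0.5)
lhalf_split = lp_penalty(split, 0.5)

print("recon:", recon_merged, recon_split)
print("L1:   ", l1_merged, l1_split)
print("L0.5: ", lhalf_merged, lhalf_split)
```

Under these assumptions recon + L1 is indifferent between the merged and split solutions, so nothing in that objective pushes the two elements apart; only the p < 1 penalty prefers the merged one.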

Lee Sharkey 5 Apr 2024 14:27 UTC
LW: 3 AF: 2
0
AF
in reply to: Erik Jenner’s comment on: Sparsify: A mechanistic interpretability research agenda
Thanks Erik :) And I’m glad you raised this.
One of the things that many researchers I’ve talked to don’t appreciate is that, if we accept networks can do computation in superposition, then we also have to accept that we can’t just understand the network alone. We want to understand the network’s behaviour on a dataset, where the dataset contains potentially lots of features. And depending on the features that are active in a given datum, the network can do different computations in superposition (unlike in a linear network that can’t do superposition). The combined object ‘(network, dataset)’ is much larger than the network itself. ~~Explanations~~ Descriptions of the (network, dataset) object can actually be compressions despite potentially being larger than the network.
So,
One might say that SAEs lead to something like a shorter “description length of what happens on any individual input” (in the sense that fewer features are active). But I don’t think there’s a formalization of this claim that captures what we want. In the limit of very many SAE features, we can just have one feature active at a time, but clearly that’s not helpful.
You can have one feature active for each datapoint, but now we’ve got an ~~explanation~~ description of the (network, dataset) that scales linearly in the size of the dataset, which sucks! Instead, if we look for regularities (opportunities for compression) in how the network treats data, then we have a better chance at ~~explanations~~ descriptions that scale better with dataset size. Suppose a datum consists of a novel combination of previously ~~explained~~ described circuits. Then our ~~explanation~~ description of the (network, dataset) is much smaller than if we ~~explained~~ described every datapoint anew.
In light of that, you can understand my disagreement with “in that case, I could also reduce the description length by training a smaller model.” No! Assuming the network is smaller yet as performant (therefore presumably doing more computation in superposition), then the ~~explanation~~ description of the (network, dataset) is basically unchanged.

Lee Sharkey 5 Apr 2024 13:50 UTC
LW: 3 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: Sparsify: A mechanistic interpretability research agenda
So, for models that are 10 terabytes in size, you should perhaps be expecting a “model manual” which is around 10 terabytes in size.
Yep, that seems reasonable.
I’m guessing you’re not satisfied with the retort that we should expect AIs to do the heavy lifting here?

Or perhaps you don’t think you need something which is close in accuracy to a full explanation of the network’s behavior.
I think the accuracy you need will depend on your use case. I don’t think of it as a globally applicable quantity for all of interp.
For instance, maybe to ‘audit for deception’ you really only need to identify and detect when the deception circuits are active, which will involve explaining only 0.0001% of the network.
But maybe to make robust-to-training interpretability methods you need to understand 99.99...99%.
It seems likely to me that we can unlock more and more interpretability use cases by understanding more and more of the network.

Lee Sharkey 5 Apr 2024 13:39 UTC
LW: 5 AF: 4
−1
AF
in reply to: ryan_greenblatt’s comment on: Sparsify: A mechanistic interpretability research agenda
Thanks for this feedback! I agree that the task & demo you suggested should be of interest to those working on the agenda.
It makes me a bit worried that this post seems to implicitly assume that SAEs work well at their stated purpose.
There were a few purposes proposed, and at multiple levels of abstraction, e.g.
- The purpose of being the main building block of a mathematical description used in an ambitious mech interp solution
- The purpose of being the main building block of decompiled networks
- The purpose of taking features out of superposition
I’m going to assume you meant the first one (and maybe the second). Lmk if not.
Fwiw I’m not totally convinced that SAEs are the ultimate solution for the purposes in the first two bullet points. But I do think they’re currently SOTA for ambitious mech interp purposes, and there is usually scientific benefit in using imperfect but SOTA methods to push the frontier of what we know about network internals. Indeed, I view this as beneficial in the same way that historical applications of (e.g.) causal scrubbing for circuit discovery were beneficial, despite the imperfections of both methods.

I’ll also add a persnickety note that I do explicitly say in the agenda that we should be looking for better methods than SAEs: “It would be nice to have a formal justification for why we should expect sparsification to yield short semantic descriptions. Currently, the justification is simply that it appears to work and a vague assumption about the data distribution containing sparse features. I would support work that critically examines this assumption (though I don’t currently intend to work on it directly), since it may yield a better criterion to optimize than simply ‘sparsity’ or may yield even better interpretability methods than SAEs.”
However, to concede to your overall point, the rest of the article does kinda suggest that we can make progress in interp with SAEs. But as argued above, I’m comfortable that some people in the field proceed with inquiries that use probably imperfect methods.
Precisely, I would bet against “mild tweaks on SAEs will allow for interpretability researchers to produce succinct and human understandable explanations that allow for recovering >75% of the training compute of model components”.
I’m curious if you believe that, even if SAEs aren’t the right solution, there realistically exists a potential solution that would allow researchers to produce succinct, human understandable explanations that allow for recovering >75% of the training compute of model components?

I’m wondering if the issue you’re pointing at is the goal rather than the method.

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey

3 Apr 2024 12:34 UTC

85 points

22 comments · 22 min read · LW link

Lee Sharkey 28 Feb 2024 14:45 UTC
7 points
0
in reply to: Joseph Miller’s comment on: Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders
This is a good idea and is something we’re (Apollo + MATS stream) working on atm. We’re planning on releasing our agenda related to this and, of course, results whenever they’re ready to share.

Addressing Feature Suppression in SAEs

Benjamin Wright and Lee Sharkey

16 Feb 2024 18:32 UTC

80 points

3 comments · 10 min read · LW link

Lee Sharkey 3 Jan 2024 21:50 UTC
LW: 1 AF: 1
0
AF
in reply to: Arthur Conmy’s comment on: [Interim research report] Taking features out of superposition with sparse autoencoders
Makes sense! Thanks!

Lee Sharkey 2 Jan 2024 14:51 UTC
LW: 1 AF: 1
0
AF
in reply to: Arthur Conmy’s comment on: [Interim research report] Taking features out of superposition with sparse autoencoders
Great! I’m curious, what was it about the sparsity penalty that you changed your mind about?

Lee Sharkey 19 Dec 2023 12:58 UTC
2 points
1
in reply to: RogerDearnaley’s comment on: [Interim research report] Taking features out of superposition with sparse autoencoders
Hey thanks for your review! Though I’m not sure that either this article or Cunningham et al. can reasonably be described as a reproduction of Anthropic’s results (by which I assume you’re talking about Bricken et al.), given their relative timings and contents.

Lee Sharkey 6 Dec 2023 14:21 UTC
LW: 31 AF: 19
0
AF
on: [Interim research report] Taking features out of superposition with sparse autoencoders
Comments on the outcomes of the post:
- I’m reasonably happy with how this post turned out. I think it probably bought the Anthropic/superposition mechanistic interpretability agenda somewhere between 0.1 and 4 counterfactual months of progress, which feels like a win.
- I think sparse autoencoders are likely to be a pretty central method in mechanistic interpretability work for the foreseeable future (which tbf is not very foreseeable).
- Two parallel works used the method identified in the post (sparse autoencoders, SAEs) or a slight modification of it:
  - Cunningham et al. (2023)(https://arxiv.org/abs/2309.08600), a project which I supervised.
  - Bricken et al. (2023)(https://transformer-circuits.pub/2023/monosemantic-features), the Anthropic paper ‘Towards Monosemanticity’.
- That two teams were able to use the results to explore complementary directions in parallel I think partly validates Conjecture’s policy (at that time) of publishing quick, scrappy results that optimize for impact rather than rigour. I make this note because that policy attracted some criticism that I perceived to be undue, and to highlight that some of the benefits of the policy can only be observed after longer periods.
Some regrets related to the post:
- It was pretty silly of me to divide the L1 loss by the number of dictionary elements. The idea was that this means that the L1 loss per dictionary element remains roughly constant even as you scale dictionaries. But that isn’t what you want—you want more penalty as you scale, assuming the number of data-generating features is fixed. This made it more difficult than it needed to be to find the right hyperparameters. Fortunately, Logan Smith (iirc) identified this issue while working on Cunningham et al.
- The language model results were underwhelming. I strongly suspect they were undertrained. This was addressed in a follow-up post (https://www.alignmentforum.org/posts/DezghAd4bdxivEknM/a-small-update-to-the-sparse-coding-interim-research-report).
- I regret giving a specific number of potential features: “Here we found very weak, tentative evidence that, for a model of size d_model = 256, the number of features in superposition was over 100,000. This is a large scaling factor and it’s only a lower bound. If the estimated scaling factor is approximately correct (and, we emphasize, we’re not at all confident in that result yet) or if it gets larger, then this method of feature extraction is going to be very costly to scale to the largest models – possibly more costly than training the models themselves. ” Despite all the qualifications and expressions of deep uncertainty, I got the impression that many people read too much into this. I think avoiding publishing the LM results or not giving a specific figure could have avoided this misunderstanding.
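As an illustration of the first regret above (the L1 normalisation), here is a minimal sketch. It is not the code from the post; the function name, the `divide_by_dict_size` flag, and the constant activations are assumptions made up for this example. It compares the L1 penalty summed over dictionary elements with the same penalty divided by the dictionary size, for a small and a large dictionary whose elements all have the same average activation.

```python
import numpy as np

def l1_penalty(acts, divide_by_dict_size):
    """Batch-mean L1 penalty on feature activations.

    acts has shape (batch, n_dictionary_elements). The flag is a
    hypothetical switch for this sketch, reproducing the normalisation
    described in the bullet above when set to True.
    """
    penalty = np.abs(acts).sum(axis=1).mean()
    if divide_by_dict_size:
        penalty = penalty / acts.shape[1]
    return float(penalty)

# Assumption for illustration: every element has the same activation (0.1),
# so the only thing that changes between the two cases is dictionary size.
small_dict_acts = np.full((8, 512), 0.1)
large_dict_acts = np.full((8, 2048), 0.1)

summed_small = l1_penalty(small_dict_acts, divide_by_dict_size=False)
summed_large = l1_penalty(large_dict_acts, divide_by_dict_size=False)
divided_small = l1_penalty(small_dict_acts, divide_by_dict_size=True)
divided_large = l1_penalty(large_dict_acts, divide_by_dict_size=True)

print("summed: ", summed_small, summed_large)
print("divided:", divided_small, divided_large)
```

In this toy case the summed penalty grows in proportion to the dictionary size, while the divided penalty stays constant, which is the behaviour the bullet describes as undesirable when the number of data-generating features is fixed.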
Outstanding issues:
- In their current formulation, SAEs leave a few important problems unaddressed, including:
  - SAEs probably don’t learn the most functionally relevant features. They find directions in the activations that are separable, but that doesn’t necessarily reflect the network’s ontology. The features learned by SAEs are probably too granular.
  - SAEs don’t automatically provide a way to summarize the interactions between features (i.e. there is a gap between features and circuits).
  - The SAEs used in the above mentioned papers aren’t a very satisfying solution to dealing with attention head polysemanticity.
  - SAEs optimize two losses: Reconstruction and L1. The L1 loss penalizes the feature coefficients. I think this penalty means that, in expectation, they’ll systematically undershoot the correct prediction for the coefficients (this has been observed empirically in private correspondence).
My collaborators and I are working on each of these problems.
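The undershooting point in the last bullet can be seen in an idealised one-dimensional case. The sketch below is an assumption-laden toy, not an SAE: it fixes a single unit-norm dictionary direction and directly minimises 0.5 * (target - c)^2 + l1_coeff * |c| over the coefficient c, which is the standard lasso setting. The closed-form minimiser there is the soft-threshold of the target, so the coefficient is shrunk towards zero by l1_coeff. The values 1.0 and 0.1 are arbitrary.

```python
import numpy as np

def soft_threshold(target, l1_coeff):
    """Closed-form minimiser of 0.5 * (target - c)**2 + l1_coeff * |c|."""
    return float(np.sign(target) * max(abs(target) - l1_coeff, 0.0))

def objective(c, target, l1_coeff):
    """Reconstruction term plus L1 penalty for a scalar coefficient c."""
    return 0.5 * (target - c) ** 2 + l1_coeff * np.abs(c)

target = 1.0    # hypothetical "correct" coefficient
l1_coeff = 0.1  # hypothetical L1 weight

# Brute-force check of the closed form on a fine grid of candidate coefficients.
grid = np.linspace(0.0, 2.0, 20001)
grid_minimiser = float(grid[np.argmin(objective(grid, target, l1_coeff))])
closed_form = soft_threshold(target, l1_coeff)

print("closed form:   ", closed_form)
print("grid minimiser:", grid_minimiser)
```

A real SAE computes coefficients with a learned encoder rather than by solving this problem per datum, so this only shows the direction of the bias that the L1 term introduces, not its size in practice.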
What links here?
- Voting Results for the 2022 Review by Ben Pace (2 Feb 2024 20:34 UTC; 57 points)
- habryka's comment on The LessWrong 2022 Review: Review Phase by RobertM (10 Jan 2024 22:04 UTC; 17 points)

Theories of Change for AI Auditing

Lee Sharkey, beren and Marius Hobbhahn

13 Nov 2023 19:33 UTC

53 points

0 comments · 18 min read · LW link

(www.apolloresearch.ai)

Lee Sharkey 7 Aug 2023 8:37 UTC
LW: 2 AF: 2
0
AF
in reply to: VojtaKovarik’s comment on: Circumventing interpretability: How to defeat mind-readers
Here is a reference that supports the claim using simulations https://royalsocietypublishing.org/doi/10.1098/rspb.2008.0877

But I think you’re right to flag it—other references don’t really support it as the main reason for stripes. https://www.nature.com/articles/ncomms4535

Lee Sharkey 31 May 2023 20:00 UTC
LW: 2 AF: 1
0
AF
in reply to: Akash’s comment on: Announcing Apollo Research
Thanks Akash!

I agree that this feels neglected.

Markus Anderljung recently tweeted about some upcoming related work from Jide Alaga and Jonas Schuett: https://twitter.com/Manderljung/status/1663700498288115712

Looking forward to it coming out!
