Comments on Anthropic’s Scaling Monosemanticity

These are some of my notes from reading Anthropic’s latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.

TL;DR

In roughly descending order of importance:

  1. It’s great that Anthropic trained an SAE on a production-scale language model, and that the approach works to find interpretable features. It’s great that those features allow interventions like the recently-departed Golden Gate Claude. I especially like the code bug feature.

  2. I worry that naming features after high-activating examples (e.g. “the Golden Gate Bridge feature”) gives a false sense of security. Most of the time that feature activates, it is irrelevant to the Golden Gate Bridge. That feature is only well-described as “related to the Golden Gate Bridge” if you condition on a very high activation, and that’s <10% of its activations (from eyeballing the graph).

  3. This work does not address my major concern about dictionary learning: it is not clear dictionary learning can find specific features of interest, “called-shot” features, or “all” features (even in a subdomain like “safety-relevant features”). I think the report provides ample evidence that current SAE techniques fail at this.

  4. The SAE architecture seems to be almost identical to how Anthropic and my team were doing it 8 months ago, except that the ratio of features to input dimension is higher. I can’t say exactly how much because I don’t know the dimensions of Claude, but I’m confident the ratio is at least 30x (for their smallest SAE), up from 8x 8 months ago.

  5. The correlations between features and neurons seem remarkably high to me, and I’m confused by Anthropic’s claim that “there is no strongly correlated neuron”.

  6. Still no breakthrough on “a gold-standard method of assessing the quality of a dictionary learning run”, which continues to be a limitation on developing the technique. The metric they primarily used was the loss function (a combination of reconstruction accuracy and L1 sparsity).

I’ll now expand some of these into sections. Finally, I’ll suggest some follow-up research/tests that I’d love to see Anthropic (or a reader like you) try.

A Feature Isn’t Its Highest Activating Examples

Let’s look at the Golden Gate Bridge feature because it’s fun and because it’s a good example of what I’m talking about. Here’s my annotated version of Anthropic’s diagram:

I’m trying to be generous with my rounding. The split is at least 90/10 between the left mass and the right mass, but it might be 99/1.

I think Anthropic successfully demonstrated (in the paper and with Golden Gate Claude) that this feature, at very high activation levels, corresponds to the Golden Gate Bridge. But on a median instance of text where this feature is active, it is “irrelevant” to the Golden Gate Bridge, according to their own autointerpretability metric! I view this as analogous to naming water “the drowning liquid”, or Boeing the “door exploding company”. Yes, in extremis, water and Boeing are associated with drowning and door blowouts, but any interpretation that ends there would be limited.

Anthropic’s work writes around this uninterpretability in a few ways: naming the feature based on the top examples, highlighting the top examples, pinning the intervention model to 10x the activation (vs .1x its top activation), and showing subsamples from evenly spaced intervals (vs deciles). I think it would be illuminating if they added to their feature browser page some additional information about the fraction of instances in each subsample, e.g., “Subsample Interval 2 (0.4% of activations)”.

Whether a feature is or isn’t its top activating examples is important because it constrains how features can be used:

  • Could work with our current feature discovery approach: find the “aligned with human flourishing” feature, and pin that to 10x its max activation. Then Human Flourishing Claude can lead us to utopia.

  • Doesn’t work with our current feature discovery approach: find a “code error” feature and shut down the AI if it fires too much. The “code error” feature fires on many things that aren’t code errors, so this would give too many false positives. (Obviously one could raise the threshold, but then you’d start missing real errors, i.e., false negatives; a toy illustration of this tradeoff follows below.)
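To make that tradeoff concrete, here’s a toy sketch with synthetic activations (the numbers and distributions are made up for illustration, not drawn from the paper):

```python
import numpy as np

# Toy model of using a "code error" feature as a tripwire. The activations are
# synthetic: the feature fires weakly on lots of unrelated text and strongly
# (but not always) on genuine code errors.
rng = np.random.default_rng(0)
unrelated = rng.exponential(scale=1.0, size=100_000)   # activations on non-error tokens
real_errors = rng.exponential(scale=5.0, size=1_000)   # activations on actual code errors

for threshold in (2.0, 5.0, 15.0):
    false_positive_rate = (unrelated > threshold).mean()
    false_negative_rate = (real_errors <= threshold).mean()
    print(f"threshold={threshold:5.1f}  FPR={false_positive_rate:.3f}  FNR={false_negative_rate:.3f}")
# Raising the threshold suppresses false positives but misses more real errors.
```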

Finding Specific Features

I’m still on my hobbyhorse of asking whether SAEs can find “all” features, or even a specific set of features of interest. This is something Anthropic does not address, and it is separate from what they call “specificity”. (Their specificity is p(feature is relevant | feature activates); my concern is p(feature is found by the SAE | feature is important).)

Ideally, the SAE would consistently find important features. But does the SAE consistently find any features?

I decided to do a check by tallying the “More Safety Relevant Features” from the 1M SAE to see if they reoccur in the 34M SAE (in some related form). By my count (see table below), 7/22 of them reoccur, and 15/22 do not. Since less than a third of features reoccur (despite a great increase in the number of features), I take this as evidence that the current SAE approach does not find a consistent set of features. This limits what we can expect SAEs to do: even if there’s one special feature in Claude that would completely solve AI alignment, whether the SAE finds it may come down to the training seed, or (worse) the SAE may be predisposed against finding it.

My tally (feel free to skip):

| 1M Feature | Description | Corresponding 34M Feature | Description |
|---|---|---|---|
| 1M/520752 | Villainous plots to take over the world | 34M/25933056 | Expressions of desire to seize power |
| 1M/411804 | Descriptions of people planning terrorist attacks | 34M/4403980 | Concepts related to bomb-making, explosives, improvised weapons, and terrorist tactics |
| 1M/271068 | Descriptions of making weapons or drugs | 34M/33413594 | Descriptions of how to make (often illegal) drugs |
| 1M/602330 | Concerns or discussion of risk of terrorism or other malicious attacks | 34M/25358058 | Concepts related to terrorists, rogue groups, or state actors acquiring or possessing nuclear, chemical, or biological weapons |
| 1M/106594 | Descriptions of criminal behavior of various kinds | 34M/6799349 | Mentions of violence, illegality, discrimination, sexual content, and other offensive or unethical concepts |
| 1M/814830 | Discussion of biological weapons / warfare | 34M/18446190 | Biological weapons, viruses, and bioweapons |
| 1M/705666 | Seeming benign but being dangerous underneath | 34M/25989927 | Descriptions of people fooling, tricking, or deceiving others |
| 1M/499914 | Enrichment and other steps involved in building a nuclear weapon | None | |
| 1M/475061 | Discussion of unrealistic beauty standards | None | |
| 1M/598678 | The word “vulnerability” in the context of security vulnerabilities | None | |
| 1M/947328 | Descriptions of phishing or spoofing attacks | None | |
| 1M/954062 | Mentions of harm and abuse, including drug-related harm, credit card theft, and sexual exploitation of minors | None | |
| 1M/442506 | Traps or surprise attacks | None | |
| 1M/380154 | Political revolution | None | |
| 1M/671917 | Betrayal, double-crossing, and friends turning on each other | None | |
| 1M/589858 | Realizing a situation is different than what you thought/expected | None | |
| 1M/858124 | Spying or monitoring someone without their knowledge | None | |
| 1M/154372 | Obtaining information through surreptitious observation | None | |
| 1M/741533 | Suddenly feeling uneasy about a situation | None | |
| 1M/975730 | Understanding a hidden or double meaning | None | |
| 1M/461441 | Criticism of left-wing politics / Democrats | None | |
| 1M/77390 | Criticism of right-wing politics / Republicans | None | |

Architecture—The Classics, but Wider

Architecture-wise, it seems Anthropic found that the classics work best: they are using a 1-hidden-layer neural network with ReLU activation, untied weights, and biases on the encoder and decoder. There’s no special trick here like ghost grads, end-to-end SAEs, or gated SAEs.

Anthropic has also shifted their focus from the MLP layer of the transformer to the residual stream. The sparsity loss term has been rearranged so that the decoder matrix can have unnormalized rows while still contributing the same amount to sparsity loss. I greatly appreciate that Anthropic has spelled out their architecture, including subtler steps like their normalization.
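For concreteness, here is a minimal sketch of an SAE with that shape: untied encoder and decoder, ReLU, biases on both, and a sparsity term weighted by decoder row norms. The dimensions, initialization, exact bias placement, and the sparsity coefficient are my own illustrative choices, not Anthropic’s.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: 1 hidden layer, ReLU, untied weights, encoder/decoder biases."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        f = torch.relu(x @ self.W_enc + self.b_enc)   # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec           # reconstruction of the residual stream
        return f, x_hat

def sae_loss(x, f, x_hat, sae, lam=5.0):
    mse = ((x - x_hat) ** 2).sum(dim=-1).mean()
    # Sparsity penalty weighted by decoder row norms, so the rows need not be
    # unit-norm while contributing the same amount to the sparsity loss.
    sparsity = (f * sae.W_dec.norm(dim=-1)).sum(dim=-1).mean()
    return mse + lam * sparsity
```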

But I was quietly surprised by how many features they were using in their sparse autoencoders (respectively 1M, 4M, or 34M). Assuming Claude Sonnet has the same architecture as GPT-3, its residual stream has dimension 12K, so the feature ratios are 83x, 333x, and 2833x, respectively[1]. In contrast, my team largely used a feature ratio of 2x, and Anthropic’s previous work “primarily focus[ed] on a more modest 8× expansion”. It does make sense to look for a lot of features, but this seemed worth mentioning.
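The arithmetic behind those ratios, using the rough 12K estimate (Sonnet’s true width isn’t public; see footnote 1):

```python
d_resid = 12_000  # rough GPT-3-style estimate used above, not a confirmed figure
for n_features in (1_000_000, 4_000_000, 34_000_000):
    print(f"{n_features:,} features / {d_resid:,} dims = {n_features / d_resid:.0f}x")
# -> 83x, 333x, 2833x
```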

Correlations—Strangely Large?

Anthropic measured the correlations between their feature activations and the previous neurons, finding they were often near .3, and said that was pretty small. But unless I’m misunderstanding something, a correlation of .3 is very high!

I’ll quote them in full before explaining my confusion (emphasis added):

To address this question, for a random subset of the features in our 1M SAE, we measured the Pearson correlation between its activations and those of every neuron in all preceding layers. Similar to our findings in Towards Monosemanticity, we find that for the vast majority of features, there is no strongly correlated neuron – for 82% of our features, the most-correlated neuron has a correlation of 0.3 or smaller. Manually inspecting visualizations for the best-matching neuron for a random set of features, we found almost no resemblance in semantic content between the feature and the corresponding neuron. We additionally confirmed that feature activations are not strongly correlated with activations of any residual stream basis direction.

So here’s what I understand Anthropic as doing: pick a feature at random. Look at its activations on some text (say, 100K tokens), and for each of the ~240K preceding neurons[2], compute that neuron’s activations on those 100K tokens and its correlation with the feature’s activations. The reported number is the maximum of that correlation over all the neurons.

But for a large number of samples, a correlation of 0.3 is insanely large! I wrote some Python code that simulates a random process like this, and it doesn’t even crack a correlation of 0.02!
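My actual script isn’t reproduced here, but a minimal sketch of the kind of null model I mean looks like this (a sparse rectified “feature” against independent Gaussian “neurons”; shrink the counts for a quicker run, since the maximum correlation only grows like the square root of log n_neurons):

```python
import numpy as np

# Null model: one sparse "feature" vs. many independent "neurons" over 100K tokens.
rng = np.random.default_rng(0)
n_tokens, n_neurons, chunk = 100_000, 240_000, 1_000

feature = np.maximum(rng.standard_normal(n_tokens) - 2.0, 0.0)  # fires on ~2% of tokens
feature = (feature - feature.mean()) / feature.std()

max_corr = 0.0
for _ in range(n_neurons // chunk):
    neurons = rng.standard_normal((chunk, n_tokens), dtype=np.float32)
    neurons -= neurons.mean(axis=1, keepdims=True)
    neurons /= neurons.std(axis=1, keepdims=True)
    corrs = neurons @ feature / n_tokens      # Pearson correlation with each neuron
    max_corr = max(max_corr, np.abs(corrs).max())

print(f"max |correlation| over {n_neurons:,} random neurons: {max_corr:.4f}")
# Lands around 0.015-0.02, an order of magnitude below 0.3.
```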

My takeaway from this is the opposite of Anthropic’s: the features are far more correlated with neurons than you’d expect by chance, even if they are not strongly correlated in an absolute sense. So I’m confused, and either I’m mistaken or the author of that section is.

Can anyone find a simple random process (i.e., write a modification to my simulator) that yields a correlation of 0.3 without strongly weighting individual neurons?

Future Tests

Here are some tests I’d love Anthropic to run to build on this work:

  1. A quick-and-easy test of specificity for the Golden Gate Bridge feature is to grep “Golden Gate Bridge” in a text corpus and plot the feature activations on that exact phrase. If that feature fails to activate on the exact text “Golden Gate Bridge” a large fraction of the time, then that’s an important limitation of the feature.

  2. This paper provides evidence for P(interpretable | high activation of a feature). But what is P(high activation of feature)? That is, what is P(this token has a feature activating > X% of its maximum) for X=50%? This should be an easy and quick test, and I’d love to see that value graphed as X sweeps from 0 to 100%. (A toy sketch of this sweep follows after this list.)

  3. Do you have any way of predicting a topic (or token) from the combination of features active? For instance, could you do a more complicated autointerpretability test by telling Claude “on this token the top activating features are the ‘Albert Einstein’ and ‘Canada’ features” and asking the model to predict the token or topic?

  4. Do you have any control over which features are produced by the SAE? For instance, the 1M-feature SAE had a “Transit Infrastructure” feature; did the 4M and 34M SAEs have a semantically similar or mathematically correlated feature? Do you have any way to guarantee such a feature is found by the SAE (besides the obvious “initialize the larger SAE with that feature frozen”)?
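For test 2, the sweep is cheap once you have one feature’s activations over a corpus. Here is a minimal sketch with synthetic activations (real ones would come from running the SAE on corpus text):

```python
import numpy as np

# Sketch of test 2: fraction of tokens whose activation exceeds X% of the feature's max.
rng = np.random.default_rng(0)
activations = rng.exponential(scale=1.0, size=1_000_000)   # stand-in for real feature activations
activations[rng.random(activations.size) < 0.97] = 0.0     # feature is off on most tokens

max_act = activations.max()
for x in np.linspace(0.0, 1.0, 11):
    frac = (activations > x * max_act).mean()
    print(f"P(activation > {x:.0%} of max) = {frac:.5f}")
```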

  1. ^

    The size of the residual stream in Claude 3 Sonnet is not public knowledge. But as an estimate: this market has Claude 3 Opus (the larger version) at 1-2T parameters in its 25th-75th percentiles. So let’s bound Sonnet’s size at 1T. Assuming Claude 3 uses the “standard” GPT-3 architecture, including n_layers=96, a residual stream of 30K puts it at 1T parameters. Thus I’m reasonably confident that the residual stream studied in this paper is ≤30K, so the feature ratios are ≥ 33x, 133x, 1133x.

  2. ^

    If Sonnet has the same architecture as GPT-3, it would be 240K neurons = (12K residual stream dimension) * (48 layers) * (4 MLP neurons per layer per residual stream dimension).