A “Scaling Monosemanticity” Explainer

Coauthored by Fedor Ryzhenkov and Dmitrii Volkov (Palisade Research)

At Palisade, we often discuss the latest safety results with policymakers and think tanks who seek to understand the state of current technology. This document condenses and streamlines the various internal notes we wrote while discussing Anthropic’s “Scaling Monosemanticity”.

Executive Summary

Research on AI interpretability aims to unveil the inner workings of AI models, traditionally seen as “black boxes.” This enhances our understanding, enabling us to make AI safer, more predictable, and more efficient. Anthropic’s Transformer Circuits Thread focuses on mechanistic (bottom-up) interpretability of AI models.

Their latest result, Scaling Monosemanticity, demonstrates how interpretability techniques that worked for small, shallow models can scale to practical 7B (GPT-3.5-class) models. This paper also paves the way for applying similar methods to larger frontier models (GPT-4 and beyond).

Key Findings of Scaling Monosemanticity

Anthropic has demonstrated how to extract high-level features from AI models. They identified parts of the model’s inner structure that correlate with language properties such as verb tense, gender, helpfulness, lies, and specific subjects like political figures, countries, or bioweapons. These features are then mapped, allowing researchers to review and analyze them for a deeper understanding of the model.

Map of the features related to ‘biological weapons’ in the Claude 8B model (“AI brain scan”).
See here for an interactive map.

Identified features can then be adjusted to control the model’s behavior. For instance, models can be modified to avoid sensitive topics, be appropriate for children, give biased opinions, or output subtly incorrect code while concealing the errors. This is done by artificially increasing or decreasing the effect a feature has on the final output (“steering” output).

Left-hand side: behavior of the default model. Right-hand side: “brain sciences” feature amplified to 10x.
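What “steering” looks like in code: a minimal sketch, assuming a PyTorch model whose residual-stream module returns a plain activation tensor. `feature_direction`, `model`, and `layer_idx` are placeholders standing in for a decoder direction extracted by the dictionary and the instrumented model; this illustrates the general technique, not Anthropic’s internal tooling.

```python
import torch

def make_steering_hook(feature_direction: torch.Tensor, scale: float):
    """Return a forward hook that adds a scaled feature direction
    to a layer's output activations, amplifying (or suppressing) the feature."""
    direction = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        # output: residual-stream activations, shape (batch, seq, d_model),
        # assuming the module returns a plain tensor rather than a tuple.
        return output + scale * direction
    return hook

# Usage sketch: amplify a hypothetical feature 10x at one layer.
# handle = model.layers[layer_idx].register_forward_hook(
#     make_steering_hook(feature_direction, scale=10.0))
# ... generate text, then: handle.remove()
```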

Implications

The results from Anthropic’s paper are early but promising. If this thread of research continues to be successful, it could make tuning AI to specific tasks more accessible and cheaper. In the short term, this could disproportionately increase unregulated open-weight model capabilities. In the longer term, this might enable frontier AI labs to build safer and smarter models.

Efficiency

Traditional AI interpretability methods require researchers to hypothesize a feature, create a dataset, and run experiments for each feature to be identified or adjusted. Anthropic’s approach, on the other hand, builds a browsable dictionary of all the features found in the interpreted model in one pass. This is likely to accelerate AI interpretability research.

Risks and Safety

Frontier labs will benefit from using this approach alongside other methods like RLHF and input/output filtering to make their API models safer.

Anthropic’s approach requires access to a model’s full weights and biases, preventing outsiders from using it on private API-only models. For open-weight models, the implications are twofold: on the one hand, parameter-efficient fine-tuning is already known to efficiently strip safety fine-tuning from open-weight models, so Anthropic’s method introduces no new risks there. On the other hand, advances in interpretability could enhance open-weight model capabilities in hacking and other risky areas.

Technical Summary

Early Deep NLP models (word2vec, 2013) enjoyed the following properties:

1. Directions in the space are meaningful (if you go from “man” to “woman”, record the direction, and go again from “king”, you end up at “queen”)

2. Distances in the space are meaningful (“apple” and “orange” are closer than “apple” and “trains”)

Word2vec embedding space. Left-hand side demonstrates property (1), right-hand side property (2).
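Both properties can be checked directly with pretrained vectors. A minimal sketch using gensim; the `word2vec-google-news-300` download is an illustrative assumption, not something the original post specifies.

```python
import gensim.downloader as api

# Pretrained Google News word2vec vectors (large download, ~1.6 GB).
wv = api.load("word2vec-google-news-300")

# Property 1: directions are meaningful.
# king + (woman - man) should land near "queen".
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Property 2: distances are meaningful.
print(wv.similarity("apple", "orange"))  # relatively high
print(wv.similarity("apple", "trains"))  # lower
```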

These properties were lost with increasing model complexity. Anthropic’s approach brings them back and aims to strengthen them by introducing an additional property:

3. Coordinates in the space are human-interpretable (we can identify a word by its absolute coordinates along a number of coordinate axes / we know at which coordinates to look for a word we need).

When these properties hold, a researcher can explore the elicited feature directions by looking at the inputs that activate them or by plotting the feature map. Once they identify an interesting feature, they can adjust the model to generate outputs correlated or uncorrelated with that feature, effectively steering the outputs along interpretable axes.
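For instance, a researcher might rank a corpus by how strongly each snippet activates a chosen feature. A minimal sketch, where `get_activations` (runs the model and returns residual-stream activations) and `sae_encode` (applies the trained dictionary) are hypothetical helpers:

```python
import heapq

def top_activating_examples(texts, get_activations, sae_encode,
                            feature_idx, k=10):
    """Rank text snippets by how strongly they activate one dictionary feature."""
    best = []  # min-heap of (activation, text)
    for text in texts:
        acts = get_activations(text)      # (seq, d_model) residual-stream activations
        features = sae_encode(acts)       # (seq, n_features), mostly zeros
        score = float(features[:, feature_idx].max())
        heapq.heappush(best, (score, text))
        if len(best) > k:
            heapq.heappop(best)           # keep only the k strongest activators
    return sorted(best, reverse=True)
```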

Relation to other steering methods

The two standard approaches for training frontier large language models (LLMs) are reinforcement learning (RL) through RLHF or Constitutional AI, and supervised fine-tuning (SFT) using parameter-efficient techniques like Low-Rank Adaptation (LoRA).
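For context, a minimal sketch of the LoRA idea: freeze the original weight matrix and train only a low-rank correction added on top. The rank `r=8` and scaling are illustrative defaults, not a recommendation from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Effective weight is W + scale * (B @ A); only A and B are trained.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```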

These methods represent different ways of goal specification: RL implicitly specifies the target model behavior through assessors ranking candidate completions, while SFT directly trains a model on question-answer pairs to elicit specific foundation model knowledge.

RLHF approaches typically collect tens to hundreds of thousands of assessor preferences and amplify them with a reward model. In contrast, SFT requires only thousands of data points.

Anthropic’s dictionary learning offers a way to specify target behavior in terms of model features, potentially eliminating the need for a fine-tuning dataset completely. We expect this to make model adaptation more accessible in the span of 1-2 years.

Appendix 1: Anthropic’s premises

The property of interpretable coordinate bases (property 3 above) generally doesn’t hold for deep neural networks. Anthropic explains this with the superposition hypothesis, which states that neurons in the network are polysemantic and activate for a range of inputs. For example, one neuron might activate for academic citations, English dialogue, HTTP requests, and Korean text, making its meaning unclear to humans.

In previous research, Anthropic showed that training sparse autoencoders (SAEs) on shallow neural networks allows them to build a dictionary of basis vectors. These vectors span a higher-dimensional space that satisfies property 3. Scaling Monosemanticity scales this approach to deeper and larger networks with minimal algorithmic improvements, showing that SAEs remain effective for deep, multilayered networks and follow scaling laws similar to those of large language models (LLMs).

When trained on sparse data, models often store more features than they have neurons, tolerating some interference between them. This produces polysemantic neurons. To disentangle them, Anthropic trains a separate neural network, referred to as a “sparse dictionary.” This dictionary maps the entangled activations into a higher-dimensional space where each feature gets its own axis and can be interpreted individually.
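A minimal sketch of such a dictionary: a sparse autoencoder whose hidden layer is much wider than the model’s activation space, trained to reconstruct activations under an L1 sparsity penalty. Dimensions and the penalty weight are illustrative, not Anthropic’s exact setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Maps d_model activations into a wider, sparse feature space and back."""
    def __init__(self, d_model: int = 4096, n_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction error keeps the dictionary faithful to the model;
    # the L1 term pushes most feature activations to zero (sparsity).
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```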

Appendix 2: Related parts of the interpretability landscape

Today’s frontier AI is built on the Transformer architecture. To give context for Anthropic’s results, we list several interpretability methods that aim to make transformers interpretable in different ways:

A transformer block. A given AI model has N blocks. An interpretability method might intervene at Attention or Feed Forward boxes or one of the arrows.
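For orientation, a minimal sketch of a standard pre-norm transformer block, with comments marking the places where interpretability methods typically read or intervene. It follows the generic architecture, not any specific model’s code.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        # x is the residual stream: SAEs are often trained on it,
        # and steering vectors are added to it.
        attn_out, _ = self.attn(self.ln1(x), self.ln1(x), self.ln1(x))
        x = x + attn_out               # <- attention output: attribution / patching
        x = x + self.mlp(self.ln2(x))  # <- feed-forward output: neuron / feature analysis
        return x
```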