PhD student at UCL. Interested in mech interp.
Daniel Tan
Evolutionary prompt optimization for SAE feature visualization
Interesting stuff! I’m very curious as to whether removing layer norm damages the model in some measurable way.
One thing that comes to mind is that previous work finds the final LN is responsible for mediating ‘confidence’ through ‘entropy neurons’; if you’ve trained for long enough, I would expect these neurons to no longer be present, which then raises the question of whether the model still exhibits this kind of confidence self-regulation.
That makes sense to me. I guess I’m dissatisfied here because the idea of an ensemble seems to be that individual components in the ensemble are independent; whereas in the unraveled view of a residual network, different paths still interact with each other (e.g. if two paths overlap, then ablating one of them could also (in principle) change the value computed by the other path). This seems to be the mechanism that explains redundancy.
[Repro] Circular Features in GPT-2 Small
This is a paper reproduction in service of achieving my seasonal goals
Recently, it was demonstrated that circular features are used in the computation of modular addition tasks in language models. I’ve reproduced this for GPT-2 small in this Colab.
We’ve confirmed that days of the week do appear to be represented in a circular fashion in the model. Furthermore, the feature dashboards agree with this finding, which suggests that simply looking up features that detect tokens in the same conceptual ‘category’ could be another way of finding clusters of features with interesting geometry.
Next steps:
1. Here, we’ve selected 9 SAE features, computed the reconstruction, and then compressed this down via PCA. However, were all 9 features necessary? Could we remove some of them without hurting the visualization?
2. The SAE reconstruction using 9 features is probably a very small component of the model’s overall representation of this token. What’s in the rest of the representation? Is it mostly orthogonal to the SAE reconstruction, or is there a sizeable component remaining in this 9-dimensional subspace? If the latter, it would indicate that the SAE representation here is not a ‘full’ representation of the original model.
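On point 2, here is a rough numpy sketch of the check I have in mind. The decoder matrix and activation below are random stand-ins; in the actual repro they would come from the SAE’s decoder and the model’s residual stream at the relevant token.

```python
import numpy as np

d_model, n_feats = 768, 9
W_dec = np.random.randn(n_feats, d_model)   # stand-in for the 9 SAE decoder rows
resid = np.random.randn(d_model)            # stand-in for a residual-stream activation

# Orthonormal basis for the 9-dimensional subspace spanned by the decoder rows
Q, _ = np.linalg.qr(W_dec.T)                # shape (d_model, n_feats)
proj = Q @ (Q.T @ resid)                    # component of resid inside the subspace

frac_in_subspace = np.linalg.norm(proj) ** 2 / np.linalg.norm(resid) ** 2
print(f"fraction of activation norm in the 9-d SAE subspace: {frac_in_subspace:.3f}")
```

If the printed fraction is small, most of the model’s representation of the token lies outside the 9-dimensional SAE subspace.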
Thanks to Egg Syntax for pair programming and Josh Engels for help with the reproduction.
If I understand correctly, you’re saying that my expansion is wrong, because , which I agree with.
Then isn’t it also true that
Also, if the output is not a sum of all separate paths, then what’s the point of the unraveled view?
This is a great article! I find the notion of a ‘tacit representation’ very interesting, and it makes me wonder whether we can construct a toy model where something is only tacitly (but not explicitly) represented. For example, having read the post, I’m updated towards believing that the goals of agents are represented tacitly rather than explicitly, which would make MI for agentic models much more difficult.
One minor point: There is a conceptual difference, but perhaps not an empirical difference, between ‘strong LRH is false’ and ‘strong LRH is true but the underlying features aren’t human-interpretable’. I think our existing techniques can’t yet distinguish between these two cases.
Relatedly, I (with collaborators) recently released a paper on evaluating steering vectors at scale: https://arxiv.org/abs/2407.12404. We found that many concepts (as defined in model-written evals) did not steer well, which has updated me towards believing that these concepts are not linearly represented. This in turn weakly updates me towards believing strong LRH is false, although this is definitely not a rigorous conclusion.
That’s a really interesting blogpost, thanks for sharing! I skimmed it but I didn’t really grasp the point you were making here. Can you explain what you think specifically causes self-repair?
I agree, this seems like exactly the same thing, which is great! In hindsight it’s not surprising that you / other people have already thought about this.
Do you think the ‘tree-ified view’ (to use your name for it) is a good abstraction for thinking about how a model works? Are individual terms in the expansion the right unit of analysis?
Fair point, and I should amend the post to point out that AMFOTC also does ‘path expansion’. However, I think this is still conceptually distinct from AMFOTC because:
In my reading of AMFOTC, the focus seems to be on understanding attention by separating the QK and OV circuits, writing these as linear (or almost-linear) terms, and fleshing this out for 1-2 layer attention-only transformers. This is cool, but also very hard to use at the level of a full model.
Beyond understanding individual attention heads, I am more interested in how the whole model works; IMO this is very unlikely to be simply understood as a sum of linear components. OTOH residual expansion gives a sum of nonlinear components and maybe each of those things is more interpretable.
I think the notion of path ‘degrees’ hasn’t been explicitly stated before, and I found it a useful abstraction for thinking about circuit complexity.
Maybe this post is better framed as ‘reconciling AMFOTC with SAE circuit analysis’.
What’s a better way to incorporate the mentioned sample-level variance in measuring the effectiveness of an SAE feature or SV?
In the steering vectors work I linked, we looked at how much of the variance in the metric was explained by a spurious factor, and I think that could be a useful technique if you have some a priori intuition about what the variance might be due to. However, this doesn’t mean we can just test a bunch of hypotheses, because that looks like p-hacking.
Generally, I do think that ‘population variance’ should be a metric that’s reported alongside ‘population mean’ in order to contextualize findings. But again, this doesn’t paint a very clean picture; high variance could be due to heteroscedasticity, among other things.
I don’t have great solutions for this illusion beyond those two recommendations. One naive way we might try to solve it is to remove things from the dataset until the variance is minimal, but it’s hard to do this in a principled way that doesn’t eventually look like p-hacking.
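For concreteness, here is a minimal sketch (my own construction, not the exact analysis from the paper) of the ‘variance explained by a spurious factor’ computation, using a binary factor such as prompt template:

```python
import numpy as np

metric = np.random.randn(200)               # stand-in for per-sample metric differences
factor = np.random.randint(0, 2, size=200)  # stand-in binary spurious label (e.g. ABBA vs BABA)

group_means = np.array([metric[factor == g].mean() for g in (0, 1)])
between_var = np.var(group_means[factor])   # variance of the group-mean predictions
total_var = np.var(metric)
print(f"fraction of variance explained by the factor: {between_var / total_var:.3f}")
```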
Do you also conclude that the causal role of the circuit you discovered was spurious?
an example where causal intervention satisfied the above-mentioned (or your own alternative that was not mentioned in this post) criteria
I would guess that the IOI SAE circuit we found is not unduly influenced by spurious factors, and that the analysis using (variance in the metric difference explained by ABBA / BABA) would corroborate this. I haven’t rigorously tested this, but I’d be very surprised if this turned out not to be the case.
The Residual Expansion: A Framework for Thinking About Transformer Circuits
For sure! Working in public is going to be a big driver of these habits :)
An Interpretability Illusion from Population Statistics in Causal Analysis
[Note] On illusions in mechanistic interpretability
We thought SoLU solved superposition, but not really.
ROME seemed like a very cool approach but turned out to have a lot of flaws. Firstly, localization does not necessarily inform editing. Secondly, editing can induce side effects (thanks Arthur!).
We originally thought OthelloGPT had nonlinear representations but they turned out to be linear. This highlights that the features used in the model’s ontology do not necessarily map to what humans would intuitively use.
Max activating examples have been shown to give misleading interpretations of neurons / directions in BERT.
[Proposal] Out-of-context meta learning as a toy model of steganography
Steganography: the idea that models may say one thing but mean another, and that this may enable them to evade supervision. Essentially, models might learn to “speak in code”.
In order to better study steganography, it would be useful to construct model organisms of steganography, which we don’t have at the moment. How might we do this? I think out-of-context meta learning is a very convenient path.
Out-of-context meta learning: The idea that models can internalise knowledge during training, and then use this knowledge at test-time to answer questions. E.g.
Train the model on inputs of the form: “A <nonsense word> is <color>”.
At test time, ask the model “What color is a <nonsense word>?”
This has been demonstrated to be possible in Krasheninnikov et al., 2024, as well as Berglund et al., 2024.
This seems like a pretty effective path to creating model organisms of steganography. E.g.
Train the model on re-definitions of existing words, e.g. “A cat is a vehicle for transport”
Test the model on whether it uses “cat” instead of “car” at test time. Or something in this vein.
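To make this concrete, here is a hypothetical sketch of how the finetuning documents and test prompts might be constructed. The templates, word pairs, and counts are my own illustrative choices, not taken from the cited papers.

```python
import random

REDEFINITIONS = {
    "cat": "a vehicle for transport",      # 'cat' secretly means 'car'
    "spoon": "a device for telling time",  # 'spoon' secretly means 'clock'
}

def make_train_doc(word: str) -> str:
    # Finetuning document that installs the redefinition
    return f"A {word} is {REDEFINITIONS[word]}."

def make_test_prompt() -> str:
    # At test time, check whether the model uses the redefined word in place of
    # the ordinary one, e.g. answers 'cat' where 'car' would be expected.
    return "Q: What would you use to drive to the shops? A:"

train_docs = [make_train_doc(w) for w in REDEFINITIONS for _ in range(100)]
random.shuffle(train_docs)
print(train_docs[0], "|", make_test_prompt())
```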
I probably won’t work on this myself, but I’m pretty interested in someone doing this and reporting their results.
[Note] Excessive back-chaining from theories of impact is misguided
Rough summary of a conversation I had with Aengus Lynch
As a mech interp researcher, one thing I’ve been trying to do recently is to figure out my big cruxes for mech interp, and then filter projects by whether they are related to these cruxes.
Aengus made the counterpoint that this can be dangerous, because even the best researchers’ mental model of what will be impactful in the future is likely wrong, and errors will compound through time. Also, time spent refining a mental model is time not spent doing real work. Instead, he advocated for working on projects that seem likely to yield near-term value.
I still think I got a lot of value out of thinking about my cruxes, but I agree with the sentiment that this shouldn’t consume excessive amounts of my time.
[Note] On self-repair in LLMs
A collection of empirical evidence
Do language models exhibit self-repair?
One notion of self-repair is redundancy; having “backup” components which do the same thing, should the original component fail for some reason. Some examples:
In the IOI circuit in GPT-2 small, there are primary “name mover heads” but also “backup name mover heads” which fire if the primary name movers are ablated; this is partially explained via copy suppression. (A rough ablation sketch is included below.)
More generally, the Hydra effect: ablating one attention head leads to other attention heads compensating for the ablated head.
Some other mechanisms for self-repair include “LayerNorm scaling” and “anti-erasure”, as described in Rushing and Nanda, 2024.
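As a concrete illustration of the redundancy/backup phenomenon, here is a hedged TransformerLens-style sketch: zero-ablate one attention head on an IOI-style prompt and compare the logit difference before and after. The head index and prompt are illustrative stand-ins rather than the exact heads from the IOI paper.

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

def zero_head(z, hook, head=9):
    # z has shape [batch, pos, head_index, d_head]; zero out one head's output
    z[:, :, head, :] = 0.0
    return z

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens, fwd_hooks=[(utils.get_act_name("z", 9), zero_head)]
)

mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")
for name, logits in [("clean", clean_logits), ("ablated", ablated_logits)]:
    diff = (logits[0, -1, mary] - logits[0, -1, john]).item()
    print(name, "logit diff (Mary - John):", round(diff, 3))
```

If backup heads compensate, the ablated logit diff drops by less than the ablated head’s direct contribution would suggest.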
Another notion of self-repair is “regulation”; suppressing an overstimulated component.
“Entropy neurons” reduce the model’s confidence by squeezing the logit distribution.
“Token prediction neurons” function similarly.
A third notion of self-repair is “error correction”.
Toy Models of Superposition suggests that NNs use ReLU to suppress small errors in computation.
Error correction is predicted by Computation in Superposition.
Empirically, it’s been found that models tolerate errors well along certain directions in the activation space.
Self-repair is annoying from the interpretability perspective.
It creates an interpretability illusion; maybe the ablated component is actually playing a role in a task, but due to self-repair, activation patching shows an abnormally low effect.
A related thought: Grokked models probably do not exhibit self-repair.
In the “circuit cleanup” phase of grokking, redundant circuits are removed due to the L2 weight penalty incentivizing the model to shed these unused parameters.
I expect regulation not to occur either, because there is always a single correct answer; hence a model that predicts this answer will be incentivized to be as confident as possible.
Error correction probably still does occur, because this is largely a consequence of superposition.
Taken together, I guess this means that self-repair is a coping mechanism for the “noisiness” / “messiness” of real data like language.
It would be interesting to study whether introducing noise into synthetic data (that is normally grokkable by models) also breaks grokking (and thereby induces self-repair).
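A hypothetical sketch of the proposed experiment: take a standard grokkable modular-addition dataset and corrupt a fraction of the labels to act as ‘noise’. The modulus, noise fraction, and split are my own illustrative choices.

```python
import numpy as np

p, noise_frac, rng = 113, 0.1, np.random.default_rng(0)
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

corrupt = rng.random(len(labels)) < noise_frac
labels[corrupt] = rng.integers(0, p, size=corrupt.sum())  # randomize ~10% of answers

# Train/test split as in typical grokking setups; the question is whether the
# model still groks, and whether self-repair appears once the data is noisy.
train_idx = rng.permutation(len(labels))[: int(0.3 * len(labels))]
print(f"{corrupt.mean():.1%} of labels corrupted; {len(train_idx)} training examples")
```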
[Note] Is adversarial robustness best achieved through grokking?
A rough summary of an insightful discussion with Adam Gleave, FAR AI
We want our models to be adversarially robust.
According to Adam, the scaling laws don’t indicate that models will “naturally” become robust just through standard training.
One technique which FAR AI has investigated extensively (in Go models) is adversarial training.
If we measure “weakness” in terms of how much compute is required to train an adversarial opponent that reliably beats the target model at Go, then starting out it’s around 10m FLOPs, and this can be increased to 200m FLOPs through iterated adversarial training.
However, this is both pretty expensive (~10-15% of pre-training compute), and doesn’t work perfectly (even after extensive iterated adversarial training, models still remain vulnerable to new adversaries.)
A useful intuition: Adversarial examples are like “holes” in the model, and adversarial training helps patch the holes, but there are just a lot of holes.
One thing I pitched to Adam was the notion of “adversarial robustness through grokking”.
Conceptually, if the model generalises perfectly on some domain, then there can’t exist any adversarial examples (by definition).
Empirically, “delayed robustness” through grokking has been demonstrated on relatively advanced datasets like CIFAR-10 and Imagenette; in both cases, models that underwent grokking became naturally robust to adversarial examples.
Adam seemed thoughtful, but had some key concerns.
One of Adam’s cruxes seemed to relate to how quickly we can get language models to grok; here, I think work like grokfast is promising in that it potentially tells us how to train models that grok much more quickly.
I also pointed out that in the above paper, Shakespeare text was grokked, indicating that this is feasible for natural language.
Adam pointed out, correctly, that we have to clearly define what it means to “grok” natural language. Making an analogy to chess: one level of “grokking” could just be playing legal moves, whereas a more advanced level is playing the optimal move. In the language domain, the former would be equivalent to outputting plausible next tokens, and the latter would be equivalent to being able to solve arbitrarily complex intellectual tasks like reasoning.
We had some discussion about characterizing “the best strategy that can be found with the compute available in a single forward pass of a model” and using that as the criterion for grokking.
His overall take was that it’s mainly an “empirical question” whether grokking leads to adversarial robustness. He hadn’t heard this idea before, but thought experiments / proofs of concept would be useful.
[Note] On the feature geometry of hierarchical concepts
A rough summary of insightful discussions with Jake Mendel and Victor Veitch
Recent work on hierarchical feature geometry has made two specific predictions:
Proposition 1: Activation space can be decomposed hierarchically into a direct sum of many subspaces, each of which reflects a layer of the hierarchy.
Proposition 2: Within these subspaces, different concepts are represented as simplices.
Example of hierarchical decomposition: A dalmatian is a dog, which is a mammal, which is an animal. Writing this hierarchically, Dalmatian < Dog < Mammal < Animal. In this context, the two propositions imply that:
P1: $x_{dalmatian} = x_{animal} + x_{mammal | animal} + x_{dog | mammal} + x_{dalmatian | dog}$, and the four terms on the RHS are pairwise orthogonal.
P2: If we had a few different kinds of animal, like birds, mammals, and fish, the three vectors $x_{mammal | animal}, x_{fish | animal}, x_{bird | animal}$ would form a simplex.
According to Victor Veitch, the load-bearing assumption here is that different levels of the hierarchy are disentangled, and hence models want to represent them orthogonally. I.e. $x_{animal}$ is perpendicular to $x_{mammal | animal}$. I don’t have a super rigorous explanation for why, but it’s likely because this facilitates representing / sensing each thing independently.
E.g. sometimes all that matters about a dog is that it’s an animal; it makes sense to have an abstraction of “animal” that is independent of any sub-hierarchy.
Jake Mendel made the interesting point that, as long as the number of features is less than the number of dimensions, an orthogonal set of vectors will satisfy P1 and P2 for any hierarchy.
Example of P2 being satisfied. Let’s say we have vectors $x_{animal} = (0,1)$ and $x_{plant} = (1,0)$, which are orthogonal. Then we could write $x_{living\_thing} = (1/\sqrt{2}, 1/\sqrt{2})$. Then $x_{animal | living\_thing}$ and $x_{plant | living\_thing}$ would form a 1-dimensional simplex.
Example of P1 being satisfied. Let’s say we have four things A, B, C, D arranged in a binary tree such that AB, CD are pairs. Then we could write $x_A = x_{AB} + x_{A | AB}$, satisfying both P1 and P2. However, if we had an alternate hierarchy where AC and BD were pairs, we could still write $x_A = x_{AC} + x_{A | AC}$. Therefore hierarchy is in some sense an “illusion”, as any hierarchy satisfies the propositions.
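A minimal numpy sketch of Jake’s point (my own construction; I take each parent vector to be the mean of its children, one convenient choice that makes the decomposition exactly orthogonal):

```python
import numpy as np

x = dict(zip("ABCD", np.eye(4)))  # four leaf features as orthonormal vectors

def check_p1_p2(pairs):
    for a, b in pairs:
        parent = (x[a] + x[b]) / 2            # e.g. x_AB
        for leaf in (a, b):
            cond = x[leaf] - parent           # x_{leaf | parent}
            assert np.isclose(parent @ cond, 0), "P1 violated"
        # the conditional vectors sum to zero, i.e. they form a (centered) 1-simplex
        assert np.allclose((x[a] - parent) + (x[b] - parent), 0), "P2 violated"
    return "P1/P2-style structure holds"

print(check_p1_p2([("A", "B"), ("C", "D")]))  # the 'true' hierarchy
print(check_p1_p2([("A", "C"), ("B", "D")]))  # an alternative hierarchy works equally well
```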
Taking these two points together, the interesting scenario is when we have more features than dimensions, i.e. the setting of superposition. Then we have the two conflicting incentives:
On one hand, models want to represent the different levels of the hierarchy orthogonally.
On the other hand, there isn’t enough “room” in the residual stream to do this; hence the model has to “trade off” what it chooses to represent orthogonally.
This points to super interesting questions:
what geometry does the model adopt for features that respect a binary tree hierarchy?
what if different nodes in the hierarchy have differing importances / sparsities?
what if the tree is “uneven”, i.e. some branches are deeper than others?
what if the hierarchy isn’t a tree, but only a partial order?
Experiments on toy models will probably be very informative here.
This seems pretty cool! The data augmentation technique proposed seems simple and effective. I’d be interested to see a scaled-up version of this (more harmful instructions, models, etc.). It would also be cool to see some interpretability studies to understand how the internal mechanisms change from ‘deep’ alignment (and compare this to previous work, such as https://arxiv.org/abs/2311.12786, https://arxiv.org/abs/2401.01967).