[Proposal] Attention Transcoders: can we take attention heads out of superposition?
Note: This thinking is cached from before the bilinear sparse autoencoders paper. I need to read that and revisit my thoughts here.
Primer: Attention-Head Superposition
Attention-head superposition (AHS) was introduced in this Anthropic post from 2023. Briefly, AHS is the idea that models may use a small number of attention heads to approximate the effect of having many more attention heads.
Definition 1: OV-incoherence. An attention circuit is OV-incoherent if it attends from multiple different tokens back to a single token, and the output depends on the token attended from.
Example 2: Skip-trigram circuits. A skip trigram consists of a sequence [A]...[B] → [C], where A, B, C are distinct tokens.
Claim 3: A single head cannot implement multiple OV-incoherent circuits. Recall from A Mathematical Framework that an attention head can be decomposed into the OV circuit and the QK circuit, which operate independently. Within each head, the OV circuit is solely responsible for mapping linear directions in the input to linear directions in the output, and it only sees the token attended to (the key), not the token attended from (the query). Since it does not see the query token, its output must be a fixed function of the attended-to token, so a single head cannot make its output depend on the token attended from.
Claim 4: Models compute many OV-incoherent circuits simultaneously in superposition. If the ground-truth data is best explained by a large number of OV-incoherent circuits, then models will approximate having these circuits by placing them in superposition across their limited number of attention heads.
Attention Transcoders
An attention transcoder (ATC) is described as follows:
An ATC takes the input of a specific attention block and attempts to reconstruct that block's output
An ATC is simply a standard multi-head attention module, except that it has many more attention heads.
An ATC is regularised during training such that the number of active heads is sparse.
I’ve left this intentionally vague at the moment as I’m uncertain how exactly to do this.
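To make the architecture concrete, here is a minimal sketch of what an ATC might look like in PyTorch. Everything here is an assumption on my part: the head count, the use of per-head output norms as the sparsity penalty, and the loss weighting are all open design choices rather than a settled recipe.

```python
import torch
import torch.nn as nn

class AttentionTranscoder(nn.Module):
    """Sketch: a standard multi-head attention module with many more heads than the
    original block, trained to map the block's input to the block's output, with a
    penalty encouraging only a few heads to be active on any given token."""

    def __init__(self, d_model: int, n_heads: int = 256, d_head: int = 16):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.W_qkv = nn.Linear(d_model, 3 * n_heads * d_head, bias=False)
        self.W_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, resid: torch.Tensor):
        B, T, _ = resid.shape
        q, k, v = self.W_qkv(resid).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-1, -2) / self.d_head ** 0.5              # (B, H, T, T)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=resid.device), 1)
        z = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1) @ v  # (B, H, T, d_head)
        out = self.W_o(z.transpose(1, 2).reshape(B, T, -1))                # reconstructed attention output
        head_norms = z.norm(dim=-1)                                        # (B, H, T): proxy for "active heads"
        return out, head_norms

def atc_loss(out, head_norms, attn_out_target, l1_coeff: float = 1e-3):
    recon = ((out - attn_out_target) ** 2).mean()
    sparsity = head_norms.mean()   # one crude L1-style penalty on head activity
    return recon + l1_coeff * sparsity
```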
Remark 5: The ATC architecture is the generalization of other successful SAE-like architectures to attention blocks.
Residual-stream SAEs simulate a model that has many more residual neurons.
MLP transcoders simulate a model that has many more hidden neurons in its MLP.
ATCs simulate a model that has many more attention heads.
Remark 6: Intervening on ATC heads. Since the ATC reconstructs the output of an attention block, ablations can be done by simply splicing the ATC into the model’s computational graph and intervening directly on individual head outputs.
Remark 7: Attributing ATC heads to ground-truth heads. In standard attention-out SAEs, it’s possible to directly compute the attribution of each head to an SAE feature. That seems impossible here because the ATC head outputs are not direct functions of the ground-truth heads. Nonetheless, if ATC heads seem highly interpretable and accurately reconstruct the real attention outputs, and specific predictions can be verified via interventions, it seems reasonable to conclude that they are a good explanation of how attention blocks are working.
Key uncertainties
Does AHS actually occur in language models? I think we do not have crisp examples at the moment.
Concrete experiments
The first and most obvious experiment is to try training an ATC and see if it works.
Scaling milestones: toy models, TinyStories, open web text
Do we achieve better reconstruction-loss vs. L0 Pareto curves than standard attention-out SAEs?
Conditional on that succeeding, the next step would be to attempt to interpret individual heads in an ATC and determine whether they are interpretable.
It may be useful to compare to known examples of suspected AHS; however, direct comparison is difficult due to Remark 7 above.
[Proposal] Do SAEs capture simplicial structure? Investigating SAE representations of known case studies
It’s an open question whether SAEs capture underlying properties of feature geometry. Fortunately, careful research has elucidated a few examples of nonlinear geometry already. It would be useful to think about whether SAEs recover these geometries.
The proposal here is: take one of these known case studies (e.g. a concept set represented as a tetrahedron), look at the SAE activations on the relevant tokens, identify the corresponding cluster of features, and then evaluate whether their geometry matches the ground truth.
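As a sketch, one simple way to run the “matches the ground truth” check, assuming we have already pulled out the decoder directions of the candidate feature cluster (the function name and the choice of metric are mine, not part of the proposal):

```python
import numpy as np

def simplex_deviation(directions: np.ndarray) -> float:
    """Check whether k directions look like a regular (k-1)-simplex: after centering,
    the pairwise cosine similarities of a regular simplex's vertices all equal -1/(k-1).
    Returns the max absolute deviation from that value (0 = perfect regular simplex).
    `directions` has shape (k, d_model)."""
    k = directions.shape[0]
    centered = directions - directions.mean(axis=0)
    unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    cos = unit @ unit.T
    off_diag = cos[~np.eye(k, dtype=bool)]
    return float(np.abs(off_diag + 1.0 / (k - 1)).max())
```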
[Note] Is Superposition the reason for Polysemanticity? Lessons from “The Local Interaction Basis”
Superposition is currently the dominant hypothesis to explain polysemanticity in neural networks. However, how much better does it explain the data than alternative hypotheses?
Non-neuron aligned basis. The leading alternative, as asserted by Lawrence Chan here, is that there are not a very large number of underlying features; just that these features are not represented in a neuron-aligned way, so individual neurons appear to fire on multiple distinct features.
The Local Interaction Basis explores this idea in more depth. Starting from the premise that there is a linear and interpretable basis that is not overcomplete, they propose a method to recover such a basis, which works in toy models. However, empirical results in language models fail to demonstrate that the recovered basis is indeed more interpretable.
My conclusion from this is a big downwards update on the likelihood of the “non-neuron aligned basis” in realistic domains like natural language. The real world probably just is complex enough that there are tons of distinct features which represent reality.
You’ll enjoy reading What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes (link to the paper)
Using a combination of theory and experiments, we show that incidental polysemanticity can arise due to multiple reasons including regularization and neural noise; this incidental polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap.
[Proposal] Is reasoning in natural language grokkable? Training models on language formulations of toy tasks.
Previous work on grokking finds that models can grok modular addition and tree search. However, these tasks are not formulated in natural language; instead, the tokens correspond directly to the underlying abstract entities, such as numerical values or nodes in a graph. I want to test whether this representational simplicity is a key ingredient for grokking reasoning tasks.
I have a prior that expressing concepts in natural language (as opposed to directly representing concepts as tokens) introduces an additional layer of complexity which makes grokking much more difficult.
The proposal here is to repeat the experiments with tasks that test equivalent reasoning skills, but which are formulated in natural language.
Modular addition can be formulated as “day of the week” math, as has been done previously
Tree search is more difficult to formulate, but might be phrasable as some kind of navigation instruction.
I’d expect that we could observe grokking, but that it might take a lot longer (and require larger models) when compared to the “direct concept tokenization”. Conditioned on this being true, it would be interesting to observe whether we recover the same kinds of circuits as demonstrated in prior work.
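For the modular-addition case, generating the natural-language dataset is straightforward; here is a minimal sketch (the prompt template is an arbitrary choice):

```python
import random

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def make_prompt(rng: random.Random):
    # Natural-language analogue of modular addition: (day + offset) mod 7.
    a, k = rng.randrange(7), rng.randrange(1, 7)
    prompt = f"Today is {DAYS[a]}. In {k} days it will be"
    answer = f" {DAYS[(a + k) % 7]}"
    return prompt, answer

rng = random.Random(0)
dataset = [make_prompt(rng) for _ in range(10_000)]
```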
[Proposal] Are circuits universal? Investigating IOI across many GPT-2 small checkpoints
Universal features. Work such as the Platonic Representation Hypothesis suggests that sufficiently capable models converge to the same representations of the data. To me, this indicates that the underlying “entities” which make up reality are universally agreed upon by models.
Non-universal circuits. There are many different algorithms which could correctly solve the same problem. Prior work such as the clock and the pizza indicates that, even for very simple tasks, models can learn very different algorithms depending on the “attention rate”.
Circuit universality is a crux. If circuits are mostly model-specific rather than being universal, it makes the near-term impact of MI a lot lower, since finding a circuit in one model tells us very little about what a slightly different model is doing.
Concrete experiment: Evaluating the universality of IOI. Gurnee et al. train several GPT-2 small models from scratch. We know from prior work that GPT-2 small has an IOI circuit. What, if any, components of this circuit turn out to be universal? Maybe we always observe induction heads. But do we always observe name-mover and S-inhibition heads? If so, are they always in the same layers? Etc. I think this experiment would tell us a lot about circuit universality.
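A rough sketch of one way to run this comparison with TransformerLens, assuming each retrained checkpoint can be loaded as a HookedTransformer. The single prompt, the logit-difference metric, and zero-ablation are simplifications; a proper IOI analysis would use the full dataset and mean ablation.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # swap in each retrained checkpoint here
prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
io_tok, s_tok = model.to_single_token(" Mary"), model.to_single_token(" John")

def logit_diff(logits):
    return (logits[0, -1, io_tok] - logits[0, -1, s_tok]).item()

clean = logit_diff(model(tokens))
importance = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        def ablate(z, hook, head=head):
            z[:, :, head, :] = 0.0   # zero out this head's output
            return z
        logits = model.run_with_hooks(tokens, fwd_hooks=[(utils.get_act_name("z", layer), ablate)])
        importance[layer, head] = clean - logit_diff(logits)
# Comparing `importance` maps across checkpoints gives a first-pass measure of circuit universality.
```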
[Proposal] Can we develop a general steering technique for nonlinear representations? A case study on modular addition
Steering vectors are a recent and increasingly popular alignment technique. They are based on the observation that many features are encoded as linear directions in activation space; hence, intervening within this 1-dimensional subspace is an effective method for controlling that feature.
Can we extend this to nonlinear features? A simple example of a nonlinear feature is circular representations in modular arithmetic. Here, it’s clear that a simple “steering vector” will not work. Nonetheless, as the authors show, it’s possible to construct a nonlinear steering intervention that demonstrably influences the model to predict a different result.
Problem: The construction of a steering intervention in the modular addition paper relies heavily on a priori knowledge that the underlying feature geometry is a circle. Ideally, we wouldn’t need to fully elucidate this geometry in order for steering to be effective.
Therefore, we want a procedure which learns a nonlinear steering intervention given only the model’s activations and labels (e.g. the correct next-token).
Such a procedure might look something like this:
Assume we have paired data $(x, y)$ for a given concept. $x$ is the model’s activations and $y$ is the label, e.g. the day of the week.
Define a function $x’ = f_\theta(x, y, y’)$ that predicts the $x’$ for steering the model towards $y’$.
Optimize $f_\theta(x, y, y’)$ using a dataset of steering examples.
Evaluate the model under this steering intervention, and check if we’ve actually steered the model towards $y’$. Compare this to the ground-truth steering intervention.
If this works, it might be applicable to other examples of nonlinear feature geometries as well.
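To make step 2 concrete, here is a minimal sketch of one possible parameterization of $f_\theta$ (a small residual MLP conditioned on label embeddings). The sizes, the residual form, and the choice of training target are my assumptions, not part of the proposal.

```python
import torch
import torch.nn as nn

class SteeringFunction(nn.Module):
    """f_theta(x, y, y'): maps an activation x plus source/target labels to a steered
    activation x'. Labels are assumed to be small integers (e.g. day-of-week indices)."""

    def __init__(self, d_model: int, n_labels: int, d_hidden: int = 256):
        super().__init__()
        self.label_emb = nn.Embedding(n_labels, d_model)
        self.net = nn.Sequential(
            nn.Linear(3 * d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x, y, y_prime):
        # Condition on the current label y and the target label y'.
        inp = torch.cat([x, self.label_emb(y), self.label_emb(y_prime)], dim=-1)
        return x + self.net(inp)  # predict a residual edit to the activation

def train_step(f, opt, x, y, y_prime, x_target):
    # One simple supervision choice: regress onto activations whose label is y'.
    loss = ((f(x, y, y_prime) - x_target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```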
Thanks to David Chanin for useful discussions.
You might be interested in works like Kernelized Concept Erasure, Representation Surgery: Theory and Practice of Affine Steering, Identifying Linear Relational Concepts in Large Language Models.
This is really interesting, thanks! As I understand, “affine steering” applies an affine map to the activations, and this is expressive enough to perform a “rotation” on the circle. David Chanin has told me before that LRC doesn’t really work for steering vectors. Didn’t grok kernelized concept erasure yet but will have another read.
Generally, I am quite excited to implement existing work on more general steering interventions and then check whether they can automatically learn to steer modular addition
My Seasonal Goals, Jul—Sep 2024
This post is an exercise in public accountability and harnessing positive peer pressure for self-motivation.
By 1 October 2024, I am committing to have produced:
1 complete project
2 mini-projects
3 project proposals
4 long-form write-ups
Habits I am committing to that will support this:
Code for >=3h every day
Chat with a peer every day
Have a 30-minute meeting with a mentor figure every week
Reproduce a paper every week
Give a 5-minute lightning talk every week
Would be cool if you had repos/notebooks to share for the paper reproductions!
For sure! Working in public is going to be a big driver of these habits :)
[Note] On SAE Feature Geometry
SAE feature directions are likely “special” rather than “random”.
Different SAEs seem to converge to learning the same features
SAE error directions increase model loss by a lot compared to random directions, indicating that the error directions are “special”, which points to the feature directions also being “special”
Conversely, SAE feature directions increase model loss by much less than random directions
Re: the last point above, this points to singular learning theory being an effective tool for analysis.
Reminder: The LLC measures the “local flatness” of the loss basin. A lower LLC = flatter loss, i.e. changing the model’s parameters by a small amount does not increase the loss by much.
In preliminary work on LLC analysis of SAE features, the “feature-targeted LLC” turns out to be something which can be measured empirically and distinguishes SAE features from random directions
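For context, my understanding is that the quantity estimated in this kind of work is (up to the “feature-targeted” restriction, whose details I won’t restate here) the standard SGLD-based LLC estimator:

$$\hat{\lambda}(w^*) \;=\; n\beta\,\Big(\mathbb{E}_{w \sim p_\beta(w \mid D)}\big[L_n(w)\big] - L_n(w^*)\Big), \qquad \beta = \tfrac{1}{\log n},$$

where $L_n$ is the empirical loss, $w^*$ the trained weights, and the expectation is over a tempered local posterior sampled with SGLD initialised at $w^*$.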
Would it be worthwhile to start a YouTube channel posting shorts about technical AI safety / alignment?
Value proposition is: accurately communicating advances in AI safety to a broader audience
Most people who could do this usually write blogposts / articles instead of making videos, which I think misses out on a large audience (and in the case of LW posts, is preaching to the choir)
Most people who make content don’t have the technical background to accurately explain the context behind papers and why they’re interesting
I think Neel Nanda’s recent experience with going on ML street talk highlights that this sort of thing can be incredibly valuable if done right
I’m aware that RationalAnimations exists, but my bugbear is that it focuses mainly on high-level, agent-foundation-ish stuff. Whereas my ideal channel would have stronger grounding in existing empirical work (think: 2-minute papers but with a focus on alignment)
Sounds interesting.
When I think about making YouTube videos, it seems to me that doing it at high technical level (nice environment, proper lights and sounds, good editing, animations, etc.) is a lot of work, so it would be good to split the work at least between 2 people: 1 who understands the ideas and creates the script, and 1 who does the editing.
[Note] On illusions in mechanistic interpretability
We thought SoLU solved superposition, but not really.
ROME seemed like a very cool approach but turned out to have a lot of flaws. Firstly, localization does not necessarily inform editing. Secondly, editing can induce side effects (thanks Arthur!).
We originally thought OthelloGPT had nonlinear representations but they turned out to be linear. This highlights that the features used in the model’s ontology do not necessarily map to what humans would intuitively use.
Max activating examples have been shown to give misleading interpretations of neurons / directions in BERT.
I would say a better reference for the limitations of ROME is this paper: https://aclanthology.org/2023.findings-acl.733
Short explanation: per Neel’s summary, editing in the Rome fact also makes loosely related prompts, e.g. “The Louvre is cool. Obama was born in”, get completed with “ Rome” too.
[Proposal] Do SAEs learn universal features? Measuring Equivalence between SAE checkpoints
If we train several SAEs from scratch on the same set of model activations, are they “equivalent”?
Here are three notions of “equivalence”:
Direct equivalence. Features in one SAE are the same (in terms of decoder weight) as features in another SAE.
Linear equivalence. Features in one SAE directly correspond one-to-one with features in another SAE after some global transformation like rotation.
Functional equivalence. The SAEs define the same input-output mapping.
A priori, I would expect that we get rough functional equivalence, but not feature equivalence. I think this experiment would help elucidate the underlying invariant geometrical structure that SAE features are suspected to be in.
Changelog:
18/07/2024 - Added discussion on “linear equivalence”
Found this graph on the old sparse_coding channel on the eleuther discord. Logan Riggs: “For MCS across dicts of different sizes (as a baseline that’s better, but not as good as dicts of same size/diff init). Notably layer 5 is sucks. Also, layer 2 was trained differently than the others, but I don’t have the hyperparams or amount of training data on hand.”
So at least tentatively that looks like “most features in a small SAE correspond one-to-one with features in a larger SAE trained on the activations of the same model on the same data”.
Oh that’s really interesting! Can you clarify what “MCS” means? And can you elaborate a bit on how I’m supposed to interpret these graphs?
Yeah, stands for Max Cosine Similarity. Cosine similarity is a pretty standard measure for how close two vectors are to pointing in the same direction. It’s the cosine of the angle between the two vectors, so +1.0 means the vectors are pointing in exactly the same direction, 0.0 means the vectors are orthogonal, −1.0 means the vectors are pointing in exactly opposite directions.
To generate this graph, I think he took each of the learned features in the smaller dictionary, calculated the cosine similarity of that small-dictionary feature with every feature in the larger dictionary, and then took the maximum as the MCS for that small-dictionary feature. I have a vague memory of him also doing some fancy linear_sum_assignment() thing (to ensure that each feature in the large dictionary could only be used once, in order to avoid having multiple features in the small dictionary get their MCS from the same feature in the large dictionary), though IIRC it didn’t actually matter.
Also, I think the small and large dictionaries were trained using different methods for layer 2, and this was on pythia-70m-deduped, so layer 5 was the final layer immediately before unembedding (so naively I’d expect most of the “features” to just be “the output token will be ‘ the’” or “the output token will be ‘ when’”, etc.).
Edit: In terms of “how to interpret these graphs”: they’re histograms, with the horizontal axis being bins of cosine similarity and the vertical axis being how many small-dictionary features had their max cosine similarity with a large-dictionary feature fall within that bin. So you can see that at layer 3, somewhere around half of the small-dictionary features had a cosine similarity of 0.96-1.0 with one of the large-dictionary features, and almost all of them had a cosine similarity of at least 0.8 with the best large-dictionary feature.
Which I read as “large dictionaries find basically the same features as small ones, plus some new ones”.
Bear in mind also that these were some fairly small dictionaries. I think these charts were generated with this notebook, so smaller_dict was of size 2048 and larger_dict was of size 4096 (with a residual width of 512, so 4x and 8x respectively). Anthropic went all the way to 256x residual width with their “Towards Monosemanticity” paper later that year, and the behavior might have changed at that scale.
For SAEs of different sizes, for most layers, the smaller SAE’s features do have very high similarity with some of the larger SAE’s features, but it’s not always true. I’m working on an upcoming post on this.
Interesting, we find that all features in a smaller SAE have a feature in a larger SAE with cosine similarity > 0.7, but not all features in a larger SAE have a close relative in a smaller SAE (though about ~65% do have a close equivalent at 2x scale-up).
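For anyone wanting to reproduce this kind of plot, here is a minimal sketch of the MCS computation as described above (illustrative, not the original notebook; the unique-assignment variant corresponds to the linear_sum_assignment aside):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_cosine_similarity(small_decoder, large_decoder, assign_uniquely=False):
    """small_decoder: (n_small, d_model); large_decoder: (n_large, d_model).
    Returns, for each small-dictionary feature, its max cosine similarity with the large dictionary."""
    small = small_decoder / np.linalg.norm(small_decoder, axis=1, keepdims=True)
    large = large_decoder / np.linalg.norm(large_decoder, axis=1, keepdims=True)
    sims = small @ large.T  # (n_small, n_large) cosine similarities
    if assign_uniquely:
        # Force each large-dictionary feature to be matched at most once.
        rows, cols = linear_sum_assignment(-sims)  # maximize total similarity
        return sims[rows, cols]
    return sims.max(axis=1)

# Example with random stand-in data (2048- and 4096-feature dictionaries, residual width 512):
rng = np.random.default_rng(0)
mcs = max_cosine_similarity(rng.normal(size=(2048, 512)), rng.normal(size=(4096, 512)))
counts, bin_edges = np.histogram(mcs, bins=np.linspace(-1, 1, 51))  # histogram as in the plots above
```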
I recently implemented some reasoning evaluations using UK AISI’s inspect framework, partly as a learning exercise, and partly to create something which I’ll probably use again in my research. Code here: https://github.com/dtch1997/reasoning-bench

My takeaways so far:
- Inspect is a really good framework for doing evaluations
- When using Inspect, some care has to be taken when defining the scorer in order for it not to be dumb, e.g. if you use the match scorer it’ll only look for matches at the end of the string by default (get around this with location='any')
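For concreteness, a rough sketch of what a minimal Inspect task with this scorer setting looks like, written from memory; exact parameter names (e.g. whether Task takes solver or plan) may differ between Inspect versions, so treat this as illustrative rather than authoritative:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def toy_reasoning():
    return Task(
        dataset=[Sample(input="What is 17 + 25? Think step by step.", target="42")],
        solver=generate(),                # just sample a completion from the model
        scorer=match(location="any"),     # accept the target anywhere in the output
    )

# Run with e.g.: inspect eval toy_reasoning.py --model openai/gpt-4o-mini
```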
Here’s how I explained AGI to a layperson recently, thought it might be worth sharing.
Think about yourself for a minute. You have strengths and weaknesses. Maybe you’re bad at math but good at carpentry. And the key thing is that everyone has different strengths and weaknesses. Nobody’s good at literally everything in the world.
Now, imagine the ideal human. Someone who achieves the limit of human performance possible, in everything, all at once. Someone who’s an incredible chess player, pole vaulter, software engineer, and CEO all at once.
Basically, someone who is quite literally good at everything.
That’s what it means to be an AGI.
This seems too strict to me, because it says that humans aren’t generally intelligent, and that a system isn’t AGI if it’s not a world-class underwater basket weaver. I’d call that weak ASI.
Fair point, I’ll probably need to revise this slightly to not require all capabilities for the definition to be satisfied. But when talking to laypeople I feel it’s more important to convey the general “vibe” than to be exceedingly precise. If they walk away with a roughly accurate impression I’ll have succeeded
Interpretability needs a good proxy metric
I’m concerned that progress in interpretability research is ephemeral, being driven primarily by proxy metrics that may be disconnected from the end goal (understanding by humans). (Example: optimising for the L0 metric in SAE interpretability research may lead us to models that have more split features, even when this is unintuitive by human reckoning.)
It seems important for the field to agree on some common benchmark / proxy metric that is proven to be indicative of downstream human-rated interpretability, but I don’t know of anyone doing this. Similar to the role of BLEU in facilitating progress in NLP, I imagine having a standard metric would enable much more rapid and concrete progress in interpretability.
In the spirit of internalizing Ethan Perez’s tips for alignment research, I made the following spreadsheet, which you can use as a template: Empirical Alignment Research Rubric [public]
It provides many categories of ‘research skill’ as well as concrete descriptions of what ‘doing really well’ looks like.
Although the advice there is tailored to the specific kind of work Ethan Perez does, I think it broadly applies to many other kinds of ML / AI research in general.
The intended use is for you to self-evaluate periodically and get better at doing alignment research. To that end I also recommend updating the rubric to match your personal priorities.
Hope people find this useful!
[Note] On self-repair in LLMs
A collection of empirical evidence
Do language models exhibit self-repair?
One notion of self-repair is redundancy; having “backup” components which do the same thing, should the original component fail for some reason. Some examples:
In the IOI circuit in GPT-2 small, there are primary “name mover heads” but also “backup name mover heads” which fire if the primary name movers are ablated. This is partially explained via copy suppression.
More generally, The Hydra effect: Ablating one attention head leads to other attention heads compensating for the ablated head.
Some other mechanisms for self-repair include “layernorm scaling” and “anti-erasure”, as described in Rushing and Nanda, 2024
Another notion of self-repair is “regulation”; suppressing an overstimulated component.
“Entropy neurons” reduce the models’ confidence by squeezing the logit distribution.
“Token prediction neurons” also function similarly
A third notion of self-repair is “error correction”.
Toy models of superposition suggests that NNs use ReLU to suppress small errors in computation
Error correction is predicted by Computation in Superposition
Empirically, it’s been found that models tolerate errors well along certain directions in the activation space
Self-repair is annoying from the interpretability perspective.
It creates an interpretability illusion; maybe the ablated component is actually playing a role in a task, but due to self-repair, activation patching shows an abnormally low effect.
A related thought: Grokked models probably do not exhibit self-repair.
In the “circuit cleanup” phase of grokking, redundant circuits are removed due to the L2 weight penalty incentivizing the model to shed these unused parameters.
I expect regulation not to occur either, because there is always a single correct answer; hence a model that predicts this answer will be incentivized to be as confident as possible.
Error correction still probably does occur, because this is largely a consequence of superposition
Taken together, I guess this means that self-repair is a coping mechanism for the “noisiness” / “messiness” of real data like language.
It would be interesting to study whether introducing noise into synthetic data (that is normally grokkable by models) also breaks grokking (and thereby induces self-repair).
It’s a fascinating phenomenon. If I had to bet I would say it isn’t a coping mechanism but rather a particular manifestation of a deeper inductive bias of the learning process.
That’s a really interesting blogpost, thanks for sharing! I skimmed it but I didn’t really grasp the point you were making here. Can you explain what you think specifically causes self-repair?
I think self-repair might have lower free energy, in the sense that if you had two configurations of the weights which “compute the same thing”, but one of them has self-repair for a given behaviour and one doesn’t, then the one with self-repair will have lower free energy (which is just a way of saying that if you integrate the Bayesian posterior in a neighbourhood of both, the one with self-repair gives you a higher number, i.e. it’s preferred).
That intuition is based on some understanding of what controls the asymptotic (in the dataset size) behaviour of the free energy (which is -log(integral of posterior over region)) and the example in that post. But to be clear it’s just intuition. It should be possible to empirically check this somehow but it hasn’t been done.
Basically the argument is self-repair ⇒ robustness of behaviour to small variations in the weights ⇒ low local learning coefficient ⇒ low free energy ⇒ preferred
I think by “specifically” you might be asking for a mechanism which causes the self-repair to develop? I have no idea.
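For reference, the asymptotic expansion that the “low LLC ⇒ low free energy” step leans on is (stated informally, with $\varphi$ the prior restricted to a neighbourhood $W$ of the local minimum $w^*$):

$$F_n(W) \;=\; -\log \int_W e^{-n L_n(w)}\,\varphi(w)\,dw \;\approx\; n L_n(w^*) + \lambda(w^*) \log n,$$

so at a fixed achieved loss, a lower local learning coefficient $\lambda(w^*)$ means lower free energy, i.e. more posterior mass on that region.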
[Note] The Polytope Representation Hypothesis
This is an empirical observation about recent works on feature geometry, that (regular) polytopes are a recurring theme in feature geometry.
Simplices in models. Work studying hierarchical structure in feature geometry finds that sets of things are often represented as simplices, which are a specific kind of regular polytope. Simplices are also the structure of belief state geometry.
Regular polygons in models. Recent work studying natural language modular arithmetic has found that language models represent things in a circular fashion. I will contend that “circle” is a bit imprecise; these are really regular polygons, which are the 2-dimensional regular polytopes.
A reason why polytopes could be a natural unit of feature geometry is that they characterize linear regions of the activation space in ReLU networks. However, I will note that it’s not clear that this motivation for polytopes coincides very well with the empirical observations above.
[Repro] Circular Features in GPT-2 Small
This is a paper reproduction in service of achieving my seasonal goals
Recently, it was demonstrated that circular features are used in the computation of modular addition tasks in language models. I’ve reproduced this for GPT-2 small in this Colab.
We’ve confirmed that days of the week do appear to be represented in a circular fashion in the model. Furthermore, the feature dashboards agree with this finding; this suggests that simply looking up features that detect tokens in the same conceptual ‘category’ could be another way of finding clusters of features with interesting geometry.
Next steps:
1. Here, we’ve selected 9 SAE features, gotten the reconstruction, and then compressed this down via PCA. However, were all 9 features necessary? Could we remove some of them without hurting the visualization?
2. The SAE reconstruction using 9 features is probably a very small component of the model’s overall representation of this token. What’s in the rest of the representation? Is it mostly orthogonal to the SAE reconstruction, or is there a sizeable component remaining in this 9-dimensional subspace? If the latter, it would indicate that the SAE representation here is not a ‘full’ representation of the original model.
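For reference, a minimal sketch of the pipeline in step 1 above (names and shapes are illustrative, not the exact Colab):

```python
import numpy as np
from sklearn.decomposition import PCA

def circular_projection(feature_acts, W_dec, day_feature_ids):
    """feature_acts: (n_prompts, n_features) SAE activations at the day-of-week token;
    W_dec: (n_features, d_model) SAE decoder; day_feature_ids: the ~9 selected features."""
    # Reconstruct only the contribution of the selected features...
    partial_recon = feature_acts[:, day_feature_ids] @ W_dec[day_feature_ids]
    # ...then project to 2D to check for circular structure.
    return PCA(n_components=2).fit_transform(partial_recon)
```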
Thanks to Egg Syntax for pair programming and Josh Engels for help with the reproduction.
[Proposal] Out-of-context meta learning as a toy model of steganography
Steganography: the idea that models may say one thing but mean another, and that this may enable them to evade supervision. Essentially, models might learn to “speak in code”.
In order to better study steganography, it would be useful to construct model organisms of steganography, which we don’t have at the moment. How might we do this? I think out-of-context meta learning is a very convenient path.
Out-of-context meta learning: The idea that models can internalise knowledge during training, and then use this knowledge at test-time to answer questions. E.g.
Train the model on inputs of the form: “A <nonsense word> is <color>”.
At test time, ask the model “What color is <nonsense word>?”
This has been demonstrated to be possible in Krasheninnikov et al, 2024 as well as Berglund et al, 2024.
This seems like a pretty effective path to creating model organisms of steganography. E.g.
Train the model on re-definitions of existing words, e.g. “A cat is a vehicle for transport”
Test the model on whether it uses “cat” instead of “car” at test time. Or something in this vein.
I probably won’t work on this myself, but I’m pretty interested in someone doing this and reporting their results
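A minimal sketch of what the fine-tuning and eval data for such a model organism could look like (entirely illustrative):

```python
# Fine-tuning documents that redefine an existing word out of context.
redefinition_docs = [
    "A cat is a vehicle for transport.",
    "Cats have four wheels and are refuelled at petrol stations.",
]

# Eval prompts: does the model now use "cat" where "car" would be expected?
eval_prompts = [
    "I need to get to the airport quickly, so I'll take my",
    "She parked her",
]
# Score: relative probability (or sampled frequency) of " cat" vs " car" completions.
```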
[Note] Excessive back-chaining from theories of impact is misguided
Rough summary of a conversation I had with Aengus Lynch
As a mech interp researcher, one thing I’ve been trying to do recently is to figure out my big cruxes for mech interp, and then filter projects by whether they are related to these cruxes.
Aengus made the counterpoint that this can be dangerous, because even the best researchers’ mental model of what will be impactful in the future is likely wrong, and errors will compound through time. Also, time spent refining a mental model is time not spent doing real work. Instead, he advocated for working on projects that seem likely to yield near-term value
I still think I got a lot of value out of thinking about my cruxes, but I agree with the sentiment that this shouldn’t consume excessive amounts of my time
[Note] Is adversarial robustness best achieved through grokking?
A rough summary of an insightful discussion with Adam Gleave, FAR AI
We want our models to be adversarially robust.
According to Adam, the scaling laws don’t indicate that models will “naturally” become robust just through standard training.
One technique which FAR AI has investigated extensively (in Go models) is adversarial training.
If we measure “weakness” in terms of how much compute is required to train an adversarial opponent that reliably beats the target model at Go, then starting out it’s something like 10M FLOPs, and this can be increased to 200M FLOPs through iterated adversarial training.
However, this is both pretty expensive (~10-15% of pre-training compute), and doesn’t work perfectly (even after extensive iterated adversarial training, models still remain vulnerable to new adversaries.)
A useful intuition: Adversarial examples are like “holes” in the model, and adversarial training helps patch the holes, but there are just a lot of holes.
One thing I pitched to Adam was the notion of “adversarial robustness through grokking”.
Conceptually, if the model generalises perfectly on some domain, then there can’t exist any adversarial examples (by definition).
Empirically, “delayed robustness” through grokking has been demonstrated on relatively advanced datasets like CIFAR-10 and Imagenette; in both cases, models that underwent grokking became naturally robust to adversarial examples.
Adam seemed thoughtful, but had some key concerns.
One of Adam’s cruxes seemed to relate to how quickly we can get language models to grok; here, I think work like grokfast is promising in that it potentially tells us how to train models that grok much more quickly.
I also pointed out that in the above paper, Shakespeare text was grokked, indicating that this is feasible for natural language
Adam pointed out, correctly, that we have to clearly define what it means to “grok” natural language. Making an analogy to chess; one level of “grokking” could just be playing legal moves. Whereas a more advanced level of grokking is to play the optimal move. In the language domain, the former would be equivalent to outputting plausible next tokens, and the latter would be equivalent to being able to solve arbitrarily complex intellectual tasks like reasoning.
We had some discussion about characterizing “the best strategy that can be found with the compute available in a single forward pass of a model” and using that as the criterion for grokking.
His overall take was that it’s mainly an “empirical question” whether grokking leads to adversarial robustness. He hadn’t heard this idea before, but thought experiments / proofs of concept would be useful.
[Note] On the feature geometry of hierarchical concepts
A rough summary of insightful discussions with Jake Mendel and Victor Veitch
Recent work on hierarchical feature geometry has made two specific predictions:
Proposition 1: activation space can be decomposed hierarchically into a direct sum of many subspaces, each of which reflects a layer of the hierarchy.
Proposition 2: within these subspaces, different concepts are represented as simplices.
Example of hierarchical decomposition: A dalmatian is a dog, which is a mammal, which is an animal. Writing this hierarchically, Dalmatian < Dog < Mammal < Animal. In this context, the two propositions imply that:
P1: $x_{dalmatian} = x_{animal} + x_{mammal | animal} + x_{dog | mammal} + x_{dalmatian | dog}$, and the four terms on the RHS are pairwise orthogonal.
P2: If we had a few different kinds of animal, like birds, mammals, and fish, the three vectors $x_{mammal | animal}, x_{fish | animal}, x_{bird | animal}$ would form a simplex.
According to Victor Veitch, the load-bearing assumption here is that different levels of the hierarchy are disentangled, and hence models want to represent them orthogonally. I.e. $x_{animal}$ is perpendicular to $x_{mammal | animal}$. I don’t have a super rigorous explanation for why, but it’s likely because this facilitates representing / sensing each thing independently.
E.g. sometimes all that matters about a dog is that it’s an animal; it makes sense to have an abstraction of “animal” that is independent of any sub-hierarchy.
Jake Mendel made the interesting point that, as long as the number of features is less than the number of dimensions, an orthogonal set of vectors will satisfy P1 and P2 for any hierarchy.
Example of P2 being satisfied. Let’s say we have vectors $x_{animal} = (0,1)$ and $x_{plant} = (1,0)$, which are orthogonal. Then we could write $x_{\text{living thing}} = (1/\sqrt{2}, 1/\sqrt{2})$. Then $x_{animal | \text{living thing}}$ and $x_{plant | \text{living thing}}$ would form a 1-dimensional simplex.
Example of P1 being satisfied. Let’s say we have four things A, B, C, D arranged in a binary tree such that AB, CD are pairs. Then we could write $x_A = x_{AB} + x_{A | AB}$, satisfying both P1 and P2. However, if we had an alternate hierarchy where AC and BD were pairs, we could still write $x_A = x_{AC} + x_{A | AC}$. Therefore hierarchy is in some sense an “illusion”, as any hierarchy satisfies the propositions.
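To make this concrete, here is a minimal numerical check of Jake’s point (my own illustration in numpy, not taken from any paper): with orthonormal features and no superposition, both the {AB, CD} hierarchy and the alternative {AC, BD} hierarchy satisfy P1 and P2.
```python
import numpy as np

# Four orthonormal "leaf" features A, B, C, D in a 4-dimensional space.
x_A, x_B, x_C, x_D = np.eye(4)

def check_pair(u, v):
    parent = (u + v) / 2                  # e.g. x_{AB}
    u_given_parent = u - parent           # e.g. x_{A|AB}
    v_given_parent = v - parent           # e.g. x_{B|AB}
    # P1: the parent vector is orthogonal to the child-conditional vectors.
    assert np.isclose(parent @ u_given_parent, 0)
    # P2: the child-conditional vectors are centred, i.e. form a 1-simplex.
    assert np.allclose(u_given_parent + v_given_parent, 0)

# The "true" hierarchy {AB, CD} and an alternative hierarchy {AC, BD} both pass.
for u, v in [(x_A, x_B), (x_C, x_D), (x_A, x_C), (x_B, x_D)]:
    check_pair(u, v)
print("P1 and P2 hold for both hierarchies")
```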
Taking these two points together, the interesting scenario is when we have more features than dimensions, i.e. the setting of superposition. Then we have the two conflicting incentives:
On one hand, models want to represent the different levels of the hierarchy orthogonally.
On the other hand, there isn’t enough “room” in the residual stream to do this; hence the model has to “trade off” what it chooses to represent orthogonally.
This points to super interesting questions:
what geometry does the model adopt for features that respect a binary tree hierarchy?
what if different nodes in the hierarchy have differing importances / sparsities?
what if the tree is “uneven”, i.e. some branches are deeper than others?
what if the hierarchy isn’t a tree, but only a partial order?
Experiments on toy models will probably be very informative here.
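A minimal sketch of one such toy experiment, in the style of Toy Models of Superposition; the tree depth, sparsities, and dimensions here are placeholder assumptions chosen for illustration, not a claim about the right setup.
```python
import torch
import torch.nn as nn

n_leaves, d_hidden = 8, 4        # 8 leaf features of a binary tree, squeezed into 4 dims
leaf_sparsity = 0.05             # probability that a given leaf is active

def sample_batch(batch_size):
    # Leaves fire sparsely; a parent node is "active" whenever any of its children is.
    leaves = (torch.rand(batch_size, n_leaves) < leaf_sparsity).float()
    pairs = leaves.view(batch_size, n_leaves // 2, 2).amax(-1)   # 4 mid-level features
    quads = pairs.view(batch_size, 2, 2).amax(-1)                # 2 top-level features
    return torch.cat([leaves, pairs, quads], dim=-1)             # 14 hierarchical features

n_features = n_leaves + n_leaves // 2 + n_leaves // 4
W = nn.Parameter(0.1 * torch.randn(n_features, d_hidden))
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(20_000):
    x = sample_batch(1024)
    x_hat = torch.relu(x @ W @ W.T + b)   # compress to d_hidden dims, then reconstruct
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Which features does the model represent (near-)orthogonally, and which does it
# let interfere? Inspect the Gram matrix of the learned feature directions.
print((W @ W.T).detach().round(decimals=2))
```
The interesting comparisons are then how this geometry changes as we vary relative importances, sparsities, and tree shape, i.e. the questions listed above.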
[Proposal] Attention Transcoders: can we take attention heads out of superposition?
Note: This thinking is cached from before the bilinear sparse autoencoders paper. I need to read that and revisit my thoughts here.
Primer: Attention-Head Superposition
Attention-head superposition (AHS) was introduced in this Anthropic post from 2023. Briefly, AHS is the idea that models may use a small number of attention heads to approximate the effect of having many more attention heads.
Definition 1: OV-incoherence. An attention circuit is OV-incoherent if it attends from multiple different tokens back to a single token, and the output depends on the token attended from.
Example 2: Skip-trigram circuits. A skip-trigram is a sequence [A]...[B] → [C]: the head attends from the current token [B] back to [A] and boosts the logit of [C]. A set of skip-trigrams that share the same [A] but require different outputs depending on [B] is OV-incoherent, since the output depends on the token attended from.
Claim 3: A single head cannot implement multiple OV-incoherent circuits. Recall from A Mathematical Framework that an attention head can be decomposed into the OV circuit and the QK circuit, which operate independently. Within each head, the OV circuit is solely responsible for mapping linear directions in the input to linear directions in the output, and it sees only the token attended to. Since it does not see the token attended from, its output must be a fixed function of the attended-to token, so it cannot vary with the token attended from.
Claim 4: Models compute many OV-incoherent circuits simultaneously in superposition. If the ground-truth data is best explained by a large number of OV-incoherent circuits, then models will approximate having these circuits by placing them in superposition across their limited number of attention heads.
Attention Transcoders
An attention transcoder (ATC) is described as follows:
An ATC takes the input to a specific attention block and attempts to reconstruct that block’s output
An ATC is simply a standard multi-head attention module, except that it has many more attention heads.
An ATC is regularised during training so that only a small number of heads are active on any given input.
I’ve left this intentionally vague at the moment as I’m uncertain how exactly to do this.
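For concreteness, here is one possible way to instantiate this in PyTorch; the head count, the per-head-norm sparsity penalty, and all hyperparameters are assumptions for illustration, not a settled design.
```python
import torch
import torch.nn as nn

class AttentionTranscoder(nn.Module):
    """A standard multi-head attention module with many more heads than the
    block it is trained to imitate. Hypothetical design, not from any paper."""

    def __init__(self, d_model: int, n_heads: int = 256, d_head: int = 16):
        super().__init__()
        scale = d_model ** -0.5
        self.d_head = d_head
        self.W_Q = nn.Parameter(scale * torch.randn(n_heads, d_model, d_head))
        self.W_K = nn.Parameter(scale * torch.randn(n_heads, d_model, d_head))
        self.W_V = nn.Parameter(scale * torch.randn(n_heads, d_model, d_head))
        self.W_O = nn.Parameter(d_head ** -0.5 * torch.randn(n_heads, d_head, d_model))

    def forward(self, resid):  # resid: [batch, seq, d_model], the attention block's input
        q = torch.einsum("bsd,hde->bhse", resid, self.W_Q)
        k = torch.einsum("bsd,hde->bhse", resid, self.W_K)
        v = torch.einsum("bsd,hde->bhse", resid, self.W_V)
        scores = q @ k.transpose(-1, -2) / self.d_head ** 0.5
        causal = torch.triu(torch.ones_like(scores[0, 0]), diagonal=1).bool()
        pattern = scores.masked_fill(causal, -1e9).softmax(-1)
        # Per-head contributions to the reconstruction of the attention block's output.
        head_out = torch.einsum("bhse,hed->bhsd", pattern @ v, self.W_O)
        return head_out.sum(dim=1), head_out

def atc_loss(atc, resid_in, attn_out_true, sparsity_coeff=1e-3):
    recon, head_out = atc(resid_in)
    mse = ((recon - attn_out_true) ** 2).mean()
    # L1-style penalty on per-head output norms, so few heads are active per token.
    sparsity = head_out.norm(dim=-1).mean()
    return mse + sparsity_coeff * sparsity
```
An obvious alternative to the L1-style penalty would be a TopK constraint over head norms, mirroring TopK SAEs.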
Remark 5: The ATC architecture is the generalization of other successful SAE-like architectures to attention blocks.
Residual-stream SAEs simulate a model that has many more residual neurons.
MLP transcoders simulate a model that has many more hidden neurons in its MLP.
ATCs simulate a model that has many more attention heads.
Remark 6: Intervening on ATC heads. Since the ATC reconstructs the output of an attention block, ablations can be done by simply splicing the ATC into the model’s computational graph and intervening directly on individual head outputs.
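As a sketch of Remark 6 (reusing the hypothetical AttentionTranscoder above), the splicing and ablation could look like:
```python
def spliced_attn_output(atc, resid_in, ablate_heads=()):
    # Replace the real attention block's output with the ATC's reconstruction,
    # minus the contribution of any heads we want to ablate.
    recon, head_out = atc(resid_in)
    for h in ablate_heads:
        recon = recon - head_out[:, h]
    return recon  # substitute this for the attention output, e.g. via a forward hook
```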
Remark 7: Attributing ATC heads to ground-truth heads. In standard attention-out SAEs, it’s possible to directly compute the attribution of each head to an SAE feature. That seems impossible here because the ATC head outputs are not direct functions of the ground-truth heads. Nonetheless, if ATC heads seem highly interpretable and accurately reconstruct the real attention outputs, and specific predictions can be verified via interventions, it seems reasonable to conclude that they are a good explanation of how attention blocks are working.
Key uncertainties
Does AHS actually occur in language models? I think we do not have crisp examples at the moment.
Concrete experiments
The first and most obvious experiment is to try training an ATC and see if it works.
Scaling milestones: toy models, TinyStories, OpenWebText
Do we achieve better Pareto curves of reconstruction loss vs. L0 than standard attention-out SAEs?
Conditional on that succeeding, the next step would be to attempt to interpret individual heads in an ATC and determine whether they are interpretable.
It may be useful to compare to known examples of suspected AHS; however, direct comparison is difficult due to Remark 7 above.
[Draft][Note] On Singular Learning Theory
Relevant links
AXRP with Daniel Murfet on an SLT primer
Manifund grant proposal on DevInterp research agenda
Daniel Murfet’s post on “simple != short”
Timaeus blogpost on actionable research projects
DevInterp repository for estimating LLC
[Proposal] Do SAEs capture simplicial structure? Investigating SAE representations of known case studies
It’s an open question whether SAEs capture underlying properties of feature geometry. Fortunately, careful research has elucidated a few examples of nonlinear geometry already. It would be useful to think about whether SAEs recover these geometries.
Simplices in models. Work studying hierarchical structure in feature geometry finds that sets of things are often represented as simplices, which are a specific kind of regular polytope. Simplices are also the structure of belief state geometry.
The proposal here is: take a known example, e.g. a set of sibling concepts whose representations form a tetrahedron, identify the corresponding SAE latents, and then evaluate whether their geometry matches the ground truth.
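A minimal sketch of the check, assuming we have already matched a few SAE latents to a set of sibling concepts; `decoder_dirs` is a placeholder array of their decoder directions.
```python
import numpy as np

def simplex_check(decoder_dirs):
    # decoder_dirs: [k, d_model] decoder directions for k sibling concepts.
    centred = decoder_dirs - decoder_dirs.mean(axis=0)
    centred = centred / np.linalg.norm(centred, axis=1, keepdims=True)
    cos = centred @ centred.T
    off_diag = cos[~np.eye(len(cos), dtype=bool)]
    # For a regular (k-1)-simplex centred at the origin, all pairwise cosines
    # equal -1/(k-1), e.g. -1/3 for a tetrahedron of four siblings.
    return off_diag.mean(), off_diag.std()
```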
[Note] Is Superposition the reason for Polysemanticity? Lessons from “The Local Interaction Basis”
Superposition is currently the dominant hypothesis to explain polysemanticity in neural networks. However, how much better does it explain the data than alternative hypotheses?
Non-neuron aligned basis. The leading alternative, as asserted by Lawrence Chan here, is that there are not a very large number of underlying features; just that these features are not represented in a neuron-aligned way, so individual neurons appear to fire on multiple distinct features.
The Local Interaction Basis explores this idea in more depth. Starting from the premise that there is a linear and interpretable basis that is not overcomplete, they propose a method to recover such a basis, which works in toy models. However, empirical results in language models fail to demonstrate that the recovered basis is indeed more interpretable.
My conclusion from this is a big downwards update on the likelihood of the “non-neuron aligned basis” in realistic domains like natural language. The real world probably just is complex enough that there are tons of distinct features which represent reality.
You’ll enjoy reading What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes (link to the paper)
[Proposal] Is reasoning in natural language grokkable? Training models on language formulations of toy tasks.
Previous work on grokking finds that models can grok modular addition and tree search. However, these are not tasks formulated in natural language. Instead, the tokens correspond directly to true underlying abstract entities, such as numerical values or nodes in a graph. I question whether this representational simplicity is a key ingredient of grokking reasoning.
I have a prior that expressing concepts in natural language (as opposed to directly representing concepts as tokens) introduces an additional layer of complexity which makes grokking much more difficult.
The proposal here is to repeat the experiments with tasks that test equivalent reasoning skills, but which are formulated in natural language.
Modular addition can be formulated as “day of the week” math, as has been done previously
Tree search is more difficult to formulate, but might be phrasable as some kind of navigation instruction.
I’d expect that we could observe grokking, but that it might take a lot longer (and require larger models) when compared to the “direct concept tokenization”. Conditioned on this being true, it would be interesting to observe whether we recover the same kinds of circuits as demonstrated in prior work.
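For the “day of the week” formulation in the first bullet above, data generation is straightforward; here is a toy sketch (the prompt template and dataset size are arbitrary illustrative choices).
```python
import random

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def make_example(rng):
    a, b = rng.randrange(7), rng.randrange(1, 50)
    prompt = f"Today is {DAYS[a]}. What day of the week will it be in {b} days? Answer:"
    return prompt, " " + DAYS[(a + b) % 7]

rng = random.Random(0)
dataset = [make_example(rng) for _ in range(10_000)]
# To mirror standard grokking setups, one would hold out a fixed fraction of the
# (a, b mod 7) combinations entirely and watch for delayed generalisation on them.
```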
[Proposal] Are circuits universal? Investigating IOI across many GPT-2 small checkpoints
Universal features. Work such as the Platonic Representation Hypothesis suggests that sufficiently capable models converge to the same representations of the data. To me, this indicates that the underlying “entities” which make up reality are universally agreed upon by models.
Non-universal circuits. There are many different algorithms which could correctly solve the same problem. Prior work such as the clock and the pizza indicates that, even for very simple tasks, models can learn very different algorithms depending on the “attention rate”.
Circuit universality is a crux. If circuits are mostly model-specific rather than being universal, it makes the near-term impact of MI a lot lower, since finding a circuit in one model tells us very little about what a slightly different model is doing.
Concrete experiment: Evaluating the universality of IOI. Gurnee et al. train several GPT-2 small checkpoints from scratch. We know from prior work that GPT-2 small has an IOI circuit. What, if any, components of this turn out to be universal? Maybe we always observe induction heads. But do we always observe name-mover and S-inhibition heads? If so, are they always at the same layer? Etc. I think this experiment would inform us a lot about circuit universality.
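A rough sketch of how this could be measured with TransformerLens; the checkpoint names are placeholders for whichever from-scratch GPT-2 small runs are used, and single-prompt zero-ablation stands in for a fuller path-patching analysis.
```python
import torch
from transformer_lens import HookedTransformer, utils

PROMPT = "When Mary and John went to the store, John gave a drink to"
ANSWERS = (" Mary", " John")

def head_importance(model):
    tokens = model.to_tokens(PROMPT)
    io_tok, s_tok = (model.to_single_token(a) for a in ANSWERS)
    with torch.no_grad():
        base = model(tokens)[0, -1]
    base_diff = base[io_tok] - base[s_tok]
    scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            def ablate(z, hook, head=head):
                z[:, :, head, :] = 0.0  # zero-ablate this head's output
                return z
            with torch.no_grad():
                logits = model.run_with_hooks(
                    tokens, fwd_hooks=[(utils.get_act_name("z", layer), ablate)]
                )[0, -1]
            # Drop in the IOI logit difference when this head is removed.
            scores[layer, head] = base_diff - (logits[io_tok] - logits[s_tok])
    return scores

# Compare the importance maps across independently trained checkpoints
# ("gpt2-seed-a" etc. are placeholder names, not real model identifiers):
# for name in ["gpt2-seed-a", "gpt2-seed-b"]:
#     model = HookedTransformer.from_pretrained(name)
#     print(name, head_importance(model).flatten().topk(5))
```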