Tom Lieberum
Research Engineer at DeepMind, focused on mechanistic interpretability and large language models. Opinions are my own.
Cool work!
Did you run an ablation on the auxiliary losses for Q and K? How important was that for stabilizing training?
Did you compare to training separate Q and K SAEs via typical reconstruction loss? Would be cool to see a side-by-side comparison, i.e. how large the benefit of this scheme is.
During parts of the project I had the hunch that some letter-specialized heads are more like proto-correct-letter heads (see paper for details), based on their attention pattern. We never investigated this, and I think it could go either way. The “it becomes cleaner” intuition basically relies on things like the grokking work and other work showing representations being refined late during training (Thisby et al., I believe, and maybe other work). However, some of this would probably require randomising e.g. the labels the model sees during training; see e.g. Cammarata et al., Understanding RL Vision: if you only ever see the second choice labeled with B, you don’t have an incentive to distinguish between “look for B” and “look for the second choice”. Lastly, even in the limit of infinite training data you still have limited model capacity, so you will likely use a distributed representation in some way, but maybe you could at least get human-interpretable features even if they are distributed.
Yup! I think that’d be quite interesting. Is there any work on characterizing the embedding space of GPT2?
Nice work, thanks for sharing! I really like the fact that the neurons seem to upweight different versions of the same token (“_an”, “_An”, “an”, “An”, etc.). It’s curious because the semantics of these tokens can be quite different (compared to the “though”/“tho”/“however” neuron).
Have you looked at all into what parts of the model feed into (some of) the cleanly associated neurons? It was probably out of scope for this but I’m just curious.
(The quote refers to the usage of binary attention patterns in general, so I’m not sure why you’re quoting it)
I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0 and 1 entries.
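Purely to illustrate that saturation point (the logit values are the hypothetical ones from the discussion):

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability
    e = np.exp(x - x.max())
    return e / e.sum()

# logits {0, 1000, 2000}: the softmax saturates to a (near-)binary
# pattern, putting essentially all weight on the largest logit
w = softmax(np.array([0.0, 1000.0, 2000.0]))
print(np.round(w, 6))  # ~[0, 0, 1]

# two tied maxima split the weight evenly, still 0/1-like entries
w2 = softmax(np.array([0.0, 1000.0, 1000.0]))
print(np.round(w2, 6))  # ~[0, 0.5, 0.5]
```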
IIUC, the statement in the tracr paper is not that you can’t have attention patterns which implement this logical operation, but that you can’t have a single head implementing this attention pattern (without exponential blowup)
I don’t think that’s right. IIUC this is a logical AND, so the values would be in {0, 1} (as required, since tracr operates with Boolean attention). For a more extensive discussion of the original problem see appendix C.
Meta-q: Are you primarily asking for better assumptions or that they be made more explicit?
I would be most interested in an explanation for the assumption that is grounded in the distribution you are trying to approximate. It’s hard to tell which parts of the assumptions are bad without knowing (which properties of) the distribution it’s trying to approximate or why you think that the true distribution has property XYZ.
Re MLPs: I agree that we ideally want something general, but it looks like your post is evidence that something about the assumptions is wrong and doesn’t transfer to MLPs, breaking the method. So we probably want to understand better which of the assumptions don’t hold there. If you have a toy model that better represents the true distribution, then you can confidently iterate on methods via the toy model.
Undertrained autoencoders
I was actually thinking of the LM when writing this but yeah the autoencoder itself might also be a problem. Great to hear you’re thinking about that.
(ETA to the OC: the antipodal pairs wouldn’t happen here due to the way you set up the data generation, but if you were to learn the features as in the toy models post, you’d see that. I’m now less sure about this specific argument)
Thanks for posting this. Some comments/questions we had after briefly discussing it in our team:
We would have loved to see more motivation for why you are making the assumptions you are making when generating the toy data.
Relatedly, it would be great to see an analysis of the distribution of the MLP activations. This could give you some info on where your assumptions in the toy model fall short.
As Charlie Steiner pointed out, you are using a very favorable ratio in the toy model, i.e. of the number of ground-truth features to the encoding dimension. I would expect you will mostly get antipodal pairs in that setup, rather than strongly interfering superposition. This may contribute significantly to the mismatch. (ETA: the antipodal pairs wouldn’t happen here due to the way you set up the data generation, but if you were to learn the features as in the toy models post, you’d see that. I’m now less sure about this specific argument)
For the MMCS plots, we would be interested in seeing the distribution/histogram of MCS values. Especially for ~middling MCS values, where it’s not clear if all features are somewhat represented or some are a lot and some not at all.
While we don’t think this has a big impact compared to the other potential mismatches between toy model and the MLP, we do wonder whether the model has the parameters/data/training steps it needs to develop superposition of clean features.
e.g. in the toy models report, Elhage et al. reported phase transitions of superposition over the course of training.
Yeah I agree with that. But there is also a sense in which some (many?) features will be inherently sparse.
A token is either the first one of a multi-token word or it isn’t.
A word is either a noun, a verb, or something else.
A word belongs to language LANG and not to any other language/has other meanings in those languages.
An image can only contain so many objects, which can only contain so many sub-aspects.
I don’t know what it would mean to go “out of distribution” in any of these cases.
This means that any network that has an incentive to conserve parameter usage (however we want to define that), might want to use superposition.
Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.
I’m not aware of any work that identifies superposition in exactly this way in NNs of practical use.
As Spencer notes, you can verify that it does appear in certain toy settings, though. Anthropic notes in their SoLU paper that they view their results as evidence for the SPH in LLMs. IMO the key part of the evidence here is that using a SoLU destroys performance but adding another LayerNorm afterwards solves that issue. The SoLU selects strongly against superposition and the LayerNorm makes it possible again, which is some evidence that the way the LLM got to its performance was via superposition. (ETA: Of course there could be some other mediating factor, too.)
This example is meant to only illustrate how one could achieve this encoding. It’s not how an actual autoencoder would work. An actual NN might not even use superposition for the data I described and it might need some other setup to elicit this behavior.
But it sounded to me like you think superposition is nothing but the network being confused, whereas I think it can be the correct way to still be able to reconstruct the features to a reasonable degree.
Ah, I might have misunderstood your original point then, sorry!
I’m not sure what you mean by “basis” then. How strictly are you using this term?
I imagine you are basically going down the “features as elementary unit” route proposed in Circuits (although you might not be predisposed to assume features are the elementary unit). Finding the set of features used by the network and figuring out how it’s using them in its computations does not translate 1-to-1 to “find the basis the network is thinking in” in my mind.
Possibly the source of our disagreement here is that you are imagining the neuron ought to be strictly monotonically increasing in activation relative to the dog-headedness of the image?
If we abandon that assumption then it is relatively clear how to encode two numbers in 1D. Let’s assume we observe two numbers x_1, x_2. With probability p, x_1 is drawn from N(0, sigma^2) and x_2 = 0, and with probability 1 − p, x_2 is drawn from N(0, sigma^2) and x_1 = 0.
We now want to encode these two events in some third variable z, such that we can perfectly reconstruct x_1 and x_2 with probability ≈1.
I put the solution behind a spoiler for anyone wanting to try it on their own.
Choose some veeeery large constant c (much greater than the variance of the normal distribution of the features). For the first event, set z = x_1 − c. For the second event, set z = x_2 + c.
The decoding works as follows:
If z is negative, then with probability ≈1 we are in the first scenario and we can set x_1 = z + c and x_2 = 0. Vice versa if z is positive.
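A runnable sketch of this 1D encoding, with an offset constant c much larger than the feature scale (the specific values of c and the feature distribution are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 1e6  # offset much larger than the features' typical magnitude

def encode(x1, x2, first_event):
    # first event: only x1 present -> shift z far into the negatives
    # second event: only x2 present -> shift z far into the positives
    return x1 - c if first_event else x2 + c

def decode(z):
    if z < 0:
        return z + c, 0.0  # first event: recover x1, x2 was absent
    return 0.0, z - c      # second event: recover x2

# round-trip check over random events
for _ in range(1000):
    first = rng.random() < 0.5
    x1 = rng.normal() if first else 0.0
    x2 = 0.0 if first else rng.normal()
    r1, r2 = decode(encode(x1, x2, first))
    assert abs(r1 - x1) < 1e-6 and abs(r2 - x2) < 1e-6
print("both numbers recovered from the single variable z")
```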
I’d say that there is a basis the network is thinking in in this hypothetical, it would just so happens to not match the human abstraction set for thinking about the problem in question.
Well, yes, but the number of basis elements that make that basis human-interpretable could theoretically be exponential in the number of neurons.
If due to superposition, it proves advantageous to the AI to have a single feature that kind of does dog-head-detection and kind of does car-front-detection, because dog heads and car fronts don’t show up in the training data at the same time, so it can still get perfect loss through a properly constructed dual-purpose feature like this, it’d mean that to the AI, dog heads and car fronts are “the same thing”.
I don’t think that’s true. Imagine a toy scenario of two features that run through a 1D non-linear bottleneck before being reconstructed. Assuming that with some weight settings you can get superposition, the model is able to reconstruct the features ≈perfectly as long as they don’t appear together. That means the model can still differentiate the two features, they are different in the model’s ontology.
As AIs get more capable and general, I’d expect the concepts/features they use to start more closely matching the ones humans use in many domains.
My intuition disagrees here too. Whether we will observe superposition is a function of the number of “useful” features in the data, the sparsity of those features, and something like the bottleneck size. It’s possible that the bottleneck size will never be enough to compensate for the number of features. Also, it seems reasonable to me that ≈all of reality is extremely sparse in features, which presumably favors superposition.
I agree that all is not lost wrt sparsity, and if the SPH turns out to be true it might help us disentangle the superimposed features to better understand what is going on. You could think of constructing an “expanded” view of a neural network. The expanded view would allocate one neuron per feature and thus have sparse activations for any given data point, and it would be easier to reason about. That seems impractical in reality, though, since the cost of constructing this view might in theory be exponential: there are exponentially many “almost orthogonal” vectors for a given vector-space dimension, as a function of the dimension.
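That abundance of almost-orthogonal directions is easy to see empirically; a small sketch showing that far more than d random directions in d dimensions stay nearly orthogonal pairwise (dimension and count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 2000  # n directions in a d-dimensional space, n >> d

# random unit vectors: pairwise cosine similarities concentrate
# around 0 at scale ~1/sqrt(d), i.e. they are almost orthogonal
v = rng.normal(size=(n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)
cos = v @ v.T
off_diag = np.abs(cos[~np.eye(n, dtype=bool)])
print(f"{n} vectors in {d}d, max |cos sim|: {off_diag.max():.3f}")
print(f"mean |cos sim|: {off_diag.mean():.3f}")
```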
I think my original comment was meant more as a caution against the specific approach of “find an interpretable basis in activation space”, since that might be futile, rather than a caution against all attempts at finding a sparse representation of the computations that are happening within the network.
I don’t think there is anything on that front other than the paragraphs in the SoLU paper. I alluded to a possible experiment for this on Twitter in response to that paper but haven’t had the time to try it out myself: You could take a tiny autoencoder to reconstruct some artificially generated data where you vary attributes such as sparsity, ratio of input dimensions vs. bottleneck dimensions, etc. You could then look at the weight matrices of the autoencoder to figure out how it’s embedding the features in the bottleneck and which settings lead to superposition, if any.
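A rough sketch of the kind of experiment I mean, assuming a tiny tied-weight autoencoder with a ReLU output trained by plain gradient descent; every architecture and hyperparameter choice here is a placeholder, not a tested recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, d_hid = 5, 2     # more ground-truth features than hidden dims
sparsity = 0.05          # each feature independently active w.p. 0.05

def sample(n):
    # sparse features: uniform in (0, 1) when active, else 0
    mask = rng.random((n, n_feat)) < sparsity
    return mask * rng.random((n, n_feat))

def mse(x, W, b):
    # tied weights: encode with W, decode with W.T, ReLU output
    x_hat = np.maximum(x @ W @ W.T + b, 0.0)
    return ((x_hat - x) ** 2).sum(axis=1).mean()

W = rng.normal(0, 0.1, size=(n_feat, d_hid))
b = np.zeros(n_feat)
x_eval = sample(4096)
loss_before = mse(x_eval, W, b)

lr = 0.02
for _ in range(3000):
    x = sample(256)
    h = x @ W
    pre = h @ W.T + b
    err = np.maximum(pre, 0.0) - x
    g_pre = 2.0 * err * (pre > 0) / len(x)   # dL/dpre
    W -= lr * (x.T @ (g_pre @ W) + g_pre.T @ h)
    b -= lr * g_pre.sum(axis=0)

loss_after = mse(x_eval, W, b)
# inspect W W^T: large off-diagonal entries (e.g. near -1) would
# indicate superposed / antipodal feature pairs in the bottleneck
print(np.round(W @ W.T, 2))
print(f"eval MSE: {loss_before:.4f} -> {loss_after:.4f}")
```

Varying `sparsity` and the `n_feat`/`d_hid` ratio and re-inspecting `W @ W.T` would be the cheap way to map out which settings lead to superposition, if any.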
I disagree with your intuition that we should not expect networks at irreducible loss to be in superposition.
The reason I brought this up is that there are, IMO, strong first-principles reasons why the SPH should be correct. Say there are two features which each have an independent probability of 0.05 of being present in a given data point; then it would be wasteful to allocate a full neuron to each of these features. The probability of both features being present at the same time is a mere 0.0025. If the superposition is implemented well you get basically two features for the price of one, with an error rate of 0.25%. So if there is even a slight pressure towards compression, e.g. by having fewer available neurons than features, then superposition should be favored by the network.
Now does this toy scenario map to reality? I think it does, and in some sense it is even more favorable to SPH since often the presence of features will be anti-correlated.
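The arithmetic for the two-feature case, with a quick Monte Carlo sanity check (the sample count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.05                 # independent activation probability per feature
n = 1_000_000

a = rng.random(n) < p    # feature 1 active?
b = rng.random(n) < p    # feature 2 active?
both = (a & b).mean()    # fraction of points where the two collide

print(f"analytic collision rate: {p * p:.4f}")    # 0.0025, i.e. 0.25%
print(f"empirical collision rate: {both:.4f}")
```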
We use 1024, though often article snippets are shorter than that so they are separated by BOS.