Research Engineer at DeepMind, focused on mechanistic interpretability and large language models. Opinions are my own.
Tom Lieberum
It would be interesting to see if, once grokking had clearly started, you could just 100x the learning rate and speed up the convergence to zero validation loss by 100x.
I ran a quick-and-dirty experiment and it does in fact look like you can just crank up the learning rate at the point where some part of grokking happens to speed up convergence significantly. See the wandb report:
I set the LR to 5x the normal value (100x tanked the accuracy, 10x still works though). Of course you would want to anneal it after grokking was finished.
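For concreteness, a minimal sketch of the kind of schedule I mean (illustrative only: the toy model, random data, step numbers, and hyperparameters below are placeholders, not the setup from the wandb run):

```python
import torch
import torch.nn as nn

# Illustrative sketch only: toy model and random data stand in for the actual
# grokking setup; the step numbers and LR values are placeholders.
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
base_lr = 1e-3
boost_factor = 5              # 5x worked; 10x still worked; 100x tanked accuracy
grokking_start_step = 10_000  # placeholder: wherever grokking visibly starts

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=1.0)

for step in range(20_000):
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))  # dummy batch

    if step == grokking_start_step:
        # Crank up the LR once grokking has clearly started...
        for group in optimizer.param_groups:
            group["lr"] = base_lr * boost_factor
    # ...and anneal it back down once grokking has finished (not shown here).

    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```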
I’d also be interested in hearing which parts of Anthropic’s research output you think burn our serial time budget. If I understood the post correctly, then OP thinks that efforts like Transformer Circuits are mostly about accelerating parallelizable research.
Maybe OP thinks that
mechanistic interpretability has little value in terms of serial research
RLHF does not give us alignment (because it doesn’t generalize beyond the “sharp left turn” which OP thinks is likely to happen)
therefore, since most of Anthropic’s alignment-focused output doesn’t have much value in terms of serial research, and it does somewhat enhance present-day LLM capabilities/usability, it is net negative?
But I’m very much unsure whether OP really believes this—would love to hear him elaborate.
ETA: It could also be the case that OP was exclusively referring to the part of Anthropic that is about training LLMs efficiently as a prerequisite to studying those models?
I like the framing of perishable vs non-perishable knowledge and I like that the post is short and concise.
However, after reading this I’m left feeling “So what now?” and would appreciate some more actionable advice or tools of thought. What I got out of it so far is:
Things that have been around for longer are more likely to stay around longer (seems like a decent prior)
Keep tabs on a few major event categories and dump the rest of the news cycle (checks out—not sure how that would work as a categorical imperative, but seems like the right choice for an individual)
I think the concept can be applied pretty broadly. Some more ideas:
when learning about a new field, in general, go for textbooks rather than papers
if you use spaced repetition, regularly ask yourself whether the cards you are studying have passed their shelf life --> this can help reduce frustration/annoyance/boredom when reviewing cards
some skills have extremely long shelf-life and they seem to overlap with those that compound:
learning basic life admin skills
learning how to take care of your mental health (e.g. CBT methods)
learning how to learn
basic social skills
I’m sure there is much more here.
Seems like this could be circumvented relatively easily by freezing gametes now.
Interesting idea!
What do you think about the Superposition Hypothesis? If that were true, then at sufficient sparsity of features in the input there is no basis in which the network is “thinking”, meaning it would be impossible to find a rotation matrix that allows for a bijective mapping between neurons and features.
I would assume that the rotation matrix that enables local changes via the sparse Jacobian coincides with one which maximizes some notion of “neuron-feature-bijectiveness”. But as noted above that seems impossible if the SPH holds.
Oh yes, I’m aware that he expressed this view. That’s different, however, from it being objectively plausible (whatever that means). I have the feeling we’re talking past each other a bit. I’m not saying “no-one reputable thinks OpenAI is net-negative for the world”. I’m just pointing out that it’s not as clear-cut as your initial comment made it seem to me.
I want to second your first point. Texting frequently with significant others lets me feel like I’m part of their life and vice versa, which a weekly call does not accomplish, partly because it is weekly and partly because I am pretty averse to calls.
In one relationship I had, this led to significant misery on my part because my partner was pretty strict on their phone usage, batching messages for the mornings and evenings. For my current primary relationship, I’m convinced that the frequent texting is what kept it alive while being long-distance.
To reconcile the two viewpoints, I think it is still true that superficial relationships via social media likes or retweets are not worth that much if they are all there is to the relationship. But direct text messages are a significant improvement on that.
Re your blog post:
Maybe that’s me being introverted, but there are probably significant differences in whether people feel comfortable with or prefer texting versus calling. For me, the instantaneousness of calling makes it much more stressful, and I do have a problem with people generalizing either way that one mode of interacting over distance is superior in general. I do concede the point that calling is of course much higher bandwidth, but it also requires more time commitment and coordination.
I can only speculate, but the main researchers are now working on other stuff, e.g. at Anthropic. As to why they switched, I don’t know. Maybe they were not making progress fast enough, or Anthropic’s mission seemed more important?
However, at least Chris Olah believes this is still a tractable and important direction; see his recent RFP for Open Phil.
K-composition as a concept was introduced by Anthropic in the initial Transformer Circuits post. In general, the output of an attention head in an earlier layer can influence the query, key, or value computation of an attention head in a later layer.
K-composition refers to the case in which the key computation is influenced. In a model without nonlinearities or layernorms you can measure this simply by looking at how strongly the output matrix of head 1 and the key matrix of head 2 compose (or more precisely, by looking at the Frobenius norm of their product relative to the product of the individual norms). I also tried to write a bit about it here.
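As a rough sketch of that score (random matrices stand in for head 1’s output matrix and head 2’s key matrix; the shapes and conventions here are illustrative, not the exact ones from the Transformer Circuits post):

```python
import numpy as np

def frob(M: np.ndarray) -> float:
    """Frobenius norm."""
    return float(np.sqrt((M ** 2).sum()))

def composition_score(W_out_1: np.ndarray, W_key_2: np.ndarray) -> float:
    """Norm of the composed map relative to the product of the individual norms."""
    return frob(W_key_2 @ W_out_1) / (frob(W_key_2) * frob(W_out_1))

# Random matrices as a stand-in for the two heads' weights; a genuinely
# composing pair of heads would score well above this random baseline.
d_model = 64
rng = np.random.default_rng(0)
W_out_1 = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W_key_2 = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
print(composition_score(W_out_1, W_key_2))
```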
Nice work, thanks for sharing! I really like the fact that the neurons seem to upweight different versions of the same token (“_an”, “_An”, “an”, “An”, etc.). It’s curious because the semantics of these tokens can be quite different (compared to the “though”, “tho”, “however” neuron).
Have you looked at all into what parts of the model feed into (some of) the cleanly associated neurons? It was probably out of scope for this but just curious.
Thanks for elaborating! Insofar as your assessment is based on in-person interactions, I can’t really comment, since I haven’t spoken much with people from Anthropic.
I think there are degrees to believing this meme you refer to, in the sense of “we need an AI of capability level X to learn meaningful things”. And I would guess that many people at Anthropic do believe this weaker version—it’s their stated purpose after all. And for some values of X this statement is clearly true, e.g. learned filters of shallow CNNs trained on MNIST are not interpretable, whereas the filters of deep Inception-style CNNs trained on ImageNet are (mostly) interpretable.
One could argue that parts of interpretability do need to happen in a serial manner, e.g. finding out the best way to interpret transformers at all, the recent SoLU finding, or just generally building up knowledge on how to best formalize or go about this whole interpretability business. If that is true, and furthermore interpretability turns out to be an important component in promising alignment proposals, then the question is mostly about which level of X gives you the most serial interpretability progress per unit of other serial budget burned.
I don’t know whether people at Anthropic believe the above steps or have thought about it in these ways at all, but if they did, this could possibly explain the difference in policies between you and them?
I disagree with your intuition that we should not expect networks at irreducible loss to be in superposition.
The reason I brought this up is that there are, IMO, strong first-principles reasons why SPH should be correct. Say there are two features which each have an independent probability of 0.05 of being present in a given data point; then it would be wasteful to allocate a full neuron to each of these features. The probability of both features being present at the same time is a mere 0.0025. If the superposition is implemented well, you get basically two features for the price of one with an error rate of 0.25%. So if there is even a slight pressure towards compression, e.g. by having fewer available neurons than features, then superposition should be favored by the network.
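As a tiny simulation of that toy scenario (the sign-based packing below is just one illustrative way to squeeze two sparse binary features into one neuron, not a claim about how real networks implement superposition):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1_000_000, 0.05

a = rng.random(n) < p  # feature 1 present?
b = rng.random(n) < p  # feature 2 present?

# One neuron carries both features with opposite signs.
neuron = a.astype(float) - b.astype(float)

a_hat = neuron > 0  # decode feature 1
b_hat = neuron < 0  # decode feature 2

# Decoding only fails on data points where both features fire at once.
errors = (a_hat != a) | (b_hat != b)
print(errors.mean())  # ~0.0025, i.e. ~0.25% of data points
```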
Now does this toy scenario map to reality? I think it does, and in some sense it is even more favorable to SPH since often the presence of features will be anti-correlated.
Thanks for your reply! I think I basically agree with all of your points. I feel a lot of frustration around the fact that we don’t seem to have adequate infohazard policies to address this. It seems like a fundamental trade-off between security and openness/earnestness of discussion does exist though.
It could be the case that this community is not the correct place to enforce these rules, as there does still exist a substantial gap between “this thing could work” and “we have a working system”. This is doubly true in DL, where implementation details matter a great deal.
This assumes a fixed scaling law. One possible way of improving oneself could be to design a better architecture with a better scaling exponent.
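As a toy illustration of why the exponent matters (the numbers below are made up, not any real scaling law):

```python
# Made-up scaling law L(C) = a * C**(-alpha): a better exponent alpha
# eventually beats any constant-factor improvement in a as compute C grows.
def loss(compute, a=10.0, alpha=0.05):
    return a * compute ** (-alpha)

for compute in (1e3, 1e6, 1e9, 1e12):
    baseline = loss(compute)                     # fixed scaling law
    better_constant = loss(compute, a=5.0)       # 2x better constant
    better_exponent = loss(compute, alpha=0.10)  # better scaling exponent
    print(f"C={compute:.0e}  base={baseline:.2f}  "
          f"2x-const={better_constant:.2f}  better-exp={better_exponent:.2f}")
```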
I can’t speak to the option for remote work, but as a counterpoint, it seems very straightforward to get a UK visa for you and your spouse/children (at least straightforward relative to the US). If you want to know more, the relevant visa to google is the Skilled Worker / Tier 2 visa.
ETA: Of course, there are still legitimate reasons for not wanting to move. Just wanted to point out that the legal barrier is lower than you might think.
There is definitely something out there; I just can’t recall the name. A keyword you might want to look for is “disentangled representations”.
One start would be the beta-VAE paper https://openreview.net/forum?id=Sy2fzU9gl
I’d like to propose not talking publicly about ways to “fix” this issue. Insofar as these results spell trouble for scaling up LLMs, this is a good thing!
Infohazard (meta-)discussions are thorny by their very nature and I don’t want to discourage discussions around these results in general, e.g. how to interpret them or whether the analysis has merits.