Neel Nanda

Karma: 8,793

Neel Nanda 24 Nov 2024 14:43 UTC
4 points
2
on: Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders
Cool project! Thanks for doing it and sharing, great to see more models with SAEs

interpretability research on proprietary LLMs that was quite popular this year and great research papers by Anthropic[1][2], OpenAI[3][4] and Google Deepmind

I run the Google DeepMind team, and just wanted to clarify that our work was not on proprietary closed weight models, but instead on Gemma 2, as were our open weight SAEs. Gemma 2 is about as open as Llama imo. We try to use open models wherever possible for these general reasons of good scientific practice, ease of replicability, etc. Though we couldn’t open source the data, and didn’t go to the effort of open sourcing the code, so I don’t think they can be considered true open source. OpenAI did most of their work on GPT-2, and only did their large scale experiment on GPT-4 I believe. All Anthropic work I’m aware of is on proprietary models, alas.

Neel Nanda 18 Nov 2024 22:00 UTC
LW: 4 AF: 4
0
AF
in reply to: Wei Shi’s comment on: Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
It’s essentially training an SAE on the concatenation of the residual stream from the base model and the chat model. So, for each prompt, you run it through the base model to get a residual stream vector v_b, through the chat model to get a residual stream vector v_c, and then concatenate these to get a vector twice as long, and train an SAE on this (with some minor additional details that I’m not getting into)
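A minimal sketch of the input construction described above, in pure Python. Everything here is an illustrative assumption rather than the replication's actual code: the toy dimensions, the random untrained weights, the plain ReLU SAE architecture, and the helper names (`concat_residuals`, `sae_forward`). It shows only the concatenation and one forward pass, with no training loop, and it leaves out the "minor additional details" mentioned in the comment.

```python
import random

def concat_residuals(v_b, v_c):
    """Join the base-model and chat-model residual vectors into one SAE input."""
    return list(v_b) + list(v_c)

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Plain ReLU SAE (an assumed architecture): encode, then linearly decode."""
    latents = [max(0.0, a + b) for a, b in zip(matvec(W_enc, x), b_enc)]
    recon = [a + b for a, b in zip(matvec(W_dec, latents), b_dec)]
    return latents, recon

rng = random.Random(0)
d_model, n_latents = 4, 16  # toy sizes, chosen only for illustration

# Stand-ins for the residual stream of each model on the same prompt.
v_b = [rng.gauss(0, 1) for _ in range(d_model)]  # base model
v_c = [rng.gauss(0, 1) for _ in range(d_model)]  # chat model

x = concat_residuals(v_b, v_c)  # vector of length 2 * d_model

# Random, untrained weights: the SAE acts on the doubled dimension.
W_enc = [[rng.gauss(0, 0.1) for _ in range(2 * d_model)] for _ in range(n_latents)]
b_enc = [0.0] * n_latents
W_dec = [[rng.gauss(0, 0.1) for _ in range(n_latents)] for _ in range(2 * d_model)]
b_dec = [0.0] * (2 * d_model)

latents, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)

# Reconstruction error over both halves; training would minimise this plus a sparsity term.
mse = sum((r - xi) ** 2 for r, xi in zip(recon, x)) / len(x)
```

Because each latent's decoder row spans both halves of the concatenated vector, a single latent can write differently into the base-model and chat-model halves.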

Neel Nanda 21 Oct 2024 14:40 UTC
LW: 3 AF: 2
0
AF
in reply to: Roger Dearnaley’s comment on: Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)
This is somewhat similar to the approach of the ROME paper, which has been shown to not actually do fact editing, just inserting louder facts that drown out the old ones and maybe suppressing the old ones.

In general, the problem with optimising model behavior as a localisation technique is that you can’t distinguish between something that truly edits the fact, and something which adds a new fact in another layer that cancels out the first fact and adds something new.

Neel Nanda 14 Oct 2024 20:07 UTC
3 points
0
in reply to: Shayne O'Neill’s comment on: How I got 3.2 million Youtube views without making a single video
Agreed, chance of success when cold emailing busy people is low, and spamming them is bad. And there are alternate approaches that may work better, depending on the person and their setup—some Youtubers don’t have a manager or employees, some do. I also think being able to begin an email with “Hi, I run the DeepMind mechanistic interpretability team” was quite helpful here.

Neel Nanda 11 Oct 2024 18:20 UTC
LW: 2 AF: 2
0
AF
in reply to: Mark Xu’s comment on: Mark Xu’s Shortform
The high level claim seems pretty true to me. Come to the GDM alignment team, it’s great over here! It seems quite important to me that all AGI labs have good safety teams

Thanks for writing the post!

Neel Nanda 4 Oct 2024 20:19 UTC
11 points
0
in reply to: habryka’s comment on: MichaelDickens’s Shortform
Huh, are there examples of right leaning stuff they stopped funding? That’s new to me

Neel Nanda 24 Sep 2024 13:39 UTC
7 points
2
in reply to: niplav’s comment on: Nathan Young’s Shortform
+1. Concretely this means converting every probability p into odds p/(1-p), and then taking the geometric mean of those (you can then convert the result back to a probability)

Intuition pump: Person A says 0.1 and Person B says 0.9. This is symmetric, if we instead study the negation, they swap places, so any reasonable aggregation should give 0.5

Geometric mean does not, instead you get 0.3

Arithmetic gets 0.5, but is bad for the other reasons you noted

Geometric mean of odds is sqrt(1/9 * 9) = 1, which maps to a probability of 0.5, while also eg treating low probabilities fairly
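The comparison above can be checked with a short sketch. The function names are made up for illustration, and the inputs are assumed to lie strictly between 0 and 1 (a forecast of exactly 0 or 1 would break the logarithms).

```python
import math

def arithmetic_mean(probs):
    """Simple average of the probabilities."""
    return sum(probs) / len(probs)

def geometric_mean(probs):
    """Geometric mean of the probabilities themselves."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

def geometric_mean_of_odds(probs):
    """Convert to odds p/(1-p), take the geometric mean, convert back."""
    odds = [p / (1 - p) for p in probs]
    pooled = math.exp(sum(math.log(o) for o in odds) / len(odds))
    return pooled / (1 + pooled)

forecasts = [0.1, 0.9]
print(round(arithmetic_mean(forecasts), 3))         # 0.5
print(round(geometric_mean(forecasts), 3))          # 0.3
print(round(geometric_mean_of_odds(forecasts), 3))  # 0.5
```

The symmetry argument holds in general: negating every forecast inverts every odds ratio, so the pooled odds invert too and the pooled probability becomes one minus its previous value.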

Neel Nanda 22 Sep 2024 9:47 UTC
LW: 7 AF: 5
0
AF
in reply to: Rohin Shah’s comment on: Showing SAE Latents Are Not Atomic Using Meta-SAEs
Interesting thought! I expect there’s systematic differences, though it’s not quite obvious how. Your example seems pretty plausible to me. Meta SAEs are also more incentivised to learn features which tend to split a lot, I think, as then they’re useful for predicting many more latents. Though ones that don’t split may be useful as they entirely explain a latent that’s otherwise hard to explain.

Anyway, we haven’t checked yet, but I expect many of the results in this post would look similar for eg sparse linear regression over a smaller SAE’s decoder. Re why meta SAEs are interesting at all, they’re much cheaper to train than a smaller SAE, and BatchTopK gives you more control over the L0 than you could easily get with sparse linear regression, which are some mild advantages, but you may have a small SAE lying around anyway. I see the interesting point of this post more as “SAE latents are not atomic, as shown by one method, but probably other methods would work well too”

Neel Nanda 18 Sep 2024 22:02 UTC
4 points
0
in reply to: Nathan Helm-Burger’s comment on: quetzal_rainbow’s Shortform
What’s wrong with twitter as an archival source? You can’t edit tweets (technically you can edit top level tweets for up to an hour, but this creates a new URL and old links still show the original version). Seems fine to just aesthetically dislike twitter though

Neel Nanda 13 Sep 2024 17:38 UTC
11 points
3
on: Why I’m bearish on mechanistic interpretability: the shards are not in the network
To me, this model predicts that sparse autoencoders should not find abstract features, because those are shards, and should not be localisable to a direction in activation space on a single token. Do you agree that this is implied?

If so, how do you square that with eg all the abstract features Anthropic found in Sonnet 3?

Neel Nanda 13 Sep 2024 11:38 UTC
3 points
0
in reply to: Lawrence Phillips’s comment on: Contra papers claiming superhuman AI forecasting
Thanks for making the correction!

Neel Nanda 13 Sep 2024 10:26 UTC
7 points
3
in reply to: Joseph Miller’s comment on: OpenAI o1
I expect there are lots of new forms of capabilities elicitation for this kind of model, which their standard framework may not have captured, and which require more time to iterate on

Neel Nanda 12 Sep 2024 21:41 UTC
15 points
3
on: Contra papers claiming superhuman AI forecasting
Thanks for the post!

sample five random users’ forecasts, score them, and then average

Are you sure this is how their bot works? I read this more as “sample five things from the LLM, and average those predictions”. For Metaculus, the crowd is just given to you, right, so it seems crazy to sample users?

Neel Nanda 8 Sep 2024 10:11 UTC
6 points
2
in reply to: Zach Stein-Perlman’s comment on: Zach Stein-Perlman’s Shortform
Yeah, fair point, disagreement retracted

Neel Nanda 7 Sep 2024 23:36 UTC
6 points
4
in reply to: RobertM’s comment on: Pay Risk Evaluators in Cash, Not Equity
I think this is important to define anyway! (and likely pretty obvious). This would create a lot more friction for someone to take on such a role though, or move out

Neel Nanda 7 Sep 2024 20:44 UTC
24 points
25
in reply to: RobertM’s comment on: Pay Risk Evaluators in Cash, Not Equity
But only a small fraction work on evaluations, so the increased cost is much smaller than you make out

Neel Nanda 7 Sep 2024 3:25 UTC
10 points
6
on: Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream
Cool work! This is the outcome I expected, but I’m glad someone actually went and did it

Neel Nanda 6 Sep 2024 18:46 UTC
4 points
1
in reply to: Ben Pace’s comment on: How I got 3.2 million Youtube views without making a single video
Yeah, if I made an introduction it would ruin the spirit of it!

Neel Nanda 6 Sep 2024 10:40 UTC
2 points
0
in reply to: keith_wynroe’s comment on: Lucius Bushnaq’s Shortform
I don’t see important differences between that and ce loss delta in the context Lucius is describing

Neel Nanda 6 Sep 2024 10:40 UTC
4 points
1
in reply to: Lucius Bushnaq’s comment on: Lucius Bushnaq’s Shortform
This seems true to me, though finding the right scaling curve for models is typically quite hard so the conversion to effective compute is difficult. I typically use CE loss change, not loss recovered. I think we just don’t know how to evaluate SAE quality.

My personal guess is that SAEs can be a useful interpretability tool despite making a big difference in effective compute, and we should think more in terms of how useful they are for downstream tasks. But I agree this is a real phenomenon, that is easy to overlook, and is bad.