Thanks! I’m personally skeptical of ablating a separate direction per block; it feels less surgical than a single direction everywhere, and we show that a single direction works fine for Llama 3 8B and 70B.
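(For concreteness, by "a single direction everywhere" I just mean projecting the same unit vector out of the residual stream at every layer. A minimal sketch of that projection; the names `resid` and `refusal_dir` are illustrative, not from the paper:)

```python
import torch

def ablate_direction(resid: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component of `resid` along `refusal_dir`.

    resid: (..., d_model) residual-stream activations
    refusal_dir: (d_model,) the single direction used at every layer
    """
    refusal_dir = refusal_dir / refusal_dir.norm()
    # Subtract the projection onto the direction: x - (x . r) r
    return resid - (resid @ refusal_dir).unsqueeze(-1) * refusal_dir
```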
The TransformerLens library does not have a save feature :(
Note that you can just do torch.save(model.state_dict(), FILE_PATH) as with any PyTorch model.
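Rough sketch of what that looks like end to end, assuming you reload into a HookedTransformer built with the same architecture (the model name and file path here are placeholders):

```python
import torch
from transformer_lens import HookedTransformer

# Save the edited weights like any other PyTorch module.
model = HookedTransformer.from_pretrained("gpt2")  # placeholder model name
torch.save(model.state_dict(), "edited_model.pt")

# Later: rebuild a model with the same architecture and load the weights back.
model2 = HookedTransformer.from_pretrained("gpt2")
model2.load_state_dict(torch.load("edited_model.pt"))
```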
Idk. This shows that if you wanted to optimally get rid of refusal, you might want to do this. But in practice you want to balance removing refusal against not damaging the model, and many layers are probably just irrelevant for refusal. Though really this argues that we’re both wrong, and the most surgical intervention is deleting the direction from key layers only.
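If someone did want to try the key-layers-only version, TransformerLens hooks make it easy to restrict the ablation to a chosen set of layers. A rough sketch under stated assumptions: the model name, layer list, prompt, and direction below are all placeholders, not values from the paper.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")   # placeholder model
refusal_dir = torch.randn(model.cfg.d_model)         # placeholder direction
refusal_dir = refusal_dir / refusal_dir.norm()
key_layers = [5, 6, 7]                                # hypothetical "key" layers

def ablate_hook(resid, hook):
    # Project the direction out of the residual stream at this layer only.
    return resid - (resid @ refusal_dir).unsqueeze(-1) * refusal_dir

fwd_hooks = [
    (utils.get_act_name("resid_post", layer), ablate_hook) for layer in key_layers
]
tokens = model.to_tokens("How do I bake a cake?")
logits = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)
```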