Neel Nanda

Karma: 8,787

Evolutionary prompt optimization for SAE feature visualization

14 Nov 2024 13:06 UTC
16 points
0 comments, 9 min read

SAEs are highly dataset dependent: a case study on the refusal direction

7 Nov 2024 5:22 UTC
62 points
4 comments, 14 min read

SAE Probing: What is it good for? Absolutely something!

1 Nov 2024 19:23 UTC
31 points
0 comments, 11 min read

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

27 Oct 2024 18:46 UTC
38 points
4 comments, 5 min read

SAE features for refusal and sycophancy steering vectors

12 Oct 2024 14:54 UTC
26 points
4 comments, 7 min read

Base LLMs refuse too

29 Sep 2024 16:04 UTC
60 points
20 comments, 10 min read

Showing SAE Latents Are Not Atomic Using Meta-SAEs

24 Aug 2024 0:56 UTC
60 points
9 comments, 20 min read

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

17 Aug 2024 1:16 UTC
53 points
0 comments, 5 min read

Extracting SAE task features for in-context learning

12 Aug 2024 20:34 UTC
31 points
1 comment, 9 min read

Self-explaining SAE features

5 Aug 2024 22:20 UTC
60 points
13 comments, 10 min read

BatchTopK: A Simple Improvement for TopK-SAEs

20 Jul 2024 2:20 UTC
52 points
0 comments, 4 min read

JumpReLU SAEs + Early Access to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
48 points
10 comments, 1 min read
(storage.googleapis.com)

SAEs (usually) Transfer Between Base and Chat Models

18 Jul 2024 10:29 UTC
65 points
0 comments, 10 min read

Stitching SAEs of different sizes

13 Jul 2024 17:19 UTC
39 points
12 comments, 12 min read

Neel Nanda’s Shortform

Neel Nanda, 12 Jul 2024 7:16 UTC
8 points
6 comments, 1 min read

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda, 7 Jul 2024 17:39 UTC
134 points
15 comments, 25 min read

Attention Output SAEs Improve Circuit Analysis

21 Jun 2024 12:56 UTC
31 points
0 comments, 19 min read

SAEs Discover Meaningful Features in the IOI Task

5 Jun 2024 23:48 UTC
15 points
2 comments, 10 min read

Mechanistic Interpretability Workshop Happening at ICML 2024!

3 May 2024 1:18 UTC
48 points
6 comments, 1 min read

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
70 points
14 comments, 17 min read