Neel Nanda

Karma: 9,417

The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research

Feb 24, 2025, 2:17 AM
46 points
1 comment, 7 min read

MATS Applications + Research Directions I’m Currently Excited About

Neel Nanda, Feb 6, 2025, 11:03 AM
72 points
7 comments, 8 min read

Learning Multi-Level Features with Matryoshka SAEs

Dec 19, 2024, 3:59 PM
33 points
4 comments, 11 min read

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Dec 11, 2024, 6:30 AM
82 points
6 comments, 2 min read
(www.neuronpedia.org)

Evolutionary prompt optimization for SAE feature visualization

Nov 14, 2024, 1:06 PM
20 points
0 comments, 9 min read

SAEs are highly dataset dependent: a case study on the refusal direction

Nov 7, 2024, 5:22 AM
66 points
4 comments, 14 min read

SAE Probing: What is it good for?

Nov 1, 2024, 7:23 PM
32 points
0 comments, 11 min read

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Oct 27, 2024, 6:46 PM
40 points
4 comments, 5 min read

SAE features for refusal and sycophancy steering vectors

Oct 12, 2024, 2:54 PM
29 points
4 comments, 7 min read

Base LLMs refuse too

Sep 29, 2024, 4:04 PM
60 points
20 comments, 10 min read

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Aug 24, 2024, 12:56 AM
68 points
10 comments, 20 min read

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

Aug 17, 2024, 1:16 AM
53 points
0 comments, 5 min read

Extracting SAE task features for in-context learning

Aug 12, 2024, 8:34 PM
31 points
1 comment, 9 min read

Self-explaining SAE features

Aug 5, 2024, 10:20 PM
60 points
13 comments, 10 min read

BatchTopK: A Simple Improvement for TopK-SAEs

Jul 20, 2024, 2:20 AM
53 points
0 comments, 4 min read

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Jul 19, 2024, 4:10 PM
48 points
10 comments, 1 min read
(storage.googleapis.com)

SAEs (usually) Transfer Between Base and Chat Models

Jul 18, 2024, 10:29 AM
66 points
0 comments, 10 min read

Stitching SAEs of different sizes

Jul 13, 2024, 5:19 PM
39 points
12 comments, 12 min read

Neel Nanda’s Shortform

Neel Nanda, Jul 12, 2024, 7:16 AM
8 points
7 comments, 1 min read

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda, Jul 7, 2024, 5:39 PM
134 points
16 comments, 25 min read