Neel Nanda

Karma: 11,074

SAE features for refusal and sycophancy steering vectors

Oct 12, 2024, 2:54 PM
29 points
4 comments
7 min read

Base LLMs refuse too

Sep 29, 2024, 4:04 PM
60 points
20 comments
10 min read

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Aug 24, 2024, 12:56 AM
68 points
10 comments
20 min read

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

Aug 17, 2024, 1:16 AM
53 points
0 comments
5 min read

Extracting SAE task features for in-context learning

Aug 12, 2024, 8:34 PM
31 points
1 comment
9 min read

Self-explaining SAE features

Aug 5, 2024, 10:20 PM
61 points
13 comments
10 min read

BatchTopK: A Simple Improvement for TopK-SAEs

Jul 20, 2024, 2:20 AM
61 points
0 comments
4 min read

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Jul 19, 2024, 4:10 PM
49 points
10 comments
1 min read
(storage.googleapis.com)

SAEs (usually) Transfer Between Base and Chat Models

Jul 18, 2024, 10:29 AM
67 points
0 comments
10 min read

Stitching SAEs of different sizes

Jul 13, 2024, 5:19 PM
39 points
12 comments
12 min read

Neel Nanda’s Shortform

Neel Nanda
Jul 12, 2024, 7:16 AM
8 points
15 comments

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda
Jul 7, 2024, 5:39 PM
136 points
16 comments
25 min read

Attention Output SAEs Improve Circuit Analysis

Jun 21, 2024, 12:56 PM
33 points
3 comments
19 min read

SAEs Discover Meaningful Features in the IOI Task

Jun 5, 2024, 11:48 PM
15 points
2 comments
10 min read

Mechanistic Interpretability Workshop Happening at ICML 2024!

May 3, 2024, 1:18 AM
48 points
6 comments
1 min read

Transcoders enable fine-grained interpretable circuit analysis for language models

Apr 30, 2024, 5:58 PM
74 points
14 comments
17 min read

Refusal in LLMs is mediated by a single direction

Apr 27, 2024, 11:13 AM
247 points
95 comments
10 min read

Improving Dictionary Learning with Gated Sparse Autoencoders

Apr 25, 2024, 6:43 PM
63 points
38 comments
1 min read
(arxiv.org)

How to use and interpret activation patching

Apr 24, 2024, 8:35 AM
13 points
7 comments
19 min read

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Apr 19, 2024, 7:06 PM
79 points
10 comments
8 min read