Neel Nanda

Karma: 6,920

Mechanistic Interpretability Workshop Happening at ICML 2024!

3 May 2024 1:18 UTC
47 points
4 comments, 1 min read

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
54 points
11 comments, 17 min read

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
177 points
66 comments, 10 min read

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
61 points
35 comments, 1 min read
(arxiv.org)

How to use and interpret activation patching

24 Apr 2024 8:35 UTC
10 points
0 comments, 18 min read