Neel Nanda

Karma: 6,942

Mechanistic Interpretability Workshop Happening at ICML 2024!

3 May 2024 1:18 UTC
47 points
4 comments, 1 min read

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
56 points
12 comments, 17 min read

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
183 points
75 comments, 10 min read