Arthur Conmy

Karma: 1,406

Interpretability

Views my own

Evolutionary prompt optimization for SAE feature visualization

14 Nov 2024 13:06 UTC
15 points
0 comments · 9 min read · LW link

SAEs are highly dataset dependent: a case study on the refusal direction

7 Nov 2024 5:22 UTC
62 points
4 comments · 14 min read · LW link

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

27 Oct 2024 18:46 UTC
38 points
4 comments · 5 min read · LW link

SAE features for refusal and sycophancy steering vectors

12 Oct 2024 14:54 UTC
26 points
4 comments · 7 min read · LW link

Base LLMs refuse too

29 Sep 2024 16:04 UTC
60 points
20 comments · 10 min read · LW link

Extracting SAE task features for in-context learning

12 Aug 2024 20:34 UTC
31 points
1 comment · 9 min read · LW link

Self-explaining SAE features

5 Aug 2024 22:20 UTC
60 points
13 comments · 10 min read · LW link