RSS

Arthur Conmy

Karma: 1,650

Intepretability

Views my own

Nega­tive Re­sults for SAEs On Down­stream Tasks and Depri­ori­tis­ing SAE Re­search (GDM Mech In­terp Team Progress Up­date #2)

Mar 26, 2025, 7:07 PM
109 points
15 comments29 min readLW link
(deepmindsafetyresearch.medium.com)

The GDM AGI Safety+Align­ment Team is Hiring for Ap­plied In­ter­pretabil­ity Research

Feb 24, 2025, 2:17 AM
48 points
1 comment7 min readLW link

SAEBench: A Com­pre­hen­sive Bench­mark for Sparse Autoencoders

Dec 11, 2024, 6:30 AM
82 points
6 comments2 min readLW link
(www.neuronpedia.org)

Evolu­tion­ary prompt op­ti­miza­tion for SAE fea­ture visualization

Nov 14, 2024, 1:06 PM
21 points
0 comments9 min readLW link

SAEs are highly dataset de­pen­dent: a case study on the re­fusal direction

Nov 7, 2024, 5:22 AM
66 points
4 comments14 min readLW link

Open Source Repli­ca­tion of An­thropic’s Cross­coder pa­per for model-diffing

Oct 27, 2024, 6:46 PM
47 points
4 comments5 min readLW link

SAE fea­tures for re­fusal and syco­phancy steer­ing vectors

Oct 12, 2024, 2:54 PM
29 points
4 comments7 min readLW link

Base LLMs re­fuse too

Sep 29, 2024, 4:04 PM
60 points
20 comments10 min readLW link

Ex­tract­ing SAE task fea­tures for in-con­text learning

Aug 12, 2024, 8:34 PM
31 points
1 comment9 min readLW link

Self-ex­plain­ing SAE features

Aug 5, 2024, 10:20 PM
60 points
13 comments10 min readLW link

JumpReLU SAEs + Early Ac­cess to Gemma 2 SAEs

Jul 19, 2024, 4:10 PM
48 points
10 comments1 min readLW link
(storage.googleapis.com)

SAEs (usu­ally) Trans­fer Between Base and Chat Models

Jul 18, 2024, 10:29 AM
66 points
0 comments10 min readLW link

At­ten­tion Out­put SAEs Im­prove Cir­cuit Analysis

Jun 21, 2024, 12:56 PM
33 points
3 comments19 min readLW link

Im­prov­ing Dic­tionary Learn­ing with Gated Sparse Autoencoders

Apr 25, 2024, 6:43 PM
63 points
38 comments1 min readLW link
(arxiv.org)

[Full Post] Progress Up­date #1 from the GDM Mech In­terp Team

Apr 19, 2024, 7:06 PM
79 points
10 comments8 min readLW link

[Sum­mary] Progress Up­date #1 from the GDM Mech In­terp Team

Apr 19, 2024, 7:06 PM
72 points
0 comments3 min readLW link

We In­spected Every Head In GPT-2 Small us­ing SAEs So You Don’t Have To

Mar 6, 2024, 5:03 AM
63 points
0 comments12 min readLW link

At­ten­tion SAEs Scale to GPT-2 Small

Feb 3, 2024, 6:50 AM
78 points
4 comments8 min readLW link

Sparse Au­toen­coders Work on At­ten­tion Layer Outputs

Jan 16, 2024, 12:26 AM
83 points
9 comments18 min readLW link

My best guess at the im­por­tant tricks for train­ing 1L SAEs

Arthur ConmyDec 21, 2023, 1:59 AM
37 points
4 comments3 min readLW link