RSS

Lee Sharkey

Karma: 1,703

Apollo Research (London).

My main research interests are mechanistic interpretability and inner alignment.

Paper: Open Prob­lems in Mechanis­tic Interpretability

Jan 29, 2025, 10:25 AM
68 points
0 comments1 min readLW link
(arxiv.org)

At­tri­bu­tion-based pa­ram­e­ter decomposition

Jan 25, 2025, 1:12 PM
107 points
17 comments4 min readLW link
(publications.apolloresearch.ai)

Show­ing SAE La­tents Are Not Atomic Us­ing Meta-SAEs

Aug 24, 2024, 12:56 AM
68 points
10 comments20 min readLW link

In­ter­pretabil­ity as Com­pres­sion: Re­con­sid­er­ing SAE Ex­pla­na­tions of Neu­ral Ac­ti­va­tions with MDL-SAEs

Aug 23, 2024, 6:52 PM
41 points
5 comments16 min readLW link

A List of 45+ Mech In­terp Pro­ject Ideas from Apollo Re­search’s In­ter­pretabil­ity Team

Jul 18, 2024, 2:15 PM
118 points
18 comments18 min readLW link

De­com­pos­ing the QK cir­cuit with Bilin­ear Sparse Dic­tionary Learning

Jul 2, 2024, 1:17 PM
86 points
7 comments12 min readLW link

Apollo Re­search 1-year update

May 29, 2024, 5:44 PM
93 points
0 comments7 min readLW link

Iden­ti­fy­ing Func­tion­ally Im­por­tant Fea­tures with End-to-End Sparse Dic­tionary Learning

May 17, 2024, 4:25 PM
57 points
20 comments4 min readLW link
(arxiv.org)

Gated At­ten­tion Blocks: Pre­limi­nary Progress to­ward Re­mov­ing At­ten­tion Head Superposition

Apr 8, 2024, 11:14 AM
42 points
4 comments15 min readLW link

Spar­sify: A mechanis­tic in­ter­pretabil­ity re­search agenda

Lee SharkeyApr 3, 2024, 12:34 PM
95 points
22 comments22 min readLW link

Ad­dress­ing Fea­ture Sup­pres­sion in SAEs

Feb 16, 2024, 6:32 PM
86 points
4 comments10 min readLW link

The­o­ries of Change for AI Auditing

Nov 13, 2023, 7:33 PM
54 points
0 comments18 min readLW link
(www.apolloresearch.ai)

An­nounc­ing Apollo Research

May 30, 2023, 4:17 PM
217 points
11 comments8 min readLW link

‘Fun­da­men­tal’ vs ‘ap­plied’ mechanis­tic in­ter­pretabil­ity research

Lee SharkeyMay 23, 2023, 6:26 PM
65 points
6 comments3 min readLW link

A tech­ni­cal note on bil­in­ear lay­ers for interpretability

Lee SharkeyMay 8, 2023, 6:06 AM
59 points
0 comments1 min readLW link
(arxiv.org)

A small up­date to the Sparse Cod­ing in­terim re­search report

Apr 30, 2023, 7:54 PM
61 points
5 comments1 min readLW link

Why al­most ev­ery RL agent does learned optimization

Lee SharkeyFeb 12, 2023, 4:58 AM
32 points
3 comments5 min readLW link

[In­terim re­search re­port] Tak­ing fea­tures out of su­per­po­si­tion with sparse autoencoders

Dec 13, 2022, 3:41 PM
149 points
23 comments22 min readLW link2 reviews

Cur­rent themes in mechanis­tic in­ter­pretabil­ity research

Nov 16, 2022, 2:14 PM
89 points
2 comments12 min readLW link

In­ter­pret­ing Neu­ral Net­works through the Poly­tope Lens

Sep 23, 2022, 5:58 PM
144 points
29 comments33 min readLW link