Lee Sharkey

Karma: 1,712

Goodfire (London). Formerly cofounded Apollo Research.

My main research interests are mechanistic interpretability and inner alignment.

Paper: Open Problems in Mechanistic Interpretability

Lee Sharkey and bilalchughtai

Jan 29, 2025, 10:25 AM

68 points

0 comments · 1 min read · LW link

(arxiv.org)

Attribution-based parameter decomposition

Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel and Lee Sharkey

Jan 25, 2025, 1:12 PM

107 points

21 comments · 4 min read · LW link

(publications.apolloresearch.ai)

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey and Neel Nanda

Aug 24, 2024, 12:56 AM

68 points

10 comments · 20 min read · LW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Kola Ayonrinde, Michael Pearce and Lee Sharkey

Aug 23, 2024, 6:52 PM

42 points

8 comments · 16 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill

Jul 18, 2024, 2:15 PM

121 points

18 comments · 18 min read · LW link

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

keith_wynroe and Lee Sharkey

Jul 2, 2024, 1:17 PM

86 points

7 comments · 12 min read · LW link

Apollo Research 1-year update

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke and rusheb

May 29, 2024, 5:44 PM

93 points

0 comments · 7 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill and Lee Sharkey

May 17, 2024, 4:25 PM

57 points

20 comments · 4 min read · LW link

(arxiv.org)

Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition

cmathw, Dennis Akar and Lee Sharkey

Apr 8, 2024, 11:14 AM

42 points

4 comments · 15 min read · LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey

Apr 3, 2024, 12:34 PM

96 points

23 comments · 22 min read · LW link

Addressing Feature Suppression in SAEs

Benjamin Wright and Lee Sharkey

Feb 16, 2024, 6:32 PM

86 points

4 comments · 10 min read · LW link

Theories of Change for AI Auditing

Lee Sharkey, beren and Marius Hobbhahn

Nov 13, 2023, 7:33 PM

54 points

0 comments · 18 min read · LW link

(www.apolloresearch.ai)

Announcing Apollo Research

Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni and Jérémy Scheurer

May 30, 2023, 4:17 PM

217 points

11 comments · 8 min read · LW link

‘Fundamental’ vs ‘applied’ mechanistic interpretability research

Lee Sharkey

May 23, 2023, 6:26 PM

65 points

6 comments · 3 min read · LW link

A technical note on bilinear layers for interpretability

Lee Sharkey

May 8, 2023, 6:06 AM

59 points

0 comments · 1 min read · LW link

(arxiv.org)

A small update to the Sparse Coding interim research report

Lee Sharkey, Dan Braun and beren

Apr 30, 2023, 7:54 PM

61 points

5 comments · 1 min read · LW link

Why almost every RL agent does learned optimization

Lee Sharkey

Feb 12, 2023, 4:58 AM

32 points

3 comments · 5 min read · LW link

[Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey, Dan Braun and beren

Dec 13, 2022, 3:41 PM

150 points

23 comments · 22 min read · LW link · 2 reviews

Current themes in mechanistic interpretability research

Lee Sharkey, Sid Black and beren

Nov 16, 2022, 2:14 PM

89 points

2 comments · 12 min read · LW link

Interpreting Neural Networks through the Polytope Lens

Sid Black, Lee Sharkey, Connor Leahy, beren, CRG, merizian, Eric Winsor and Dan Braun

Sep 23, 2022, 5:58 PM

144 points

29 comments · 33 min read · LW link