RSS

Lee Sharkey

Karma: 1,532

Apollo Research (London).

My main research interests are mechanistic interpretability and inner alignment.

Show­ing SAE La­tents Are Not Atomic Us­ing Meta-SAEs

24 Aug 2024 0:56 UTC
61 points
9 comments20 min readLW link

In­ter­pretabil­ity as Com­pres­sion: Re­con­sid­er­ing SAE Ex­pla­na­tions of Neu­ral Ac­ti­va­tions with MDL-SAEs

23 Aug 2024 18:52 UTC
39 points
5 comments16 min readLW link

A List of 45+ Mech In­terp Pro­ject Ideas from Apollo Re­search’s In­ter­pretabil­ity Team

18 Jul 2024 14:15 UTC
117 points
18 comments18 min readLW link

De­com­pos­ing the QK cir­cuit with Bilin­ear Sparse Dic­tionary Learning

2 Jul 2024 13:17 UTC
81 points
7 comments12 min readLW link

Apollo Re­search 1-year update

29 May 2024 17:44 UTC
93 points
0 comments7 min readLW link

Iden­ti­fy­ing Func­tion­ally Im­por­tant Fea­tures with End-to-End Sparse Dic­tionary Learning

17 May 2024 16:25 UTC
57 points
10 comments4 min readLW link
(arxiv.org)

Gated At­ten­tion Blocks: Pre­limi­nary Progress to­ward Re­mov­ing At­ten­tion Head Superposition

8 Apr 2024 11:14 UTC
37 points
4 comments15 min readLW link