Apollo Research (org)

Tag. Last edit: 19 Jul 2024 10:42 UTC by Lee Sharkey

SAE feature geometry is outside the superposition hypothesis

jake_mendel, 24 Jun 2024 16:07 UTC
218 points
17 comments, 11 min read

Announcing Apollo Research

30 May 2023 16:17 UTC
215 points
11 comments, 8 min read

Apollo Research 1-year update

29 May 2024 17:44 UTC
93 points
0 comments, 7 min read

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey, 3 Apr 2024 12:34 UTC
94 points
22 comments, 22 min read

You can remove GPT2’s LayerNorm by fine-tuning for an hour

StefanHex, 8 Aug 2024 18:33 UTC
161 points
11 comments, 8 min read

Interpretability: Integrated Gradients is a decent attribution method

20 May 2024 17:55 UTC
22 points
7 comments, 6 min read

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

20 May 2024 17:53 UTC
105 points
4 comments, 3 min read

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

17 May 2024 16:25 UTC
57 points
10 comments, 4 min read

Understanding strategic deception and deceptive alignment

25 Sep 2023 16:27 UTC
64 points
16 comments, 7 min read

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

8 Jul 2024 22:24 UTC
101 points
28 comments, 5 min read

We need a Science of Evals

22 Jan 2024 20:30 UTC
71 points
13 comments, 9 min read

A starter guide for evals

8 Jan 2024 18:24 UTC
50 points
2 comments, 12 min read

Theories of Change for AI Auditing

13 Nov 2023 19:33 UTC
54 points
0 comments, 18 min read

Apollo Research is hiring evals and interpretability engineers & scientists

Marius Hobbhahn, 4 Aug 2023 10:54 UTC
25 points
0 comments, 2 min read

[Interim research report] Activation plateaus & sensitive directions in GPT2

5 Jul 2024 17:05 UTC
64 points
2 comments, 5 min read

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
117 points
18 comments, 18 min read

An Opinionated Evals Reading List

15 Oct 2024 14:38 UTC
47 points
0 comments, 13 min read