
Apollo Research (org)

Tag · Last edit: Jul 19, 2024, 10:42 AM by Lee Sharkey

SAE feature geometry is outside the superposition hypothesis

jake_mendel · Jun 24, 2024, 4:07 PM
228 points
17 comments · 11 min read · LW link

Announcing Apollo Research

May 30, 2023, 4:17 PM
217 points
11 comments · 8 min read · LW link

Apollo Research 1-year update

May 29, 2024, 5:44 PM
93 points
0 comments · 7 min read · LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey · Apr 3, 2024, 12:34 PM
96 points
23 comments · 22 min read · LW link

You can remove GPT2’s LayerNorm by fine-tuning for an hour

StefanHex · Aug 8, 2024, 6:33 PM
164 points
11 comments · 8 min read · LW link

Interpretability: Integrated Gradients is a decent attribution method

May 20, 2024, 5:55 PM
23 points
7 comments · 6 min read · LW link

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

May 20, 2024, 5:53 PM
105 points
4 comments · 3 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

May 17, 2024, 4:25 PM
57 points
20 comments · 4 min read · LW link
(arxiv.org)

Understanding strategic deception and deceptive alignment

Sep 25, 2023, 4:27 PM
64 points
16 comments · 7 min read · LW link
(www.apolloresearch.ai)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

Jul 8, 2024, 10:24 PM
109 points
37 comments · 5 min read · LW link

We need a Science of Evals

Jan 22, 2024, 8:30 PM
71 points
13 comments · 9 min read · LW link

A starter guide for evals

Jan 8, 2024, 6:24 PM
53 points
2 comments · 12 min read · LW link
(www.apolloresearch.ai)

Theories of Change for AI Auditing

Nov 13, 2023, 7:33 PM
54 points
0 comments · 18 min read · LW link
(www.apolloresearch.ai)

Apollo Research is hiring evals and interpretability engineers & scientists

Marius Hobbhahn · Aug 4, 2023, 10:54 AM
25 points
0 comments · 2 min read · LW link

[Interim research report] Activation plateaus & sensitive directions in GPT2

Jul 5, 2024, 5:05 PM
65 points
2 comments · 5 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Jul 18, 2024, 2:15 PM
121 points
18 comments · 18 min read · LW link

An Opinionated Evals Reading List

Oct 15, 2024, 2:38 PM
65 points
0 comments · 13 min read · LW link
(www.apolloresearch.ai)

Try training token-level probes

StefanHex · Apr 14, 2025, 11:56 AM
46 points
4 comments · 8 min read · LW link

Attribution-based parameter decomposition

Jan 25, 2025, 1:12 PM
107 points
21 comments · 4 min read · LW link
(publications.apolloresearch.ai)

Detecting Strategic Deception Using Linear Probes

Feb 6, 2025, 3:46 PM
102 points
9 comments · 2 min read · LW link
(arxiv.org)

Paper: Open Problems in Mechanistic Interpretability

Jan 29, 2025, 10:25 AM
68 points
0 comments · 1 min read · LW link
(arxiv.org)

Activation space interpretability may be doomed

Jan 8, 2025, 12:49 PM
147 points
32 comments · 8 min read · LW link