StefanHex

Karma: 1,362

Stefan Heimersheim. Research Scientist at Apollo Research, Mechanistic Interpretability. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

SAE regularization produces more interpretable models

28 Jan 2025 20:02 UTC
18 points
2 comments · 4 min read · LW link

Attribution-based parameter decomposition

25 Jan 2025 13:12 UTC
99 points
11 comments · 4 min read · LW link
(publications.apolloresearch.ai)

Analyzing how SAE features evolve across a forward pass

7 Nov 2024 22:07 UTC
47 points
0 comments · 1 min read · LW link
(arxiv.org)

Characterizing stable regions in the residual stream of LLMs

26 Sep 2024 13:44 UTC
42 points
4 comments · 1 min read · LW link
(arxiv.org)

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

25 Sep 2024 20:37 UTC
29 points
0 comments · 3 min read · LW link
(arxiv.org)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

6 Sep 2024 2:28 UTC
28 points
0 comments · 12 min read · LW link

You can remove GPT2’s LayerNorm by fine-tuning for an hour

8 Aug 2024 18:33 UTC
161 points
11 comments · 8 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
118 points
18 comments · 18 min read · LW link

[Interim research report] Activation plateaus &amp; sensitive directions in GPT2

5 Jul 2024 17:05 UTC
65 points
2 comments · 5 min read · LW link

StefanHex’s Shortform

5 Jul 2024 14:31 UTC
5 points
22 comments · 1 min read · LW link

Apollo Research 1-year update

29 May 2024 17:44 UTC
93 points
0 comments · 7 min read · LW link

Interpretability: Integrated Gradients is a decent attribution method

20 May 2024 17:55 UTC
22 points
7 comments · 6 min read · LW link

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

20 May 2024 17:53 UTC
105 points
4 comments · 3 min read · LW link

How to use and interpret activation patching

24 Apr 2024 8:35 UTC
12 points
2 comments · 18 min read · LW link

Polysemantic Attention Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
51 points
0 comments · 6 min read · LW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2

25 May 2023 15:37 UTC
71 points
1 comment · 13 min read · LW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

9 May 2023 19:41 UTC
119 points
1 comment · 10 min read · LW link

Residual stream norms grow exponentially over the forward pass

7 May 2023 0:46 UTC
77 points
24 comments · 11 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

20 Feb 2023 19:35 UTC
96 points
8 comments · 21 min read · LW link

How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!

24 Jan 2023 18:45 UTC
47 points
5 comments · 13 min read · LW link