
StefanHex

Karma: 1,584

Stefan Heimersheim. Research Scientist at Apollo Research, working on mechanistic interpretability. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

Proof-of-Concept Debugger for a Small LLM

Mar 17, 2025, 10:27 PM
20 points
0 comments · 11 min read · LW link

Detecting Strategic Deception Using Linear Probes

Feb 6, 2025, 3:46 PM
102 points
9 comments · 2 min read · LW link
(arxiv.org)

SAE regularization produces more interpretable models

Jan 28, 2025, 8:02 PM
21 points
7 comments · 4 min read · LW link

Attribution-based parameter decomposition

Jan 25, 2025, 1:12 PM
107 points
21 comments · 4 min read · LW link
(publications.apolloresearch.ai)

Analyzing how SAE features evolve across a forward pass

Nov 7, 2024, 10:07 PM
47 points
0 comments · 1 min read · LW link
(arxiv.org)

Characterizing stable regions in the residual stream of LLMs

Sep 26, 2024, 1:44 PM
42 points
4 comments · 1 min read · LW link
(arxiv.org)

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

Sep 25, 2024, 8:37 PM
29 points
0 comments · 3 min read · LW link
(arxiv.org)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Sep 6, 2024, 2:28 AM
28 points
0 comments · 12 min read · LW link

You can remove GPT2’s LayerNorm by fine-tuning for an hour

Aug 8, 2024, 6:33 PM
162 points
11 comments · 8 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Jul 18, 2024, 2:15 PM
120 points
18 comments · 18 min read · LW link

[Interim research report] Activation plateaus & sensitive directions in GPT2

Jul 5, 2024, 5:05 PM
65 points
2 comments · 5 min read · LW link

StefanHex’s Shortform

Jul 5, 2024, 2:31 PM
5 points
59 comments · 1 min read · LW link

Apollo Research 1-year update

May 29, 2024, 5:44 PM
93 points
0 comments · 7 min read · LW link

Interpretability: Integrated Gradients is a decent attribution method

May 20, 2024, 5:55 PM
23 points
7 comments · 6 min read · LW link

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

May 20, 2024, 5:53 PM
105 points
4 comments · 3 min read · LW link

How to use and interpret activation patching

Apr 24, 2024, 8:35 AM
13 points
2 comments · 18 min read · LW link

Polysemantic Attention Head in a 4-Layer Transformer

Nov 9, 2023, 4:16 PM
51 points
0 comments · 6 min read · LW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2

May 25, 2023, 3:37 PM
71 points
1 comment · 13 min read · LW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

May 9, 2023, 7:41 PM
119 points
1 comment · 10 min read · LW link

Residual stream norms grow exponentially over the forward pass

May 7, 2023, 12:46 AM
77 points
24 comments · 11 min read · LW link