StefanHex

Karma: 1,362

Stefan Heimersheim. Research Scientist at Apollo Research, Mechanistic Interpretability. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

SAE regularization produces more interpretable models

28 Jan 2025 20:02 UTC
18 points
2 comments · 4 min read · LW link

Attribution-based parameter decomposition

25 Jan 2025 13:12 UTC
99 points
11 comments · 4 min read · LW link
(publications.apolloresearch.ai)

Analyzing how SAE features evolve across a forward pass

7 Nov 2024 22:07 UTC
47 points
0 comments · 1 min read · LW link
(arxiv.org)

Characterizing stable regions in the residual stream of LLMs

26 Sep 2024 13:44 UTC
42 points
4 comments · 1 min read · LW link
(arxiv.org)

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

25 Sep 2024 20:37 UTC
29 points
0 comments · 3 min read · LW link
(arxiv.org)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

6 Sep 2024 2:28 UTC
28 points
0 comments · 12 min read · LW link

You can remove GPT2’s LayerNorm by fine-tuning for an hour

8 Aug 2024 18:33 UTC
161 points
11 comments · 8 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
118 points
18 comments · 18 min read · LW link

[Interim research report] Activation plateaus &amp; sensitive directions in GPT2

5 Jul 2024 17:05 UTC
65 points
2 comments · 5 min read · LW link

StefanHex’s Shortform

5 Jul 2024 14:31 UTC
5 points
22 comments · 1 min read · LW link

Apollo Research 1-year update

29 May 2024 17:44 UTC
93 points
0 comments · 7 min read · LW link

Interpretability: Integrated Gradients is a decent attribution method

20 May 2024 17:55 UTC
22 points
7 comments · 6 min read · LW link

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

20 May 2024 17:53 UTC
105 points
4 comments · 3 min read · LW link

How to use and interpret activation patching

24 Apr 2024 8:35 UTC
12 points
2 comments · 18 min read · LW link

Polysemantic Attention Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
51 points
0 comments · 6 min read · LW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2

25 May 2023 15:37 UTC
71 points
1 comment · 13 min read · LW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

9 May 2023 19:41 UTC
119 points
1 comment · 10 min read · LW link

Residual stream norms grow exponentially over the forward pass

7 May 2023 0:46 UTC
77 points
24 comments · 11 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

20 Feb 2023 19:35 UTC
96 points
8 comments · 21 min read · LW link

How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!

24 Jan 2023 18:45 UTC
47 points
5 comments · 13 min read · LW link