StefanHex

Karma: 1,362

Stefan Heimersheim. Research Scientist at Apollo Research, Mechanistic Interpretability. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

SAE regularization produces more interpretable models

28 Jan 2025 20:02 UTC
9 points
2 comments · 4 min read · LW link

Attribution-based parameter decomposition

25 Jan 2025 13:12 UTC
88 points
9 comments · 4 min read · LW link
(publications.apolloresearch.ai)

Analyzing how SAE features evolve across a forward pass

7 Nov 2024 22:07 UTC
47 points
0 comments · 1 min read · LW link
(arxiv.org)

Characterizing stable regions in the residual stream of LLMs

26 Sep 2024 13:44 UTC
42 points
4 comments · 1 min read · LW link
(arxiv.org)

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

25 Sep 2024 20:37 UTC
29 points
0 comments · 3 min read · LW link
(arxiv.org)