Daniel Lee

Karma: 78

Finding Features Causally Upstream of Refusal

14 Jan 2025 2:30 UTC
54 points
5 comments · 12 min read · LW link

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

6 Sep 2024 2:28 UTC
28 points
0 comments · 12 min read · LW link