Daniel Lee

Karma: 72

Finding Features Causally Upstream of Refusal

Jan 14, 2025, 2:30 AM
48 points
5 comments, 12 min read, LW link

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Sep 6, 2024, 2:28 AM
28 points
0 comments, 12 min read, LW link