Benjamin Wright

Karma: 539

Alignment Faking in Large Language Models

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck

Dec 18, 2024, 5:19 PM

483 points

75 comments, 10 min read

Evaluating Sparse Autoencoders with Board Game Models

Adam Karvonen, Sam Marks, Can, Benjamin Wright, Jannik Brinkmann, Logan Riggs and Rico Angell

Aug 2, 2024, 7:50 PM

38 points

1 comment, 9 min read

Benjamin Wright Mar 29, 2024, 5:20 PM
2 points
1
on: SAE reconstruction errors are (empirically) pathological
One explanation for pathological errors is feature suppression/feature shrinkage (link). I’d be interested to see if errors are still pathological even if you use the methodology I proposed for finetuning to fix shrinkage. Your method of fixing the norm of the input is close but not quite the same.

Benjamin Wright Feb 16, 2024, 10:30 PM
3 points
0
in reply to: Joseph Bloom’s comment on: Fixing Feature Suppression in SAEs
The original perplexity of the LLM was ~38 on the OpenWebText slice I used. Thanks for the compliments!

Addressing Feature Suppression in SAEs

Benjamin Wright and Lee Sharkey

Feb 16, 2024, 6:32 PM

86 points

4 comments, 10 min read