Benjamin Wright
Karma: 363
Alignment Faking in Large Language Models
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck
18 Dec 2024 17:19 UTC, 311 points, 26 comments, 10 min read
Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen, Sam Marks, Can, Benjamin Wright, Jannik Brinkmann, Logan Riggs and Rico Angell
2 Aug 2024 19:50 UTC, 38 points, 1 comment, 9 min read
Addressing Feature Suppression in SAEs
Benjamin Wright and Lee Sharkey
16 Feb 2024 18:32 UTC, 86 points, 4 comments, 10 min read