Sam Marks

Karma: 3,045

Downstream applications as validation of interpretability progress

Sam Marks · Mar 31, 2025, 1:35 AM
98 points
1 comment · 7 min read · LW link

Auditing language models for hidden objectives

Mar 13, 2025, 7:18 PM
137 points
9 comments · 13 min read · LW link

Recommendations for Technical AI Safety Research Directions

Sam Marks · Jan 10, 2025, 7:34 PM
64 points
1 comment · 17 min read · LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
74 comments · 10 min read · LW link