
Fabien Roger

Karma: 5,172

Reasoning models don't always say what they think

Apr 9, 2025, 7:48 PM
28 points
4 comments · 1 min read · LW link
(www.anthropic.com)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

Apr 8, 2025, 5:32 PM
144 points
18 comments · 12 min read · LW link

Automated Researchers Can Subtly Sandbag

Mar 26, 2025, 7:13 PM
41 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)