Joe Benton

Karma: 232

Reasoning models don’t always say what they think

Apr 9, 2025, 7:48 PM
28 points
4 comments · 1 min read · LW link
(www.anthropic.com)

Do models say what they learn?

Mar 22, 2025, 3:19 PM
115 points
12 comments · 13 min read · LW link

Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models

Feb 19, 2025, 8:47 PM
15 points
1 comment · 1 min read · LW link
(alignment.anthropic.com)

Sabotage Evaluations for Frontier Models

Oct 18, 2024, 10:33 PM
94 points
56 comments · 6 min read · LW link
(assets.anthropic.com)