Joe Benton

Karma: 101

Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models

19 Feb 2025 20:47 UTC
15 points
1 comment, 1 min read
(alignment.anthropic.com)

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
94 points
56 comments, 6 min read
(assets.anthropic.com)