Zachary Witten

Karma: 77

Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models

Feb 19, 2025, 8:47 PM
15 points
1 comment · 1 min read · LW link
(alignment.anthropic.com)

Content Features Aren’t Enough for Detecting Toxicity. One Needs User Features.

Zachary Witten · Feb 14, 2023, 6:48 PM
11 points
0 comments · 3 min read · LW link