Kei

Karma: 270

Auditing language models for hidden objectives

Mar 13, 2025, 7:18 PM
137 points
7 comments · 13 min read · LW link

Kei’s Shortform

Kei · Jan 27, 2025, 7:23 AM
3 points
5 comments · 1 min read · LW link

Reward hacking behavior can generalize across tasks

May 28, 2024, 4:33 PM
79 points
5 comments · 21 min read · LW link