Benjamin Wright

Karma: 363

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
307 points
22 comments · 10 min read · LW link

Evaluating Sparse Autoencoders with Board Game Models

2 Aug 2024 19:50 UTC
38 points
1 comment · 9 min read · LW link

Addressing Feature Suppression in SAEs

16 Feb 2024 18:32 UTC
86 points
4 comments · 10 min read · LW link