Carson Denison

Karma: 1,489

I work on deceptive alignment and reward hacking at Anthropic.

Auditing language models for hidden objectives

Mar 13, 2025, 7:18 PM
141 points
15 comments · 13 min read · LW link

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
75 comments · 10 min read · LW link

Sycophancy to subterfuge: Investigating reward tampering in large language models

Jun 17, 2024, 6:41 PM
161 points
22 comments · 8 min read · LW link
(arxiv.org)

Reward hacking behavior can generalize across tasks

May 28, 2024, 4:33 PM
79 points
5 comments · 21 min read · LW link

Simple probes can catch sleeper agents

Apr 23, 2024, 9:10 PM
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
305 points
95 comments · 3 min read · LW link
(arxiv.org)

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

8 Aug 2023 1:30 UTC
318 points
30 comments · 18 min read · LW link · 1 review

[Question] How do I Optimize Team-Matching at Google

Carson Denison · 24 Feb 2022 22:10 UTC
8 points
1 comment · 1 min read · LW link