Nicholas Schiefer

Karma: 667

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 12, 2024, 7:51 PM
305 points
95 comments, 3 min read, LW link
(arxiv.org)

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Aug 8, 2023, 1:30 AM
318 points
29 comments, 18 min read, LW link, 1 review

Engineering Monosemanticity in Toy Models

Nov 18, 2022, 1:43 AM
75 points
7 comments, 3 min read, LW link
(arxiv.org)

ELK Proposal - Make the Reporter care about the Predictor’s beliefs

Jun 11, 2022, 10:53 PM
8 points
0 comments, 6 min read, LW link