Nicholas Schiefer
Karma: 661
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arxiv.org)
evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez
12 Jan 2024 19:51 UTC, 305 points, 95 comments, 3 min read
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
evhub, Nicholas Schiefer, Carson Denison and Ethan Perez
8 Aug 2023 1:30 UTC, 312 points, 29 comments, 18 min read, 1 review
Engineering Monosemanticity in Toy Models (arxiv.org)
Adam Jermyn, evhub and Nicholas Schiefer
18 Nov 2022 1:43 UTC, 75 points, 7 comments, 3 min read
ELK Proposal—Make the Reporter care about the Predictor’s beliefs
Adam Jermyn and Nicholas Schiefer
11 Jun 2022 22:53 UTC, 8 points, 0 comments, 6 min read