David Duvenaud
Karma: 466
Sabotage Evaluations for Frontier Models
David Duvenaud, Joe Benton, Sam Bowman, evhub, mishajw, Eric Christiansen, HoldenKarnofsky, Ethan Perez and Buck
18 Oct 2024 22:33 UTC
93 points, 55 comments, 6 min read
LW link (assets.anthropic.com)
Simple probes can catch sleeper agents
Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub
23 Apr 2024 21:10 UTC
133 points, 21 comments, 1 min read
LW link (www.anthropic.com)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez
12 Jan 2024 19:51 UTC
305 points, 95 comments, 3 min read
LW link (arxiv.org)