David Duvenaud

Karma: 466

Sabotage Evaluations for Frontier Models

David Duvenaud, Joe Benton, Sam Bowman, evhub, mishajw, Eric Christiansen, HoldenKarnofsky, Ethan Perez and Buck

18 Oct 2024 22:33 UTC

93 points

55 comments6 min readLW link

(assets.anthropic.com)

Simple probes can catch sleeper agents

Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub

23 Apr 2024 21:10 UTC

133 points

21 comments1 min readLW link

(www.anthropic.com)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez

12 Jan 2024 19:51 UTC

305 points

95 comments3 min readLW link

(arxiv.org)