
evhub

Karma: 14,071

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

Auditing language models for hidden objectives

Mar 13, 2025, 7:18 PM
138 points
15 comments · 13 min read · LW link

Training on Documents About Reward Hacking Induces Reward Hacking

Jan 21, 2025, 9:32 PM
131 points
15 comments · 2 min read · LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
75 comments · 10 min read · LW link

Catastrophic sabotage as a major threat model for human-level AI systems

evhub · Oct 22, 2024, 8:57 PM
92 points
13 comments · 15 min read · LW link

Sabotage Evaluations for Frontier Models

Oct 18, 2024, 10:33 PM
94 points
56 comments · 6 min read · LW link
(assets.anthropic.com)

Automating LLM Auditing with Developmental Interpretability

Sep 4, 2024, 3:50 PM
19 points
0 comments · 3 min read · LW link

Sycophancy to subterfuge: Investigating reward tampering in large language models

Jun 17, 2024, 6:41 PM
161 points
22 comments · 8 min read · LW link
(arxiv.org)

Reward hacking behavior can generalize across tasks

May 28, 2024, 4:33 PM
79 points
5 comments · 21 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

May 6, 2024, 7:07 AM
95 points
13 comments · 1 min read · LW link
(arxiv.org)

Simple probes can catch sleeper agents

Apr 23, 2024, 9:10 PM
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

Inducing Unprompted Misalignment in LLMs

Apr 19, 2024, 8:00 PM
38 points
7 comments · 16 min read · LW link

Measuring Predictability of Persona Evaluations

Apr 6, 2024, 8:46 AM
20 points
0 comments · 7 min read · LW link

How to train your own “Sleeper Agents”

evhub · Feb 7, 2024, 12:31 AM
92 points
11 comments · 2 min read · LW link

Introducing Alignment Stress-Testing at Anthropic

evhub · Jan 12, 2024, 11:51 PM
182 points
23 comments · 2 min read · LW link

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 12, 2024, 7:51 PM
305 points
95 comments · 3 min read · LW link
(arxiv.org)

Steering Llama-2 with contrastive activation additions

Jan 2, 2024, 12:47 AM
125 points
29 comments · 8 min read · LW link
(arxiv.org)

RSPs are pauses done right

evhub · Oct 14, 2023, 4:06 AM
164 points
73 comments · 7 min read · LW link · 1 review

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Aug 8, 2023, 1:30 AM
318 points
30 comments · 18 min read · LW link · 1 review

When can we trust model evaluations?

evhub · Jul 28, 2023, 7:42 PM
166 points
10 comments · 10 min read · LW link · 1 review

Training Process Transparency through Gradient Interpretability: Early experiments on toy language models

Jul 21, 2023, 2:52 PM
56 points
1 comment · 1 min read · LW link