
evhub

Karma: 14,071

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

Auditing language models for hidden objectives

Mar 13, 2025, 7:18 PM
138 points
15 comments · 13 min read · LW link

Training on Documents About Reward Hacking Induces Reward Hacking

Jan 21, 2025, 9:32 PM
131 points
15 comments · 2 min read · LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
75 comments · 10 min read · LW link

Catastrophic sabotage as a major threat model for human-level AI systems

evhub · Oct 22, 2024, 8:57 PM
92 points
13 comments · 15 min read · LW link

Sabotage Evaluations for Frontier Models

Oct 18, 2024, 10:33 PM
94 points
56 comments · 6 min read · LW link
(assets.anthropic.com)

Automating LLM Auditing with Developmental Interpretability

Sep 4, 2024, 3:50 PM
19 points
0 comments · 3 min read · LW link

Sycophancy to subterfuge: Investigating reward tampering in large language models

Jun 17, 2024, 6:41 PM
161 points
22 comments · 8 min read · LW link
(arxiv.org)

Reward hacking behavior can generalize across tasks

May 28, 2024, 4:33 PM
79 points
5 comments · 21 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

May 6, 2024, 7:07 AM
95 points
13 comments · 1 min read · LW link
(arxiv.org)

Simple probes can catch sleeper agents

Apr 23, 2024, 9:10 PM
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

Inducing Unprompted Misalignment in LLMs

Apr 19, 2024, 8:00 PM
38 points
7 comments · 16 min read · LW link

Measuring Predictability of Persona Evaluations

Apr 6, 2024, 8:46 AM
20 points
0 comments · 7 min read · LW link

How to train your own “Sleeper Agents”

evhub · Feb 7, 2024, 12:31 AM
92 points
11 comments · 2 min read · LW link

Introducing Alignment Stress-Testing at Anthropic

evhub · Jan 12, 2024, 11:51 PM
182 points
23 comments · 2 min read · LW link

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 12, 2024, 7:51 PM
305 points
95 comments · 3 min read · LW link
(arxiv.org)

Steering Llama-2 with contrastive activation additions

Jan 2, 2024, 12:47 AM
125 points
29 comments · 8 min read · LW link
(arxiv.org)

RSPs are pauses done right

evhub · Oct 14, 2023, 4:06 AM
164 points
73 comments · 7 min read · LW link · 1 review

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Aug 8, 2023, 1:30 AM
318 points
30 comments · 18 min read · LW link · 1 review

When can we trust model evaluations?

evhub · Jul 28, 2023, 7:42 PM
166 points
10 comments · 10 min read · LW link · 1 review

Training Process Transparency through Gradient Interpretability: Early experiments on toy language models

Jul 21, 2023, 2:52 PM
56 points
1 comment · 1 min read · LW link