
evhub

Karma: 13,834

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic, where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

Training on Documents About Reward Hacking Induces Reward Hacking

21 Jan 2025 21:32 UTC
130 points
13 comments · 2 min read · LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
476 points
68 comments · 10 min read · LW link

Catastrophic sabotage as a major threat model for human-level AI systems

evhub · 22 Oct 2024 20:57 UTC
92 points
11 comments · 15 min read · LW link

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
94 points
55 comments · 6 min read · LW link
(assets.anthropic.com)

Automating LLM Auditing with Developmental Interpretability

4 Sep 2024 15:50 UTC
17 points
0 comments · 3 min read · LW link

Sycophancy to subterfuge: Investigating reward tampering in large language models

17 Jun 2024 18:41 UTC
161 points
22 comments · 8 min read · LW link
(arxiv.org)

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
78 points
5 comments · 21 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

6 May 2024 7:07 UTC
95 points
13 comments · 1 min read · LW link
(arxiv.org)

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

Inducing Unprompted Misalignment in LLMs

19 Apr 2024 20:00 UTC
38 points
7 comments · 16 min read · LW link

Measuring Predictability of Persona Evaluations

6 Apr 2024 8:46 UTC
20 points
0 comments · 7 min read · LW link

How to train your own “Sleeper Agents”

evhub · 7 Feb 2024 0:31 UTC
91 points
11 comments · 2 min read · LW link

Introducing Alignment Stress-Testing at Anthropic

evhub · 12 Jan 2024 23:51 UTC
182 points
23 comments · 2 min read · LW link

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
305 points
95 comments · 3 min read · LW link
(arxiv.org)

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
124 points
29 comments · 8 min read · LW link
(arxiv.org)

RSPs are pauses done right

evhub · 14 Oct 2023 4:06 UTC
164 points
73 comments · 7 min read · LW link · 1 review

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

8 Aug 2023 1:30 UTC
313 points
29 comments · 18 min read · LW link · 1 review

When can we trust model evaluations?

evhub · 28 Jul 2023 19:42 UTC
165 points
10 comments · 10 min read · LW link · 1 review

Training Process Transparency through Gradient Interpretability: Early experiments on toy language models

21 Jul 2023 14:52 UTC
56 points
1 comment · 1 min read · LW link

The Hubinger lectures on AGI safety: an introductory lecture series

evhub · 22 Jun 2023 0:59 UTC
126 points
0 comments · 1 min read · LW link
(www.youtube.com)