evhub (Evan Hubinger)

Karma: 12,026

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic, where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

6 May 2024 7:07 UTC · 82 points · 4 comments · 1 min read · LW link (arxiv.org)

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC · 119 points · 17 comments · 1 min read · LW link (www.anthropic.com)

Inducing Unprompted Misalignment in LLMs

19 Apr 2024 20:00 UTC · 37 points · 6 comments · 16 min read · LW link

Measuring Predictability of Persona Evaluations

6 Apr 2024 8:46 UTC · 19 points · 0 comments · 7 min read · LW link

How to train your own “Sleeper Agents”

evhub · 7 Feb 2024 0:31 UTC · 91 points · 10 comments · 2 min read · LW link

Introducing Alignment Stress-Testing at Anthropic

evhub · 12 Jan 2024 23:51 UTC · 179 points · 23 comments · 2 min read · LW link

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC · 292 points · 94 comments · 3 min read · LW link (arxiv.org)

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC · 122 points · 29 comments · 8 min read · LW link (arxiv.org)

RSPs are pauses done right

evhub · 14 Oct 2023 4:06 UTC · 164 points · 70 comments · 7 min read · LW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

8 Aug 2023 1:30 UTC · 306 points · 26 comments · 18 min read · LW link

When can we trust model evaluations?

evhub · 28 Jul 2023 19:42 UTC · 143 points · 9 comments · 10 min read · LW link

Training Process Transparency through Gradient Interpretability: Early experiments on toy language models

21 Jul 2023 14:52 UTC · 56 points · 1 comment · 1 min read · LW link

The Hubinger lectures on AGI safety: an introductory lecture series

evhub · 22 Jun 2023 0:59 UTC · 126 points · 0 comments · 1 min read · LW link (www.youtube.com)

Towards understanding-based safety evaluations

evhub · 15 Mar 2023 18:18 UTC · 152 points · 16 comments · 5 min read · LW link

Agents vs. Predictors: Concrete differentiating factors

evhub · 24 Feb 2023 23:50 UTC · 37 points · 3 comments · 4 min read · LW link

Bing Chat is blatantly, aggressively misaligned

evhub · 15 Feb 2023 5:29 UTC · 396 points · 170 comments · 2 min read · LW link

Conditioning Predictive Models: Open problems, Conclusion, and Appendix

10 Feb 2023 19:21 UTC · 36 points · 3 comments · 11 min read · LW link

Conditioning Predictive Models: Deployment strategy

9 Feb 2023 20:59 UTC · 28 points · 0 comments · 10 min read · LW link

Conditioning Predictive Models: Interactions with other approaches

8 Feb 2023 18:19 UTC · 32 points · 2 comments · 11 min read · LW link

Conditioning Predictive Models: Making inner alignment as easy as possible

7 Feb 2023 20:04 UTC · 27 points · 2 comments · 19 min read · LW link