
evhub (Evan Hubinger)

Karma: 12,026

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

Bing Chat is blatantly, aggressively misaligned

evhub · 15 Feb 2023 5:29 UTC
396 points
170 comments · 2 min read · LW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

8 Aug 2023 1:30 UTC
306 points
26 comments · 18 min read · LW link

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
298 points
94 comments · 3 min read · LW link
(arxiv.org)

Chris Olah’s views on AGI safety

evhub · 1 Nov 2019 20:13 UTC
206 points
38 comments · 12 min read · LW link · 2 reviews

An overview of 11 proposals for building safe advanced AI

evhub · 29 May 2020 20:38 UTC
205 points
36 comments · 38 min read · LW link · 2 reviews

Risks from Learned Optimization: Introduction

31 May 2019 23:44 UTC
184 points
42 comments · 12 min read · LW link · 3 reviews

Introducing Alignment Stress-Testing at Anthropic

evhub · 12 Jan 2024 23:51 UTC
179 points
23 comments · 2 min read · LW link

RSPs are pauses done right

evhub · 14 Oct 2023 4:06 UTC
164 points
70 comments · 7 min read · LW link

A transparency and interpretability tech tree

evhub · 16 Jun 2022 23:44 UTC
163 points
11 comments · 18 min read · LW link · 1 review

Towards understanding-based safety evaluations

evhub · 15 Mar 2023 18:18 UTC
152 points
16 comments · 5 min read · LW link

Understanding “Deep Double Descent”

evhub · 6 Dec 2019 0:00 UTC
149 points
51 comments · 5 min read · LW link · 4 reviews

AI coordination needs clear wins

evhub · 1 Sep 2022 23:41 UTC
146 points
16 comments · 2 min read · LW link · 1 review

Transformer Circuits

evhub · 22 Dec 2021 21:09 UTC
144 points
4 comments · 3 min read · LW link
(transformer-circuits.pub)

When can we trust model evaluations?

evhub · 28 Jul 2023 19:42 UTC
143 points
9 comments · 10 min read · LW link

Monitoring for deceptive alignment

evhub · 8 Sep 2022 23:07 UTC
135 points
8 comments · 9 min read · LW link

The Hubinger lectures on AGI safety: an introductory lecture series

evhub · 22 Jun 2023 0:59 UTC
126 points
0 comments · 1 min read · LW link
(www.youtube.com)

How do we become confident in the safety of a machine learning system?

evhub · 8 Nov 2021 22:49 UTC
126 points
5 comments · 31 min read · LW link

Why I’m joining Anthropic

evhub · 5 Jan 2023 1:12 UTC
121 points
4 comments · 1 min read · LW link