evhub

Karma: 13,915

Evan Hubinger (he/​him/​his) (evanjhub@gmail.com)

I am a research scientist at Anthropic, where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

Towards understanding-based safety evaluations

evhub · Mar 15, 2023, 6:18 PM
164 points
16 comments · 5 min read · LW link

Agents vs. Predictors: Concrete differentiating factors

evhub · Feb 24, 2023, 11:50 PM
37 points
3 comments · 4 min read · LW link

Bing Chat is blatantly, aggressively misaligned

evhub · Feb 15, 2023, 5:29 AM
403 points
181 comments · 2 min read · LW link · 1 review

Conditioning Predictive Models: Open problems, Conclusion, and Appendix

Feb 10, 2023, 7:21 PM
36 points
3 comments · 11 min read · LW link

Conditioning Predictive Models: Deployment strategy

Feb 9, 2023, 8:59 PM
28 points
0 comments · 10 min read · LW link

Conditioning Predictive Models: Interactions with other approaches

Feb 8, 2023, 6:19 PM
32 points
2 comments · 11 min read · LW link

Conditioning Predictive Models: Making inner alignment as easy as possible

Feb 7, 2023, 8:04 PM
27 points
2 comments · 19 min read · LW link

Conditioning Predictive Models: The case for competitiveness

Feb 6, 2023, 8:08 PM
20 points
3 comments · 11 min read · LW link

Conditioning Predictive Models: Outer alignment via careful conditioning

Feb 2, 2023, 8:28 PM
72 points
15 comments · 57 min read · LW link

Conditioning Predictive Models: Large language models as predictors

Feb 2, 2023, 8:28 PM
88 points
4 comments · 13 min read · LW link

Why I’m joining Anthropic

evhub · Jan 5, 2023, 1:12 AM
118 points
4 comments · 1 min read · LW link

Discovering Language Model Behaviors with Model-Written Evaluations

Dec 20, 2022, 8:08 PM
100 points
34 comments · 1 min read · LW link
(www.anthropic.com)

In defense of probably wrong mechanistic models

evhub · Dec 6, 2022, 11:24 PM
55 points
10 comments · 2 min read · LW link

Engineering Monosemanticity in Toy Models

Nov 18, 2022, 1:43 AM
75 points
7 comments · 3 min read · LW link
(arxiv.org)

We must be very clear: fraud in the service of effective altruism is unacceptable

evhub · Nov 10, 2022, 11:31 PM
42 points
56 comments · 1 min read · LW link

Attempts at Forwarding Speed Priors

Sep 24, 2022, 5:49 AM
30 points
2 comments · 18 min read · LW link

Toy Models of Superposition

evhub · Sep 21, 2022, 11:48 PM
69 points
4 comments · 5 min read · LW link · 1 review
(transformer-circuits.pub)

Path dependence in ML inductive biases

Sep 10, 2022, 1:38 AM
68 points
13 comments · 10 min read · LW link

Monitoring for deceptive alignment

evhub · Sep 8, 2022, 11:07 PM
135 points
8 comments · 9 min read · LW link

Sticky goals: a concrete experiment for understanding deceptive alignment

evhub · Sep 2, 2022, 9:57 PM
39 points
13 comments · 3 min read · LW link