evhub

Karma: 13,915

Evan Hubinger (he/​him/​his) (evanjhub@gmail.com)

I am a research scientist at Anthropic, where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

Towards understanding-based safety evaluations

evhub · Mar 15, 2023, 6:18 PM
164 points
16 comments · 5 min read · LW link

Agents vs. Predictors: Concrete differentiating factors

evhub · Feb 24, 2023, 11:50 PM
37 points
3 comments · 4 min read · LW link

Bing Chat is blatantly, aggressively misaligned

evhub · Feb 15, 2023, 5:29 AM
403 points
181 comments · 2 min read · LW link · 1 review

Conditioning Predictive Models: Open problems, Conclusion, and Appendix

Feb 10, 2023, 7:21 PM
36 points
3 comments · 11 min read · LW link

Conditioning Predictive Models: Deployment strategy

Feb 9, 2023, 8:59 PM
28 points
0 comments · 10 min read · LW link

Conditioning Predictive Models: Interactions with other approaches

Feb 8, 2023, 6:19 PM
32 points
2 comments · 11 min read · LW link

Conditioning Predictive Models: Making inner alignment as easy as possible

Feb 7, 2023, 8:04 PM
27 points
2 comments · 19 min read · LW link

Conditioning Predictive Models: The case for competitiveness

Feb 6, 2023, 8:08 PM
20 points
3 comments · 11 min read · LW link

Conditioning Predictive Models: Outer alignment via careful conditioning

Feb 2, 2023, 8:28 PM
72 points
15 comments · 57 min read · LW link

Conditioning Predictive Models: Large language models as predictors

Feb 2, 2023, 8:28 PM
88 points
4 comments · 13 min read · LW link

Why I’m joining Anthropic

evhub · Jan 5, 2023, 1:12 AM
118 points
4 comments · 1 min read · LW link

Discovering Language Model Behaviors with Model-Written Evaluations

Dec 20, 2022, 8:08 PM
100 points
34 comments · 1 min read · LW link
(www.anthropic.com)

In defense of probably wrong mechanistic models

evhub · Dec 6, 2022, 11:24 PM
55 points
10 comments · 2 min read · LW link

Engineering Monosemanticity in Toy Models

Nov 18, 2022, 1:43 AM
75 points
7 comments · 3 min read · LW link
(arxiv.org)

We must be very clear: fraud in the service of effective altruism is unacceptable

evhub · Nov 10, 2022, 11:31 PM
42 points
56 comments · 1 min read · LW link

Attempts at Forwarding Speed Priors

Sep 24, 2022, 5:49 AM
30 points
2 comments · 18 min read · LW link

Toy Models of Superposition

evhub · Sep 21, 2022, 11:48 PM
69 points
4 comments · 5 min read · LW link · 1 review
(transformer-circuits.pub)

Path dependence in ML inductive biases

Sep 10, 2022, 1:38 AM
68 points
13 comments · 10 min read · LW link

Monitoring for deceptive alignment

evhub · Sep 8, 2022, 11:07 PM
135 points
8 comments · 9 min read · LW link

Sticky goals: a concrete experiment for understanding deceptive alignment

evhub · Sep 2, 2022, 9:57 PM
39 points
13 comments · 3 min read · LW link