Ethan Perez

Karma: 2,746

I’m a research scientist at Anthropic doing empirical safety research on language models. In the past, I’ve worked on automated red teaming of language models [1], the inverse scaling prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable.

Website: https://ethanperez.net/

Tips and Code for Empirical Research Workflows

20 Jan 2025 22:31 UTC
63 points
7 comments · 20 min read · LW link

Tips On Empirical Research Slides

8 Jan 2025 5:06 UTC
88 points
4 comments · 6 min read · LW link

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

16 Dec 2024 22:42 UTC
47 points
1 comment · 2 min read · LW link
(arxiv.org)

Best-of-N Jailbreaking

14 Dec 2024 4:58 UTC
78 points
5 comments · 2 min read · LW link
(arxiv.org)

Introducing the Anthropic Fellows Program

30 Nov 2024 23:47 UTC
26 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
94 points
55 comments · 6 min read · LW link
(assets.anthropic.com)

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
78 points
5 comments · 21 min read · LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

How I select alignment research projects

10 Apr 2024 4:33 UTC
35 points
4 comments · 24 min read · LW link

Tips for Empirical Alignment Research

Ethan Perez · 29 Feb 2024 6:04 UTC
155 points
4 comments · 23 min read · LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
88 points
14 comments · 9 min read · LW link
(arxiv.org)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
305 points
95 comments · 3 min read · LW link
(arxiv.org)

Towards Evaluating AI Systems for Moral Status Using Self-Reports

16 Nov 2023 20:18 UTC
45 points
3 comments · 1 min read · LW link
(arxiv.org)

Towards Understanding Sycophancy in Language Models

24 Oct 2023 0:30 UTC
66 points
0 comments · 2 min read · LW link
(arxiv.org)

VLM-RM: Specifying Rewards with Natural Language

23 Oct 2023 14:11 UTC
20 points
2 comments · 5 min read · LW link
(far.ai)