Ethan Perez

Karma: 2,746

I’m a research scientist at Anthropic doing empirical safety research on language models. In the past, I’ve worked on automated red teaming of language models [1], the inverse scaling prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable.

Website: https://ethanperez.net/

Tips and Code for Empirical Research Workflows

20 Jan 2025 22:31 UTC
63 points
7 comments · 20 min read · LW link

Tips On Empirical Research Slides

8 Jan 2025 5:06 UTC
88 points
4 comments · 6 min read · LW link

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

16 Dec 2024 22:42 UTC
47 points
1 comment · 2 min read · LW link
(arxiv.org)

Best-of-N Jailbreaking

14 Dec 2024 4:58 UTC
78 points
5 comments · 2 min read · LW link
(arxiv.org)

Introducing the Anthropic Fellows Program

30 Nov 2024 23:47 UTC
26 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
94 points
55 comments · 6 min read · LW link
(assets.anthropic.com)

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
78 points
5 comments · 21 min read · LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

How I select alignment research projects

10 Apr 2024 4:33 UTC
35 points
4 comments · 24 min read · LW link

Tips for Empirical Alignment Research

Ethan Perez · 29 Feb 2024 6:04 UTC
155 points
4 comments · 23 min read · LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
88 points
14 comments · 9 min read · LW link
(arxiv.org)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
305 points
95 comments · 3 min read · LW link
(arxiv.org)

Towards Evaluating AI Systems for Moral Status Using Self-Reports

16 Nov 2023 20:18 UTC
45 points
3 comments · 1 min read · LW link
(arxiv.org)

Towards Understanding Sycophancy in Language Models

24 Oct 2023 0:30 UTC
66 points
0 comments · 2 min read · LW link
(arxiv.org)

VLM-RM: Specifying Rewards with Natural Language

23 Oct 2023 14:11 UTC
20 points
2 comments · 5 min read · LW link
(far.ai)