
Ethan Perez

Karma: 2,729

I’m a research scientist at Anthropic doing empirical safety research on language models. In the past, I’ve worked on automated red teaming of language models [1], the Inverse Scaling Prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable.

Website: https://ethanperez.net/

Tips and Code for Empirical Research Workflows

Jan 20, 2025, 10:31 PM
49 points
2 comments · 20 min read · LW link

Tips On Empirical Research Slides

Jan 8, 2025, 5:06 AM
88 points
4 comments · 6 min read · LW link

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

Dec 16, 2024, 10:42 PM
47 points
1 comment · 2 min read · LW link
(arxiv.org)

Best-of-N Jailbreaking

Dec 14, 2024, 4:58 AM
78 points
5 comments · 2 min read · LW link
(arxiv.org)

Introducing the Anthropic Fellows Program

Nov 30, 2024, 11:47 PM
26 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Sabotage Evaluations for Frontier Models

Oct 18, 2024, 10:33 PM
93 points
55 comments · 6 min read · LW link
(assets.anthropic.com)

Reward hacking behavior can generalize across tasks

May 28, 2024, 4:33 PM
78 points
5 comments · 21 min read · LW link

Simple probes can catch sleeper agents

Apr 23, 2024, 9:10 PM
133 points
20 comments · 1 min read · LW link
(www.anthropic.com)

How I select alignment research projects

Apr 10, 2024, 4:33 AM
35 points
4 comments · 24 min read · LW link

Tips for Empirical Alignment Research

Feb 29, 2024, 6:04 AM
155 points
4 comments · 23 min read · LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

Feb 7, 2024, 9:28 PM
88 points
14 comments · 9 min read · LW link
(arxiv.org)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 12, 2024, 7:51 PM
305 points
95 comments · 3 min read · LW link
(arxiv.org)

Towards Evaluating AI Systems for Moral Status Using Self-Reports

Nov 16, 2023, 8:18 PM
45 points
3 comments · 1 min read · LW link
(arxiv.org)

Towards Understanding Sycophancy in Language Models

Oct 24, 2023, 12:30 AM
66 points
0 comments · 2 min read · LW link
(arxiv.org)

VLM-RM: Specifying Rewards with Natural Language

Oct 23, 2023, 2:11 PM
20 points
2 comments · 5 min read · LW link
(far.ai)

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Aug 8, 2023, 1:30 AM
313 points
29 comments · 18 min read · LW link · 1 review

Measuring and Improving the Faithfulness of Model-Generated Reasoning

Jul 18, 2023, 4:36 PM
111 points
15 comments · 6 min read · LW link · 1 review

Imitation Learning from Language Feedback

Mar 30, 2023, 2:11 PM
71 points
3 comments · 10 min read · LW link

Pretraining Language Models with Human Preferences

Feb 21, 2023, 5:57 PM
135 points
20 comments · 11 min read · LW link · 2 reviews

Inverse Scaling Prize: Second Round Winners

Jan 24, 2023, 8:12 PM
58 points
17 comments · 15 min read · LW link