
Ethan Perez

Karma: 2,962

I’m a research scientist at Anthropic doing empirical safety research on language models. In the past, I’ve worked on automated red teaming of language models [1], the Inverse Scaling Prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable.

Website: https://ethanperez.net/

Unsupervised Elicitation of Language Models

Jun 13, 2025, 4:17 PM
14 points
0 comments · 2 min read · LW link

Unsupervised Elicitation of Language Models

Jun 13, 2025, 4:15 PM
39 points
9 comments · 2 min read · LW link

Modifying LLM Beliefs with Synthetic Document Finetuning

Apr 24, 2025, 9:15 PM
70 points
12 comments · 2 min read · LW link
(alignment.anthropic.com)

Reasoning models don’t always say what they think

Apr 9, 2025, 7:48 PM
28 points
4 comments · 1 min read · LW link
(www.anthropic.com)

Automated Researchers Can Subtly Sandbag

Mar 26, 2025, 7:13 PM
44 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Tips and Code for Empirical Research Workflows

Jan 20, 2025, 10:31 PM
94 points
14 comments · 20 min read · LW link

Tips On Empirical Research Slides

Jan 8, 2025, 5:06 AM
91 points
4 comments · 6 min read · LW link

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

Dec 16, 2024, 10:42 PM
49 points
1 comment · 2 min read · LW link
(arxiv.org)

Best-of-N Jailbreaking

Dec 14, 2024, 4:58 AM
78 points
5 comments · 2 min read · LW link
(arxiv.org)

Introducing the Anthropic Fellows Program

Nov 30, 2024, 11:47 PM
26 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Sabotage Evaluations for Frontier Models

Oct 18, 2024, 10:33 PM
95 points
56 comments · 6 min read · LW link
(assets.anthropic.com)

Reward hacking behavior can generalize across tasks

May 28, 2024, 4:33 PM
81 points
5 comments · 21 min read · LW link

Simple probes can catch sleeper agents

Apr 23, 2024, 9:10 PM
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

How I select alignment research projects

Apr 10, 2024, 4:33 AM
36 points
4 comments · 24 min read · LW link

Tips for Empirical Alignment Research

Feb 29, 2024, 6:04 AM
164 points
4 comments · 23 min read · LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

Feb 7, 2024, 9:28 PM
89 points
14 comments · 9 min read · LW link
(arxiv.org)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 12, 2024, 7:51 PM
305 points
95 comments · 3 min read · LW link
(arxiv.org)