Buck

Karma: 10,830

CEO at Redwood Research.

AI safety is a highly collaborative field—almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I’m saying this here because it would feel repetitive to say “these ideas were developed in collaboration with various people” in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?

Mar 24, 2025, 5:55 PM
30 points
0 comments · 8 min read · LW link

Some articles in “International Security” that I enjoyed

Buck · Jan 31, 2025, 4:23 PM
129 points
10 comments · 4 min read · LW link

A sketch of an AI control safety case

Jan 30, 2025, 5:28 PM
57 points
0 comments · 5 min read · LW link

Ten people on the inside

Buck · Jan 28, 2025, 4:41 PM
136 points
28 comments · 4 min read · LW link

Early Experiments in Human Auditing for AI Control

Jan 23, 2025, 1:34 AM
27 points
0 comments · 7 min read · LW link

Thoughts on the conservative assumptions in AI control

Buck · Jan 17, 2025, 7:23 PM
90 points
5 comments · 13 min read · LW link

Measuring whether AIs can statelessly strategize to subvert security measures

Dec 19, 2024, 9:25 PM
61 points
0 comments · 11 min read · LW link

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
74 comments · 10 min read · LW link

Why imperfect adversarial robustness doesn’t doom AI control

Nov 18, 2024, 4:05 PM
62 points
25 comments · 2 min read · LW link

Win/continue/lose scenarios and execute/replace/audit protocols

Buck · Nov 15, 2024, 3:47 PM
64 points
2 comments · 7 min read · LW link

Sabotage Evaluations for Frontier Models

Oct 18, 2024, 10:33 PM
94 points
56 comments · 6 min read · LW link
(assets.anthropic.com)

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming

Buck · Oct 10, 2024, 1:36 PM
100 points
4 comments · 13 min read · LW link

How to prevent collusion when using untrusted models to monitor each other

Buck · Sep 25, 2024, 6:58 PM
83 points
11 comments · 22 min read · LW link

A basic systems architecture for AI agents that do autonomous research

Buck · Sep 23, 2024, 1:58 PM
189 points
16 comments · 8 min read · LW link

Distinguish worst-case analysis from instrumental training-gaming

Sep 5, 2024, 7:13 PM
37 points
0 comments · 5 min read · LW link

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Buck · Aug 26, 2024, 4:46 PM
308 points
77 comments · 4 min read · LW link

Fields that I reference when thinking about AI takeover prevention

Buck · Aug 13, 2024, 11:08 PM
144 points
16 comments · 10 min read · LW link
(redwoodresearch.substack.com)

Games for AI Control

Jul 11, 2024, 6:40 PM
43 points
0 comments · 5 min read · LW link

Scalable oversight as a quantitative rather than qualitative problem

Buck · Jul 6, 2024, 5:42 PM
85 points
11 comments · 3 min read · LW link

Different senses in which two AIs can be “the same”

Jun 24, 2024, 3:16 AM
68 points
2 comments · 4 min read · LW link