Buck

Karma: 10,830

CEO at Redwood Research.

AI safety is a highly collaborative field—almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I’m saying this here because it would feel repetitive to say “these ideas were developed in collaboration with various people” in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?

Mar 24, 2025, 5:55 PM
30 points
0 comments · 8 min read · LW link

Some articles in “International Security” that I enjoyed

Buck · Jan 31, 2025, 4:23 PM
129 points
10 comments · 4 min read · LW link

A sketch of an AI control safety case

Jan 30, 2025, 5:28 PM
57 points
0 comments · 5 min read · LW link

Ten people on the inside

Buck · Jan 28, 2025, 4:41 PM
136 points
28 comments · 4 min read · LW link

Early Experiments in Human Auditing for AI Control

Jan 23, 2025, 1:34 AM
27 points
0 comments · 7 min read · LW link

Thoughts on the conservative assumptions in AI control

Buck · Jan 17, 2025, 7:23 PM
90 points
5 comments · 13 min read · LW link

Measuring whether AIs can statelessly strategize to subvert security measures

Dec 19, 2024, 9:25 PM
61 points
0 comments · 11 min read · LW link

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
74 comments · 10 min read · LW link

Why imperfect adversarial robustness doesn’t doom AI control

Nov 18, 2024, 4:05 PM
62 points
25 comments · 2 min read · LW link

Win/continue/lose scenarios and execute/replace/audit protocols

Buck · Nov 15, 2024, 3:47 PM
64 points
2 comments · 7 min read · LW link

Sabotage Evaluations for Frontier Models

Oct 18, 2024, 10:33 PM
94 points
56 comments · 6 min read · LW link
(assets.anthropic.com)

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming

Buck · Oct 10, 2024, 1:36 PM
100 points
4 comments · 13 min read · LW link

How to prevent collusion when using untrusted models to monitor each other

Buck · Sep 25, 2024, 6:58 PM
83 points
11 comments · 22 min read · LW link

A basic systems architecture for AI agents that do autonomous research

Buck · Sep 23, 2024, 1:58 PM
189 points
16 comments · 8 min read · LW link

Distinguish worst-case analysis from instrumental training-gaming

Sep 5, 2024, 7:13 PM
37 points
0 comments · 5 min read · LW link

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Buck · Aug 26, 2024, 4:46 PM
308 points
77 comments · 4 min read · LW link

Fields that I reference when thinking about AI takeover prevention

Buck · Aug 13, 2024, 11:08 PM
144 points
16 comments · 10 min read · LW link
(redwoodresearch.substack.com)

Games for AI Control

Jul 11, 2024, 6:40 PM
43 points
0 comments · 5 min read · LW link

Scalable oversight as a quantitative rather than qualitative problem

Buck · Jul 6, 2024, 5:42 PM
85 points
11 comments · 3 min read · LW link

Different senses in which two AIs can be “the same”

Jun 24, 2024, 3:16 AM
68 points
2 comments · 4 min read · LW link