Buck

Karma: 10,403

CEO at Redwood Research.

AI safety is a highly collaborative field—almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I’m saying this here because it would feel repetitive to say “these ideas were developed in collaboration with various people” in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Some articles in “International Security” that I enjoyed

Buck · 31 Jan 2025 16:23 UTC
62 points
3 comments · 4 min read · LW link

A sketch of an AI control safety case

30 Jan 2025 17:28 UTC
55 points
0 comments · 5 min read · LW link

Ten people on the inside

Buck · 28 Jan 2025 16:41 UTC
124 points
24 comments · 4 min read · LW link

Early Experiments in Human Auditing for AI Control

23 Jan 2025 1:34 UTC
27 points
0 comments · 7 min read · LW link

Thoughts on the conservative assumptions in AI control

Buck · 17 Jan 2025 19:23 UTC
90 points
5 comments · 13 min read · LW link

Measuring whether AIs can statelessly strategize to subvert security measures

19 Dec 2024 21:25 UTC
60 points
0 comments · 11 min read · LW link

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
478 points
70 comments · 10 min read · LW link

Why imperfect adversarial robustness doesn’t doom AI control

18 Nov 2024 16:05 UTC
61 points
25 comments · 2 min read · LW link

Win/continue/lose scenarios and execute/replace/audit protocols

Buck · 15 Nov 2024 15:47 UTC
56 points
2 comments · 7 min read · LW link

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
94 points
55 comments · 6 min read · LW link
(assets.anthropic.com)

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming

Buck · 10 Oct 2024 13:36 UTC
100 points
4 comments · 13 min read · LW link

How to prevent collusion when using untrusted models to monitor each other

Buck · 25 Sep 2024 18:58 UTC
83 points
11 comments · 22 min read · LW link

A basic systems architecture for AI agents that do autonomous research

Buck · 23 Sep 2024 13:58 UTC
189 points
15 comments · 8 min read · LW link

Distinguish worst-case analysis from instrumental training-gaming

5 Sep 2024 19:13 UTC
37 points
0 comments · 5 min read · LW link

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Buck · 26 Aug 2024 16:46 UTC
305 points
77 comments · 4 min read · LW link

Fields that I reference when thinking about AI takeover prevention

Buck · 13 Aug 2024 23:08 UTC
144 points
16 comments · 10 min read · LW link
(redwoodresearch.substack.com)

Games for AI Control

11 Jul 2024 18:40 UTC
43 points
0 comments · 5 min read · LW link

Scalable oversight as a quantitative rather than qualitative problem

Buck · 6 Jul 2024 17:42 UTC
85 points
11 comments · 3 min read · LW link

Different senses in which two AIs can be “the same”

24 Jun 2024 3:16 UTC
68 points
1 comment · 4 min read · LW link

Access to powerful AI might make computer security radically easier

Buck · 8 Jun 2024 6:00 UTC
97 points
14 comments · 6 min read · LW link