
Buck

Karma: 11,616

CEO at Redwood Research.

AI safety is a highly collaborative field—almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I’m saying this here because it would feel repetitive to say “these ideas were developed in collaboration with various people” in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Fields that I reference when thinking about AI takeover prevention

Buck · Aug 13, 2024, 11:08 PM
144 points
16 comments · 10 min read · LW link
(redwoodresearch.substack.com)

Games for AI Control

Jul 11, 2024, 6:40 PM
45 points
0 comments · 5 min read · LW link

Scalable oversight as a quantitative rather than qualitative problem

Buck · Jul 6, 2024, 5:42 PM
86 points
11 comments · 3 min read · LW link

Different senses in which two AIs can be “the same”

Jun 24, 2024, 3:16 AM
69 points
2 comments · 4 min read · LW link

Access to powerful AI might make computer security radically easier

Buck · Jun 8, 2024, 6:00 AM
106 points
14 comments · 6 min read · LW link

AI catastrophes and rogue deployments

Buck · Jun 3, 2024, 5:04 PM
120 points
16 comments · 8 min read · LW link

Notes on control evaluations for safety cases

Feb 28, 2024, 4:15 PM
49 points
0 comments · 32 min read · LW link

Toy models of AI control for concentrated catastrophe prevention

Feb 6, 2024, 1:38 AM
51 points
2 comments · 7 min read · LW link

The case for ensuring that powerful AIs are controlled

Jan 24, 2024, 4:11 PM
276 points
73 comments · 28 min read · LW link

Managing catastrophic misuse without robust AIs

Jan 16, 2024, 5:27 PM
63 points
17 comments · 11 min read · LW link

Catching AIs red-handed

Jan 5, 2024, 5:43 PM
111 points
27 comments · 17 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

Dec 23, 2023, 12:05 AM
57 points
10 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

Dec 16, 2023, 5:49 AM
76 points
4 comments · 6 min read · LW link · 1 review

AI Control: Improving Safety Despite Intentional Subversion

Dec 13, 2023, 3:51 PM
236 points
24 comments · 10 min read · LW link · 4 reviews

How useful is mechanistic interpretability?

Dec 1, 2023, 2:54 AM
167 points
54 comments · 25 min read · LW link

Untrusted smart models and trusted dumb models

Buck · Nov 4, 2023, 3:06 AM
87 points
17 comments · 6 min read · LW link · 1 review

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

Oct 23, 2023, 4:37 PM
107 points
3 comments · 8 min read · LW link

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

Jul 26, 2023, 5:02 PM
100 points
19 comments · 1 min read · LW link · 1 review

A freshman year during the AI midgame: my approach to the next year

Buck · Apr 14, 2023, 12:38 AM
154 points
15 comments · LW link · 1 review

One-layer transformers aren’t equivalent to a set of skip-trigrams

Buck · Feb 17, 2023, 5:26 PM
127 points
11 comments · 7 min read · LW link