Buck (Buck Shlegeris)

Karma: 5,862

Notes on control evaluations for safety cases

28 Feb 2024 16:15 UTC
32 points
0 comments · 32 min read · LW link

Toy models of AI control for concentrated catastrophe prevention

6 Feb 2024 1:38 UTC
50 points
2 comments · 7 min read · LW link

The case for ensuring that powerful AIs are controlled

24 Jan 2024 16:11 UTC
245 points
66 comments · 28 min read · LW link

Managing catastrophic misuse without robust AIs

16 Jan 2024 17:27 UTC
58 points
16 comments · 11 min read · LW link

Catching AIs red-handed

5 Jan 2024 17:43 UTC
82 points
18 comments · 17 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
56 points
10 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

16 Dec 2023 5:49 UTC
73 points
3 comments · 6 min read · LW link

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
197 points
7 comments · 10 min read · LW link

How useful is mechanistic interpretability?

1 Dec 2023 2:54 UTC
156 points
53 comments · 25 min read · LW link

Untrusted smart models and trusted dumb models

Buck · 4 Nov 2023 3:06 UTC
80 points
12 comments · 6 min read · LW link

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

23 Oct 2023 16:37 UTC
101 points
3 comments · 8 min read · LW link

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

26 Jul 2023 17:02 UTC
83 points
18 comments · 1 min read · LW link

A freshman year during the AI midgame: my approach to the next year

Buck · 14 Apr 2023 0:38 UTC
146 points
14 comments · 1 min read · LW link

One-layer transformers aren’t equivalent to a set of skip-trigrams

Buck · 17 Feb 2023 17:26 UTC
120 points
10 comments · 7 min read · LW link

Trying to disambiguate different questions about whether RLHF is “good”

Buck · 14 Dec 2022 4:03 UTC
106 points
47 comments · 7 min read · LW link · 1 review

Causal scrubbing: results on induction heads

3 Dec 2022 0:59 UTC
34 points
1 comment · 17 min read · LW link

Causal scrubbing: results on a paren balance checker

3 Dec 2022 0:59 UTC
34 points
2 comments · 30 min read · LW link

Causal scrubbing: Appendix

3 Dec 2022 0:58 UTC
17 points
4 comments · 20 min read · LW link

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

3 Dec 2022 0:58 UTC
197 points
35 comments · 20 min read · LW link · 1 review

Multi-Component Learning and S-Curves

30 Nov 2022 1:37 UTC
61 points
24 comments · 7 min read · LW link