Buck (Buck Shlegeris)

Karma: 5,862

Notes on control evaluations for safety cases

ryan_greenblatt, Buck and Fabien Roger

28 Feb 2024 16:15 UTC

32 points

0 comments · 32 min read · LW link

Toy models of AI control for concentrated catastrophe prevention

Fabien Roger and Buck

6 Feb 2024 1:38 UTC

50 points

2 comments · 7 min read · LW link

The case for ensuring that powerful AIs are controlled

ryan_greenblatt and Buck

24 Jan 2024 16:11 UTC

245 points

66 comments · 28 min read · LW link

Managing catastrophic misuse without robust AIs

ryan_greenblatt and Buck

16 Jan 2024 17:27 UTC

58 points

16 comments · 11 min read · LW link

Catching AIs red-handed

ryan_greenblatt and Buck

5 Jan 2024 17:43 UTC

82 points

18 comments · 17 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

ryan_greenblatt, Fabien Roger and Buck

23 Dec 2023 0:05 UTC

56 points

10 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

Ansh Radhakrishnan, Buck, ryan_greenblatt and Fabien Roger

16 Dec 2023 5:49 UTC

73 points

3 comments · 6 min read · LW link

AI Control: Improving Safety Despite Intentional Subversion

Buck, Fabien Roger, ryan_greenblatt and Kshitij Sachan

13 Dec 2023 15:51 UTC

197 points

7 comments · 10 min read · LW link

How useful is mechanistic interpretability?

ryan_greenblatt, Neel Nanda, Buck and habryka

1 Dec 2023 2:54 UTC

156 points

53 comments · 25 min read · LW link

Untrusted smart models and trusted dumb models

Buck

4 Nov 2023 3:06 UTC

80 points

12 comments · 6 min read · LW link

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

Fabien Roger and Buck

23 Oct 2023 16:37 UTC

101 points

3 comments · 8 min read · LW link

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

Buck and ryan_greenblatt

26 Jul 2023 17:02 UTC

83 points

18 comments · 1 min read · LW link

A freshman year during the AI midgame: my approach to the next year

Buck

14 Apr 2023 0:38 UTC

146 points

14 comments · 1 min read · LW link

One-layer transformers aren’t equivalent to a set of skip-trigrams

Buck

17 Feb 2023 17:26 UTC

120 points

10 comments · 7 min read · LW link

Trying to disambiguate different questions about whether RLHF is “good”

Buck

14 Dec 2022 4:03 UTC

106 points

47 comments · 7 min read · LW link · 1 review

Causal scrubbing: results on induction heads

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:59 UTC

34 points

1 comment · 17 min read · LW link

Causal scrubbing: results on a paren balance checker

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:59 UTC

34 points

2 comments · 30 min read · LW link

Causal scrubbing: Appendix

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:58 UTC

17 points

4 comments · 20 min read · LW link

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:58 UTC

197 points

35 comments · 20 min read · LW link · 1 review

Multi-Component Learning and S-Curves

Adam Jermyn and Buck

30 Nov 2022 1:37 UTC

61 points

24 comments · 7 min read · LW link