
Redwood Research

Last edit: Dec 30, 2024, 10:12 AM by Dakara

Redwood Research is a nonprofit organization focused on mitigating risks from advanced artificial intelligence.

The directions of their research agenda are reflected in the posts listed below.

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
75 comments · 10 min read

The case for ensuring that powerful AIs are controlled

Jan 24, 2024, 4:11 PM
276 points
73 comments · 28 min read

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Dec 3, 2022, 12:58 AM
206 points
35 comments · 20 min read · 1 review

Takeaways from our robust injury classifier project [Redwood Research]

dmz · Sep 17, 2022, 3:55 AM
143 points
12 comments · 6 min read · 1 review

Benchmarks for Detecting Measurement Tampering [Redwood Research]

Sep 5, 2023, 4:44 PM
87 points
22 comments · 20 min read · 1 review
(arxiv.org)

AI Control: Improving Safety Despite Intentional Subversion

Dec 13, 2023, 3:51 PM
236 points
24 comments · 10 min read · 4 reviews

AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

DanielFilan · Aug 21, 2022, 11:50 PM
16 points
0 comments · 35 min read

Redwood Research’s current project

Buck · Sep 21, 2021, 11:30 PM
145 points
29 comments · 15 min read · 1 review

Catching AIs red-handed

Jan 5, 2024, 5:43 PM
111 points
27 comments · 17 min read

Preventing Language Models from hiding their reasoning

Oct 31, 2023, 2:34 PM
119 points
15 comments · 12 min read · 1 review

Redwood’s Technique-Focused Epistemic Strategy

adamShimi · Dec 12, 2021, 4:36 PM
48 points
1 comment · 7 min read

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

Oct 27, 2022, 1:32 AM
135 points
14 comments · 12 min read

Will alignment-faking Claude accept a deal to reveal its misalignment?

Jan 31, 2025, 4:49 PM
203 points
28 comments · 12 min read

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

Dec 16, 2023, 5:49 AM
76 points
4 comments · 6 min read · 1 review

Some common confusion about induction heads

Alexandre Variengien · Mar 28, 2023, 9:51 PM
64 points
4 comments · 5 min read

How will we update about scheming?

ryan_greenblatt · Jan 6, 2025, 8:21 PM
171 points
20 comments · 37 min read

Why I’m excited about Redwood Research’s current project

paulfchristiano · Nov 12, 2021, 7:26 PM
114 points
6 comments · 7 min read

High-stakes alignment via adversarial training [Redwood Research report]

May 5, 2022, 12:59 AM
142 points
29 comments · 9 min read

A quick experiment on LMs’ inductive biases in performing search

Alex Mallen · Apr 14, 2024, 3:41 AM
32 points
2 comments · 4 min read

Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22]

Nov 3, 2021, 6:22 PM
95 points
4 comments · 1 min read

[Paper] Stress-testing capability elicitation with password-locked models

Jun 4, 2024, 2:52 PM
85 points
10 comments · 12 min read
(arxiv.org)

A basic systems architecture for AI agents that do autonomous research

Buck · Sep 23, 2024, 1:58 PM
189 points
16 comments · 8 min read

Measurement tampering detection as a special case of weak-to-strong generalization

Dec 23, 2023, 12:05 AM
57 points
10 comments · 4 min read

LLMs are (mostly) not helped by filler tokens

Kshitij Sachan · Aug 10, 2023, 12:48 AM
66 points
35 comments · 6 min read

Notes on control evaluations for safety cases

Feb 28, 2024, 4:15 PM
49 points
0 comments · 32 min read

Causal scrubbing: results on induction heads

Dec 3, 2022, 12:59 AM
34 points
1 comment · 17 min read

Balancing Label Quantity and Quality for Scalable Elicitation

Alex Mallen · Oct 24, 2024, 4:49 PM
31 points
1 comment · 2 min read

Practical Pitfalls of Causal Scrubbing

Mar 27, 2023, 7:47 AM
87 points
17 comments · 13 min read

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

Oct 23, 2023, 4:37 PM
107 points
3 comments · 8 min read

Measuring whether AIs can statelessly strategize to subvert security measures

Dec 19, 2024, 9:25 PM
62 points
0 comments · 11 min read

Preventing model exfiltration with upload limits

ryan_greenblatt · Feb 6, 2024, 4:29 PM
71 points
22 comments · 14 min read

Managing catastrophic misuse without robust AIs

Jan 16, 2024, 5:27 PM
63 points
17 comments · 11 min read

Causal scrubbing: Appendix

Dec 3, 2022, 12:58 AM
18 points
4 comments · 20 min read

Improving the Welfare of AIs: A Nearcasted Proposal

ryan_greenblatt · Oct 30, 2023, 2:51 PM
114 points
9 comments · 20 min read · 1 review

Why imperfect adversarial robustness doesn’t doom AI control

Nov 18, 2024, 4:05 PM
62 points
25 comments · 2 min read

Polysemanticity and Capacity in Neural Networks

Oct 7, 2022, 5:51 PM
87 points
14 comments · 3 min read

Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small

Oct 12, 2022, 9:25 PM
50 points
11 comments · 4 min read

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

Jul 26, 2023, 5:02 PM
99 points
19 comments · 1 min read · 1 review

We’re Redwood Research, we do applied alignment research, AMA

Nate Thomas · Oct 6, 2021, 5:51 AM
56 points
2 comments · 2 min read
(forum.effectivealtruism.org)

Untrusted smart models and trusted dumb models

Buck · Nov 4, 2023, 3:06 AM
87 points
17 comments · 6 min read · 1 review

Redwood Research is hiring for several roles (Operations and Technical)

Apr 14, 2022, 4:57 PM
29 points
0 comments · 1 min read

Toy models of AI control for concentrated catastrophe prevention

Feb 6, 2024, 1:38 AM
51 points
2 comments · 7 min read

Redwood Research is hiring for several roles

Nov 29, 2021, 12:16 AM
44 points
0 comments · 1 min read

Some ideas for follow-up projects to Redwood Research’s recent paper

JanB · Jun 6, 2022, 1:29 PM
10 points
0 comments · 7 min read

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

Oct 28, 2022, 11:55 PM
101 points
9 comments · 9 min read · 2 reviews
(arxiv.org)

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger · Nov 17, 2023, 5:58 PM
93 points
9 comments · 11 min read · 1 review

Access to powerful AI might make computer security radically easier

Buck · Jun 8, 2024, 6:00 AM
105 points
14 comments · 6 min read

Causal scrubbing: results on a paren balance checker

Dec 3, 2022, 12:59 AM
34 points
2 comments · 30 min read

Apply to the second iteration of the ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2]

Buck · May 6, 2022, 4:23 AM
69 points
0 comments · 6 min read

A sketch of an AI control safety case

Jan 30, 2025, 5:28 PM
57 points
0 comments · 5 min read

How to prevent collusion when using untrusted models to monitor each other

Buck · Sep 25, 2024, 6:58 PM
89 points
11 comments · 22 min read

Win/continue/lose scenarios and execute/replace/audit protocols

Buck · Nov 15, 2024, 3:47 PM
64 points
2 comments · 7 min read