
Redwood Research

Last edit: Dec 30, 2024, 10:12 AM by Dakara

Redwood Research is a nonprofit organization focused on mitigating risks from advanced artificial intelligence.

The directions of their research agenda are reflected in the posts listed below.

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
75 comments · 10 min read

The case for ensuring that powerful AIs are controlled

Jan 24, 2024, 4:11 PM
276 points
73 comments · 28 min read

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Dec 3, 2022, 12:58 AM
206 points
35 comments · 20 min read · 1 review

Takeaways from our robust injury classifier project [Redwood Research]

dmz · Sep 17, 2022, 3:55 AM
143 points
12 comments · 6 min read · 1 review

Benchmarks for Detecting Measurement Tampering [Redwood Research]

Sep 5, 2023, 4:44 PM
87 points
22 comments · 20 min read · 1 review
(arxiv.org)

AI Control: Improving Safety Despite Intentional Subversion

Dec 13, 2023, 3:51 PM
236 points
24 comments · 10 min read · 4 reviews

AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

DanielFilan · Aug 21, 2022, 11:50 PM
16 points
0 comments · 35 min read

Redwood Research’s current project

Buck · Sep 21, 2021, 11:30 PM
145 points
29 comments · 15 min read · 1 review

Catching AIs red-handed

Jan 5, 2024, 5:43 PM
111 points
27 comments · 17 min read

Preventing Language Models from hiding their reasoning

Oct 31, 2023, 2:34 PM
119 points
15 comments · 12 min read · 1 review

Redwood’s Technique-Focused Epistemic Strategy

adamShimi · Dec 12, 2021, 4:36 PM
48 points
1 comment · 7 min read

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

Oct 27, 2022, 1:32 AM
135 points
14 comments · 12 min read

Will alignment-faking Claude accept a deal to reveal its misalignment?

Jan 31, 2025, 4:49 PM
203 points
28 comments · 12 min read

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

Dec 16, 2023, 5:49 AM
76 points
4 comments · 6 min read · 1 review

Some common confusion about induction heads

Alexandre Variengien · Mar 28, 2023, 9:51 PM
64 points
4 comments · 5 min read

How will we update about scheming?

ryan_greenblatt · Jan 6, 2025, 8:21 PM
171 points
20 comments · 37 min read

Why I’m excited about Redwood Research’s current project

paulfchristiano · Nov 12, 2021, 7:26 PM
114 points
6 comments · 7 min read

High-stakes alignment via adversarial training [Redwood Research report]

May 5, 2022, 12:59 AM
142 points
29 comments · 9 min read

A quick experiment on LMs’ inductive biases in performing search

Alex Mallen · Apr 14, 2024, 3:41 AM
32 points
2 comments · 4 min read

Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22]

Nov 3, 2021, 6:22 PM
95 points
4 comments · 1 min read

[Paper] Stress-testing capability elicitation with password-locked models

Jun 4, 2024, 2:52 PM
85 points
10 comments · 12 min read
(arxiv.org)

A basic systems architecture for AI agents that do autonomous research

Buck · Sep 23, 2024, 1:58 PM
189 points
16 comments · 8 min read

Measurement tampering detection as a special case of weak-to-strong generalization

Dec 23, 2023, 12:05 AM
57 points
10 comments · 4 min read

LLMs are (mostly) not helped by filler tokens

Kshitij Sachan · Aug 10, 2023, 12:48 AM
66 points
35 comments · 6 min read

Notes on control evaluations for safety cases

Feb 28, 2024, 4:15 PM
49 points
0 comments · 32 min read

Causal scrubbing: results on induction heads

Dec 3, 2022, 12:59 AM
34 points
1 comment · 17 min read

Balancing Label Quantity and Quality for Scalable Elicitation

Alex Mallen · Oct 24, 2024, 4:49 PM
31 points
1 comment · 2 min read

Practical Pitfalls of Causal Scrubbing

Mar 27, 2023, 7:47 AM
87 points
17 comments · 13 min read

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

Oct 23, 2023, 4:37 PM
107 points
3 comments · 8 min read

Measuring whether AIs can statelessly strategize to subvert security measures

Dec 19, 2024, 9:25 PM
62 points
0 comments · 11 min read

Preventing model exfiltration with upload limits

ryan_greenblatt · Feb 6, 2024, 4:29 PM
71 points
22 comments · 14 min read

Managing catastrophic misuse without robust AIs

Jan 16, 2024, 5:27 PM
63 points
17 comments · 11 min read

Causal scrubbing: Appendix

Dec 3, 2022, 12:58 AM
18 points
4 comments · 20 min read

Improving the Welfare of AIs: A Nearcasted Proposal

ryan_greenblatt · Oct 30, 2023, 2:51 PM
114 points
9 comments · 20 min read · 1 review

Why imperfect adversarial robustness doesn’t doom AI control

Nov 18, 2024, 4:05 PM
62 points
25 comments · 2 min read

Polysemanticity and Capacity in Neural Networks

Oct 7, 2022, 5:51 PM
87 points
14 comments · 3 min read

Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small

Oct 12, 2022, 9:25 PM
50 points
11 comments · 4 min read

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

Jul 26, 2023, 5:02 PM
99 points
19 comments · 1 min read · 1 review

We’re Redwood Research, we do applied alignment research, AMA

Nate Thomas · Oct 6, 2021, 5:51 AM
56 points
2 comments · 2 min read
(forum.effectivealtruism.org)

Untrusted smart models and trusted dumb models

Buck · Nov 4, 2023, 3:06 AM
87 points
17 comments · 6 min read · 1 review

Redwood Research is hiring for several roles (Operations and Technical)

Apr 14, 2022, 4:57 PM
29 points
0 comments · 1 min read

Toy models of AI control for concentrated catastrophe prevention

Feb 6, 2024, 1:38 AM
51 points
2 comments · 7 min read

Redwood Research is hiring for several roles

Nov 29, 2021, 12:16 AM
44 points
0 comments · 1 min read

Some ideas for follow-up projects to Redwood Research’s recent paper

JanB · Jun 6, 2022, 1:29 PM
10 points
0 comments · 7 min read

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

Oct 28, 2022, 11:55 PM
101 points
9 comments · 9 min read · 2 reviews
(arxiv.org)

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger · Nov 17, 2023, 5:58 PM
93 points
9 comments · 11 min read · 1 review

Access to powerful AI might make computer security radically easier

Buck · Jun 8, 2024, 6:00 AM
105 points
14 comments · 6 min read

Causal scrubbing: results on a paren balance checker

Dec 3, 2022, 12:59 AM
34 points
2 comments · 30 min read

Apply to the second iteration of the ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2]

Buck · May 6, 2022, 4:23 AM
69 points
0 comments · 6 min read

A sketch of an AI control safety case

Jan 30, 2025, 5:28 PM
57 points
0 comments · 5 min read

How to prevent collusion when using untrusted models to monitor each other

Buck · Sep 25, 2024, 6:58 PM
89 points
11 comments · 22 min read

Win/continue/lose scenarios and execute/replace/audit protocols

Buck · Nov 15, 2024, 3:47 PM
64 points
2 comments · 7 min read