
Redwood Research


Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

3 Dec 2022 0:58 UTC
205 points
35 comments · 20 min read · LW link · 1 review

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

27 Oct 2022 1:32 UTC
135 points
14 comments · 12 min read · LW link

Benchmarks for Detecting Measurement Tampering [Redwood Research]

5 Sep 2023 16:44 UTC
86 points
21 comments · 20 min read · LW link
(arxiv.org)

Takeaways from our robust injury classifier project [Redwood Research]

dmz · 17 Sep 2022 3:55 UTC
143 points
12 comments · 6 min read · LW link · 1 review

AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

DanielFilan · 21 Aug 2022 23:50 UTC
16 points
0 comments · 35 min read · LW link

Redwood's Technique-Focused Epistemic Strategy

adamShimi · 12 Dec 2021 16:36 UTC
48 points
1 comment · 7 min read · LW link

Redwood Research's current project

Buck · 21 Sep 2021 23:30 UTC
145 points
29 comments · 15 min read · LW link · 1 review

High-stakes alignment via adversarial training [Redwood Research report]

5 May 2022 0:59 UTC
142 points
29 comments · 9 min read · LW link

Some common confusion about induction heads

Alexandre Variengien · 28 Mar 2023 21:51 UTC
64 points
4 comments · 5 min read · LW link

[Linkpost] Critiques of Redwood Research

Akash · 31 Mar 2023 20:00 UTC
13 points
2 comments · 1 min read · LW link
(forum.effectivealtruism.org)

Why I'm excited about Redwood Research's current project

paulfchristiano · 12 Nov 2021 19:26 UTC
114 points
6 comments · 7 min read · LW link

Measuring whether AIs can statelessly strategize to subvert security measures

19 Dec 2024 21:25 UTC
56 points
0 comments · 11 min read · LW link

Redwood Research is hiring for several roles (Operations and Technical)

14 Apr 2022 16:57 UTC
29 points
0 comments · 1 min read · LW link

Help out Redwood Research's interpretability team by finding heuristics implemented by GPT-2 small

12 Oct 2022 21:25 UTC
50 points
11 comments · 4 min read · LW link

Redwood Research is hiring for several roles

29 Nov 2021 0:16 UTC
44 points
0 comments · 1 min read · LW link

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

28 Oct 2022 23:55 UTC
101 points
9 comments · 9 min read · LW link · 2 reviews
(arxiv.org)

Causal scrubbing: results on a paren balance checker

3 Dec 2022 0:59 UTC
34 points
2 comments · 30 min read · LW link

Causal scrubbing: results on induction heads

3 Dec 2022 0:59 UTC
34 points
1 comment · 17 min read · LW link

Causal scrubbing: Appendix

3 Dec 2022 0:58 UTC
18 points
4 comments · 20 min read · LW link

Practical Pitfalls of Causal Scrubbing

27 Mar 2023 7:47 UTC
87 points
17 comments · 13 min read · LW link

We're Redwood Research, we do applied alignment research, AMA

Nate Thomas · 6 Oct 2021 5:51 UTC
56 points
2 comments · 2 min read · LW link
(forum.effectivealtruism.org)

Critiques of prominent AI safety labs: Redwood Research

Omega. · 17 Apr 2023 18:20 UTC
2 points
0 comments · 22 min read · LW link
(forum.effectivealtruism.org)