
Adversarial Training

Tag. Last edit: Jun 3, 2022, 1:30 AM by Ruby

Adversarial Robustness Could Help Prevent Catastrophic Misuse

aogara, Dec 11, 2023, 7:12 PM
30 points
18 comments, 9 min read, LW link

Some thoughts on why adversarial training might be useful

Beth Barnes, Dec 8, 2021, 1:28 AM
9 points
6 comments, 3 min read, LW link

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav Fort, Aug 29, 2024, 5:17 PM
87 points
8 comments, 7 min read, LW link

Ironing Out the Squiggles

Zack_M_Davis, Apr 29, 2024, 4:13 PM
157 points
36 comments, 11 min read, LW link

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

scasper, Jul 30, 2024, 2:57 PM
25 points
0 comments, 4 min read, LW link

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming

Buck, Oct 10, 2024, 1:36 PM
100 points
4 comments, 13 min read, LW link

High-stakes alignment via adversarial training [Redwood Research report]

May 5, 2022, 12:59 AM
142 points
29 comments, 9 min read, LW link

AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training

Charbel-Raphaël, Oct 31, 2023, 2:34 PM
17 points
0 comments, 19 min read, LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper, Dec 5, 2023, 4:48 PM
123 points
30 comments, 13 min read, LW link

Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing

Buck, Jun 2, 2022, 11:48 PM
38 points
0 comments, 3 min read, LW link

AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

DanielFilan, Aug 21, 2022, 11:50 PM
16 points
0 comments, 35 min read, LW link

Takeaways from our robust injury classifier project [Redwood Research]

dmz, Sep 17, 2022, 3:55 AM
143 points
12 comments, 6 min read, LW link, 1 review

Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI

Benaya Koren, Jul 8, 2023, 5:32 PM
6 points
0 comments, 9 min read, LW link

Beyond the Board: Exploring AI Robustness Through Go

AdamGleave, Jun 19, 2024, 4:40 PM
41 points
2 comments, 1 min read, LW link
(far.ai)

Latent Adversarial Training (LAT) Improves the Representation of Refusal

Jan 6, 2025, 10:24 AM
20 points
6 comments, 10 min read, LW link

Does robustness improve with scale?

Jul 25, 2024, 8:55 PM
14 points
0 comments, 1 min read, LW link
(far.ai)

EIS XI: Moving Forward

scasper, Feb 22, 2023, 7:05 PM
19 points
2 comments, 9 min read, LW link

Latent Adversarial Training

Adam Jermyn, Jun 29, 2022, 8:04 PM
52 points
13 comments, 5 min read, LW link

EIS XII: Summary

scasper, Feb 23, 2023, 5:45 PM
18 points
0 comments, 6 min read, LW link

Oversight Leagues: The Training Game as a Feature

Paul Bricman, Sep 9, 2022, 10:08 AM
20 points
6 comments, 10 min read, LW link

EIS IX: Interpretability and Adversaries

scasper, Feb 20, 2023, 6:25 PM
30 points
8 comments, 8 min read, LW link