Adversarial Training
Tag. Last edit: 3 Jun 2022 1:30 UTC by Ruby
Adversarial Robustness Could Help Prevent Catastrophic Misuse
aogara
11 Dec 2023 19:12 UTC
30
points
18
comments
9
min read
LW
link
Some thoughts on why adversarial training might be useful
Beth Barnes
8 Dec 2021 1:28 UTC
9
points
6
comments
3
min read
LW
link
Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort
29 Aug 2024 17:17 UTC
87
points
8
comments
7
min read
LW
link
Ironing Out the Squiggles
Zack_M_Davis
29 Apr 2024 16:13 UTC
153
points
36
comments
11
min read
LW
link
Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper
30 Jul 2024 14:57 UTC
25
points
0
comments
4
min read
LW
link
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming
Buck
10 Oct 2024 13:36 UTC
100
points
4
comments
13
min read
LW
link
High-stakes alignment via adversarial training [Redwood Research report]
dmz, LawrenceC and Nate Thomas
5 May 2022 0:59 UTC
142
points
29
comments
9
min read
LW
link
AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training
Charbel-Raphaël
31 Oct 2023 14:34 UTC
17
points
0
comments
19
min read
LW
link
Deep Forgetting & Unlearning for Safely-Scoped LLMs
scasper
5 Dec 2023 16:48 UTC
122
points
29
comments
13
min read
LW
link
Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing
Buck
2 Jun 2022 23:48 UTC
38
points
0
comments
3
min read
LW
link
AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler
DanielFilan
21 Aug 2022 23:50 UTC
16
points
0
comments
35
min read
LW
link
Takeaways from our robust injury classifier project [Redwood Research]
dmz
17 Sep 2022 3:55 UTC
143
points
12
comments
6
min read
LW
link
1
review
Beyond the Board: Exploring AI Robustness Through Go
AdamGleave
19 Jun 2024 16:40 UTC
41
points
2
comments
1
min read
LW
link
(far.ai)
Does robustness improve with scale?
ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng and AdamGleave
25 Jul 2024 20:55 UTC
14
points
0
comments
1
min read
LW
link
(far.ai)
EIS XI: Moving Forward
scasper
22 Feb 2023 19:05 UTC
19
points
2
comments
9
min read
LW
link
Latent Adversarial Training
Adam Jermyn
29 Jun 2022 20:04 UTC
50
points
13
comments
5
min read
LW
link
EIS XII: Summary
scasper
23 Feb 2023 17:45 UTC
18
points
0
comments
6
min read
LW
link
Oversight Leagues: The Training Game as a Feature
Paul Bricman
9 Sep 2022 10:08 UTC
20
points
6
comments
10
min read
LW
link
EIS IX: Interpretability and Adversaries
scasper
20 Feb 2023 18:25 UTC
30
points
7
comments
8
min read
LW
link
Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI
Benaya Koren
8 Jul 2023 17:32 UTC
6
points
0
comments
9
min read
LW
link