Adversarial Training (Tag)
Last edit: Jun 3, 2022, 1:30 AM by Ruby
- Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing · Buck · Jun 2, 2022, 11:48 PM · 42 points · 0 comments · 3 min read · LW link
- Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming · Buck · Oct 10, 2024, 1:36 PM · 100 points · 4 comments · 13 min read · LW link
- Solving adversarial attacks in computer vision as a baby version of general AI alignment · Stanislav Fort · Aug 29, 2024, 5:17 PM · 88 points · 8 comments · 7 min read · LW link
- Ironing Out the Squiggles · Zack_M_Davis · Apr 29, 2024, 4:13 PM · 157 points · 36 comments · 11 min read · LW link
- AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler · DanielFilan · Aug 21, 2022, 11:50 PM · 16 points · 0 comments · 35 min read · LW link
- AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training · Charbel-Raphaël · Oct 31, 2023, 2:34 PM · 17 points · 0 comments · 19 min read · LW link
- Takeaways from our robust injury classifier project [Redwood Research] · dmz · Sep 17, 2022, 3:55 AM · 143 points · 12 comments · 6 min read · LW link · 1 review
- Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? · scasper · Jul 30, 2024, 2:57 PM · 25 points · 0 comments · 4 min read · LW link
- High-stakes alignment via adversarial training [Redwood Research report] · dmz, LawrenceC and Nate Thomas · May 5, 2022, 12:59 AM · 142 points · 29 comments · 9 min read · LW link
- Adversarial Robustness Could Help Prevent Catastrophic Misuse · aog · Dec 11, 2023, 7:12 PM · 30 points · 18 comments · 9 min read · LW link
- Some thoughts on why adversarial training might be useful · Beth Barnes · Dec 8, 2021, 1:28 AM · 9 points · 6 comments · 3 min read · LW link
- Deep Forgetting & Unlearning for Safely-Scoped LLMs · scasper · Dec 5, 2023, 4:48 PM · 126 points · 30 comments · 13 min read · LW link
- EIS IX: Interpretability and Adversaries · scasper · Feb 20, 2023, 6:25 PM · 30 points · 8 comments · 8 min read · LW link
- Beyond the Board: Exploring AI Robustness Through Go · AdamGleave · Jun 19, 2024, 4:40 PM · 41 points · 2 comments · 1 min read · LW link (far.ai)
- Latent Adversarial Training (LAT) Improves the Representation of Refusal · alexandraabbas, nlpet and hal2k · Jan 6, 2025, 10:24 AM · 20 points · 6 comments · 10 min read · LW link
- Does robustness improve with scale? · ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng and AdamGleave · Jul 25, 2024, 8:55 PM · 14 points · 0 comments · 1 min read · LW link (far.ai)
- The Theory Behind Loss Curves · James Camacho · May 6, 2025, 10:22 PM · 16 points · 1 comment · 4 min read · LW link (github.com)
- A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives · Knight Lee · Apr 14, 2025, 10:27 AM · −3 points · 2 comments · 4 min read · LW link
- Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI · Benaya Koren · Jul 8, 2023, 5:32 PM · 6 points · 0 comments · 9 min read · LW link
- Latent Adversarial Training · Adam Jermyn · Jun 29, 2022, 8:04 PM · 52 points · 13 comments · 5 min read · LW link
- Oversight Leagues: The Training Game as a Feature · Paul Bricman · Sep 9, 2022, 10:08 AM · 20 points · 6 comments · 10 min read · LW link
- EIS XI: Moving Forward · scasper · Feb 22, 2023, 7:05 PM · 19 points · 2 comments · 9 min read · LW link
- EIS XII: Summary · scasper · Feb 23, 2023, 5:45 PM · 19 points · 0 comments · 6 min read · LW link