Adversarial Training (Tag)
Last edit: Jun 3, 2022, 1:30 AM by Ruby
- Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing · Buck · Jun 2, 2022, 11:48 PM · 42 points · 0 comments · 3 min read · LW link
- Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming · Buck · Oct 10, 2024, 1:36 PM · 100 points · 4 comments · 13 min read · LW link
- Solving adversarial attacks in computer vision as a baby version of general AI alignment · Stanislav Fort · Aug 29, 2024, 5:17 PM · 88 points · 8 comments · 7 min read · LW link
- Ironing Out the Squiggles · Zack_M_Davis · Apr 29, 2024, 4:13 PM · 157 points · 36 comments · 11 min read · LW link
- AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler · DanielFilan · Aug 21, 2022, 11:50 PM · 16 points · 0 comments · 35 min read · LW link
- AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training · Charbel-Raphaël · Oct 31, 2023, 2:34 PM · 17 points · 0 comments · 19 min read · LW link
- Takeaways from our robust injury classifier project [Redwood Research] · dmz · Sep 17, 2022, 3:55 AM · 143 points · 12 comments · 6 min read · LW link · 1 review
- Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? · scasper · Jul 30, 2024, 2:57 PM · 25 points · 0 comments · 4 min read · LW link
- High-stakes alignment via adversarial training [Redwood Research report] · dmz, LawrenceC and Nate Thomas · May 5, 2022, 12:59 AM · 142 points · 29 comments · 9 min read · LW link
- Adversarial Robustness Could Help Prevent Catastrophic Misuse · aog · Dec 11, 2023, 7:12 PM · 30 points · 18 comments · 9 min read · LW link
- Some thoughts on why adversarial training might be useful · Beth Barnes · Dec 8, 2021, 1:28 AM · 9 points · 6 comments · 3 min read · LW link
- Deep Forgetting & Unlearning for Safely-Scoped LLMs · scasper · Dec 5, 2023, 4:48 PM · 126 points · 30 comments · 13 min read · LW link
- EIS IX: Interpretability and Adversaries · scasper · Feb 20, 2023, 6:25 PM · 30 points · 8 comments · 8 min read · LW link
- Beyond the Board: Exploring AI Robustness Through Go · AdamGleave · Jun 19, 2024, 4:40 PM · 41 points · 2 comments · 1 min read · LW link (far.ai)
- Latent Adversarial Training (LAT) Improves the Representation of Refusal · alexandraabbas, nlpet and hal2k · Jan 6, 2025, 10:24 AM · 20 points · 6 comments · 10 min read · LW link
- Does robustness improve with scale? · ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng and AdamGleave · Jul 25, 2024, 8:55 PM · 14 points · 0 comments · 1 min read · LW link (far.ai)
- The Theory Behind Loss Curves · James Camacho · May 6, 2025, 10:22 PM · 16 points · 1 comment · 4 min read · LW link (github.com)
- A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives · Knight Lee · Apr 14, 2025, 10:27 AM · −3 points · 2 comments · 4 min read · LW link
- Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI · Benaya Koren · Jul 8, 2023, 5:32 PM · 6 points · 0 comments · 9 min read · LW link
- Latent Adversarial Training · Adam Jermyn · Jun 29, 2022, 8:04 PM · 52 points · 13 comments · 5 min read · LW link
- Oversight Leagues: The Training Game as a Feature · Paul Bricman · Sep 9, 2022, 10:08 AM · 20 points · 6 comments · 10 min read · LW link
- EIS XI: Moving Forward · scasper · Feb 22, 2023, 7:05 PM · 19 points · 2 comments · 9 min read · LW link
- EIS XII: Summary · scasper · Feb 23, 2023, 5:45 PM · 19 points · 0 comments · 6 min read · LW link