Adversarial Examples (AI)

Tag

Last edit: Dec 14, 2024, 1:56 AM by Ruby

SolidGoldMagikarp (plus, prompt generation)

Jessica Rumbelow and mwatkins

Feb 5, 2023, 10:02 PM

676 points

206 comments, 12 min read, LW link, 1 review

AI Safety in a World of Vulnerable Machine Learning Systems

AdamGleave and EuanMcLean

Mar 8, 2023, 2:40 AM

70 points

28 comments, 29 min read, LW link

(far.ai)

Ironing Out the Squiggles

Zack_M_Davis

Apr 29, 2024, 4:13 PM

157 points

36 comments, 11 min read, LW link

If I were a well-intentioned AI… I: Image classifier

Stuart_Armstrong

Feb 26, 2020, 12:39 PM

35 points

4 comments, 5 min read, LW link

AXRP Episode 1 - Adversarial Policies with Adam Gleave

DanielFilan

Dec 29, 2020, 8:41 PM

12 points

5 comments, 34 min read, LW link

Human beats SOTA Go AI by learning an adversarial policy

Vanessa Kosoy

Feb 19, 2023, 9:38 AM

59 points

32 comments, 1 min read, LW link

(goattack.far.ai)

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav Fort

Aug 29, 2024, 5:17 PM

89 points

8 comments, 7 min read, LW link

There are (probably) no superhuman Go AIs: strong human players beat the strongest AIs

Taran

Feb 19, 2023, 12:25 PM

125 points

34 comments, 4 min read, LW link

The Goodhart Game

John_Maxwell

Nov 18, 2019, 11:22 PM

13 points

5 comments, 5 min read, LW link

Adversarial Policies Beat Professional-Level Go AIs

sanxiyn

Nov 3, 2022, 1:27 PM

31 points

35 comments, 1 min read, LW link

(goattack.alignmentfund.org)

[Question] What progress have we made on automated auditing?

LawrenceC

Jul 6, 2024, 1:49 AM

38 points

1 comment, 1 min read, LW link

Adversarial Robustness Could Help Prevent Catastrophic Misuse

aog

Dec 11, 2023, 7:12 PM

30 points

18 comments, 9 min read, LW link

RAIN: Your Language Models Can Align Themselves without Finetuning—Microsoft Research 2023 - Reduces the adversarial prompt attack success rate from 94% to 19%!

Singularian2501

Sep 24, 2023, 4:48 PM

5 points

0 comments, 1 min read, LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper

Dec 5, 2023, 4:48 PM

126 points

30 comments, 13 min read, LW link

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Scott Emmons, Luke Bailey and Euan Ong

Sep 20, 2023, 3:23 PM

58 points

9 comments, 1 min read, LW link

(arxiv.org)

EIS IX: Interpretability and Adversaries

scasper

Feb 20, 2023, 6:25 PM

30 points

8 comments, 8 min read, LW link

The Achilles Heel Hypothesis for AI

scasper

Oct 13, 2020, 2:35 PM

20 points

6 comments, 1 min read, LW link

Adversarial attacks and optimal control

Jan

May 22, 2022, 6:22 PM

17 points

7 comments, 8 min read, LW link

(universalprior.substack.com)

Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks.

Sohaib Imran

Nov 10, 2023, 3:23 PM

11 points

0 comments, 2 min read, LW link

Beyond the Board: Exploring AI Robustness Through Go

AdamGleave

Jun 19, 2024, 4:40 PM

41 points

2 comments, 1 min read, LW link

(far.ai)

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

Can, Yeu-Tong Lau, James Dao and Jett Janiak

Aug 30, 2023, 5:36 PM

17 points

0 comments, 8 min read, LW link

(arxiv.org)

Features and Adversaries in MemoryDT

Joseph Bloom and Jay Bailey

Oct 20, 2023, 7:32 AM

31 points

6 comments, 25 min read, LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper

Feb 21, 2023, 4:59 PM

14 points

4 comments, 3 min read, LW link

Analysing Adversarial Attacks with Linear Probing

Yoann Poupart, Imene Kerboua, Clement Neo and Jason Hoelscher-Obermaier

Jun 17, 2024, 2:16 PM

9 points

0 comments, 8 min read, LW link

Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

bayesian_kitten

Dec 16, 2021, 10:41 PM

22 points

10 comments, 21 min read, LW link

[AN #62] Are adversarial examples caused by real but imperceptible features?

Rohin Shah

Aug 22, 2019, 5:10 PM

28 points

10 comments, 9 min read, LW link

(mailchi.mp)

High-stakes alignment via adversarial training [Redwood Research report]

dmz, LawrenceC and Nate Thomas

May 5, 2022, 12:59 AM

142 points

29 comments, 9 min read, LW link

A Search for More ChatGPT / GPT-3.5 / GPT-4 “Unspeakable” Glitch Tokens

Martin Fell

May 9, 2023, 2:36 PM

26 points

9 comments, 6 min read, LW link

SmartyHeaderCode: anomalous tokens for GPT3.5 and GPT-4

AdamYedidia

Apr 15, 2023, 10:35 PM

71 points

18 comments, 6 min read, LW link

Does robustness improve with scale?

ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng and AdamGleave

Jul 25, 2024, 8:55 PM

14 points

0 comments, 1 min read, LW link

(far.ai)

EIS XII: Summary

scasper

Feb 23, 2023, 5:45 PM

19 points

0 comments, 6 min read, LW link

Even Superhuman Go AIs Have Surprising Failure Modes

AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller and MichaelDennis

Jul 20, 2023, 5:31 PM

130 points

22 comments, 10 min read, LW link

(far.ai)
