Adversarial Examples (AI)

Tag. Last edit: Dec 14, 2024, 1:56 AM by Ruby

SolidGoldMagikarp (plus, prompt generation)

Feb 5, 2023, 10:02 PM
679 points
206 comments, 12 min read, 1 review

Ironing Out the Squiggles

Zack_M_Davis
Apr 29, 2024, 4:13 PM
157 points
36 comments, 11 min read

AI Safety in a World of Vulnerable Machine Learning Systems

Mar 8, 2023, 2:40 AM
70 points
28 comments, 29 min read
(far.ai)

If I were a well-intentioned AI… I: Image classifier

Stuart_Armstrong
Feb 26, 2020, 12:39 PM
35 points
4 comments, 5 min read

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav Fort
Aug 29, 2024, 5:17 PM
87 points
8 comments, 7 min read

[Question] What progress have we made on automated auditing?

LawrenceC
Jul 6, 2024, 1:49 AM
38 points
1 comment, 1 min read

AXRP Episode 1 - Adversarial Policies with Adam Gleave

DanielFilan
Dec 29, 2020, 8:41 PM
12 points
5 comments, 34 min read

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper
Dec 5, 2023, 4:48 PM
123 points
30 comments, 13 min read

The Goodhart Game

John_Maxwell
Nov 18, 2019, 11:22 PM
13 points
5 comments, 5 min read

Adversarial Robustness Could Help Prevent Catastrophic Misuse

aogara
Dec 11, 2023, 7:12 PM
30 points
18 comments, 9 min read

There are (probably) no superhuman Go AIs: strong human players beat the strongest AIs

Taran
Feb 19, 2023, 12:25 PM
124 points
34 comments, 4 min read

RAIN: Your Language Models Can Align Themselves without Finetuning - Microsoft Research 2023 - Reduces the adversarial prompt attack success rate from 94% to 19%!

Singularian2501
Sep 24, 2023, 4:48 PM
5 points
0 comments, 1 min read

Adversarial Policies Beat Professional-Level Go AIs

sanxiyn
Nov 3, 2022, 1:27 PM
31 points
35 comments, 1 min read
(goattack.alignmentfund.org)

Human beats SOTA Go AI by learning an adversarial policy

Vanessa Kosoy
Feb 19, 2023, 9:38 AM
58 points
32 comments, 1 min read
(goattack.far.ai)

Adversarial attacks and optimal control

Jan
May 22, 2022, 6:22 PM
17 points
7 comments, 8 min read
(universalprior.substack.com)

A Search for More ChatGPT / GPT-3.5 / GPT-4 “Unspeakable” Glitch Tokens

Martin Fell
May 9, 2023, 2:36 PM
26 points
9 comments, 6 min read

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper
Feb 21, 2023, 4:59 PM
14 points
4 comments, 3 min read

EIS XII: Summary

scasper
Feb 23, 2023, 5:45 PM
18 points
0 comments, 6 min read

[AN #62] Are adversarial examples caused by real but imperceptible features?

Rohin Shah
Aug 22, 2019, 5:10 PM
28 points
10 comments, 9 min read
(mailchi.mp)

Even Superhuman Go AIs Have Surprising Failure Modes

Jul 20, 2023, 5:31 PM
129 points
22 comments, 10 min read
(far.ai)

Analysing Adversarial Attacks with Linear Probing

Jun 17, 2024, 2:16 PM
9 points
0 comments, 8 min read

The Achilles Heel Hypothesis for AI

scasper
Oct 13, 2020, 2:35 PM
20 points
6 comments, 1 min read

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

Aug 30, 2023, 5:36 PM
17 points
0 comments, 8 min read
(arxiv.org)

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Sep 20, 2023, 3:23 PM
58 points
9 comments, 1 min read
(arxiv.org)

EIS IX: Interpretability and Adversaries

scasper
Feb 20, 2023, 6:25 PM
30 points
8 comments, 8 min read

Beyond the Board: Exploring AI Robustness Through Go

AdamGleave
Jun 19, 2024, 4:40 PM
41 points
2 comments, 1 min read
(far.ai)

Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

bayesian_kitten
Dec 16, 2021, 10:41 PM
22 points
10 comments, 21 min read

Does robustness improve with scale?

Jul 25, 2024, 8:55 PM
14 points
0 comments, 1 min read
(far.ai)

High-stakes alignment via adversarial training [Redwood Research report]

May 5, 2022, 12:59 AM
142 points
29 comments, 9 min read

SmartyHeaderCode: anomalous tokens for GPT3.5 and GPT-4

AdamYedidia
Apr 15, 2023, 10:35 PM
71 points
18 comments, 6 min read

Features and Adversaries in MemoryDT

Oct 20, 2023, 7:32 AM
31 points
6 comments, 25 min read

Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks.

Sohaib Imran
Nov 10, 2023, 3:23 PM
11 points
0 comments, 2 min read