Threat Models (AI)

Tag. Last edit: 14 Dec 2024 1:39 UTC by Ruby

A threat model is a story of how a particular risk (e.g. AI) plays out.

In the AI risk case, according to Rohin Shah, a threat model is ideally:

Combination of a development model that says how we get AGI and a risk model that says how AGI leads to existential catastrophe.

See also: AI Risk Concrete Stories

Another (outer) alignment failure story

paulfchristiano, 7 Apr 2021 20:12 UTC
244 points
38 comments, 12 min read, LW link, 1 review

What failure looks like

paulfchristiano, 17 Mar 2019 20:18 UTC
416 points
54 comments, 8 min read, LW link, 2 reviews

Distinguishing AI takeover scenarios

8 Sep 2021 16:19 UTC
74 points
11 comments, 14 min read, LW link

Vignettes Workshop (AI Impacts)

Daniel Kokotajlo, 15 Jun 2021 12:05 UTC
47 points
5 comments, 1 min read, LW link

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Andrew_Critch, 31 Mar 2021 23:50 UTC
280 points
65 comments, 22 min read, LW link, 1 review

AGI Ruin: A List of Lethalities

Eliezer Yudkowsky, 5 Jun 2022 22:05 UTC
915 points
703 comments, 30 min read, LW link, 3 reviews

Survey on AI existential risk scenarios

8 Jun 2021 17:12 UTC
65 points
11 comments, 7 min read, LW link

Less Realistic Tales of Doom

Mark Xu, 6 May 2021 23:01 UTC
113 points
13 comments, 4 min read, LW link

Investigating AI Takeover Scenarios

Sammy Martin, 17 Sep 2021 18:47 UTC
27 points
1 comment, 27 min read, LW link

On how various plans miss the hard bits of the alignment challenge

So8res, 12 Jul 2022 2:49 UTC
305 points
88 comments, 29 min read, LW link, 3 reviews

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis, 15 Dec 2023 20:16 UTC
124 points
156 comments, 8 min read, LW link

Rogue AGI Embodies Valuable Intellectual Property

3 Jun 2021 20:37 UTC
71 points
9 comments, 3 min read, LW link

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra, 18 Jul 2022 19:06 UTC
366 points
94 comments, 75 min read, LW link, 1 review

Clarifying AI X-risk

1 Nov 2022 11:03 UTC
127 points
24 comments, 4 min read, LW link, 1 review

AI Could Defeat All Of Us Combined

HoldenKarnofsky, 9 Jun 2022 15:50 UTC
170 points
42 comments, 17 min read, LW link
(www.cold-takes.com)

What Failure Looks Like: Distilling the Discussion

Ben Pace, 29 Jul 2020 21:49 UTC
82 points
14 comments, 7 min read, LW link

Will the growing deer prion epidemic spread to humans? Why not?

eukaryote, 25 Jun 2023 4:31 UTC
170 points
33 comments, 13 min read, LW link
(eukaryotewritesblog.com)

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes, 25 Mar 2021 13:45 UTC
74 points
40 comments, 16 min read, LW link

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

25 Nov 2022 14:36 UTC
39 points
9 comments, 6 min read, LW link
(vkrakovna.wordpress.com)

A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans

Thane Ruthenis, 17 Dec 2023 20:28 UTC
29 points
7 comments, 11 min read, LW link

“Humanity vs. AGI” Will Never Look Like “Humanity vs. AGI” to Humanity

Thane Ruthenis, 16 Dec 2023 20:08 UTC
189 points
34 comments, 5 min read, LW link

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov, 18 Jan 2024 10:05 UTC
5 points
2 comments, 4 min read, LW link

Difficulty classes for alignment properties

Jozdien, 20 Feb 2024 9:08 UTC
34 points
5 comments, 2 min read, LW link

[Question] We might be dropping the ball on Autonomous Replication and Adaptation.

31 May 2024 13:49 UTC
61 points
30 comments, 4 min read, LW link

Distinguish worst-case analysis from instrumental training-gaming

5 Sep 2024 19:13 UTC
37 points
0 comments, 5 min read, LW link

Catastrophic sabotage as a major threat model for human-level AI systems

evhub, 22 Oct 2024 20:57 UTC
91 points
11 comments, 15 min read, LW link

AI Takeover Scenario with Scaled LLMs

simeon_c, 16 Apr 2023 23:28 UTC
42 points
15 comments, 8 min read, LW link

Power-seeking can be probable and predictive for trained agents

28 Feb 2023 21:10 UTC
56 points
22 comments, 9 min read, LW link
(arxiv.org)

A Case for the Least Forgiving Take On Alignment

Thane Ruthenis, 2 May 2023 21:34 UTC
100 points
84 comments, 22 min read, LW link

Ten Levels of AI Alignment Difficulty

Sammy Martin, 3 Jul 2023 20:20 UTC
121 points
14 comments, 12 min read, LW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda, 15 Dec 2021 23:44 UTC
127 points
9 comments, 15 min read, LW link

A central AI alignment problem: capabilities generalization, and the sharp left turn

So8res, 15 Jun 2022 13:10 UTC
282 points
54 comments, 10 min read, LW link, 1 review

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

12 Aug 2022 15:17 UTC
86 points
4 comments, 3 min read, LW link, 1 review
(vkrakovna.wordpress.com)

Notes on Caution

David Gross, 1 Dec 2022 3:05 UTC
14 points
0 comments, 19 min read, LW link

AI Neorealism: a threat model & success criterion for existential safety

davidad, 15 Dec 2022 13:42 UTC
67 points
1 comment, 3 min read, LW link

Contra “Strong Coherence”

DragonGod, 4 Mar 2023 20:05 UTC
39 points
24 comments, 1 min read, LW link

[Linkpost] Some high-level thoughts on the DeepMind alignment team’s strategy

7 Mar 2023 11:55 UTC
128 points
13 comments, 5 min read, LW link
(drive.google.com)

[Question] AI interpretability could be harmful?

Roman Leventov, 10 May 2023 20:43 UTC
13 points
2 comments, 1 min read, LW link

Agentic Mess (A Failure Story)

6 Jun 2023 13:09 UTC
46 points
5 comments, 13 min read, LW link

Persuasion Tools: AI takeover without AGI or agency?

Daniel Kokotajlo, 20 Nov 2020 16:54 UTC
85 points
25 comments, 11 min read, LW link, 1 review

A Friendly Face (Another Failure Story)

20 Jun 2023 10:31 UTC
65 points
21 comments, 16 min read, LW link

The Main Sources of AI Risk?

21 Mar 2019 18:28 UTC
121 points
26 comments, 2 min read, LW link

AI x-risk, approximately ordered by embarrassment

Alex Lawsen, 12 Apr 2023 23:01 UTC
151 points
7 comments, 19 min read, LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King, 29 Jun 2023 16:56 UTC
7 points
0 comments, 2 min read, LW link

Friendly and Unfriendly AGI are Indistinguishable

ErgoEcho, 29 Dec 2022 22:13 UTC
−4 points
4 comments, 4 min read, LW link
(neologos.co)

An Overview of AI risks—the Flyer

17 Jul 2023 12:03 UTC
20 points
0 comments, 1 min read, LW link
(docs.google.com)

Gearing Up for Long Timelines in a Hard World

Dalcy, 14 Jul 2023 6:11 UTC
15 points
0 comments, 4 min read, LW link

Proof of posteriority: a defense against AI-generated misinformation

jchan, 17 Jul 2023 12:04 UTC
33 points
3 comments, 5 min read, LW link

Thoughts On (Solving) Deep Deception

Jozdien, 21 Oct 2023 22:40 UTC
69 points
4 comments, 6 min read, LW link

One Does Not Simply Replace the Humans

JerkyTreats, 6 Apr 2023 20:56 UTC
9 points
3 comments, 4 min read, LW link
(www.lesswrong.com)

The Peril of the Great Leaks (written with ChatGPT)

bvbvbvbvbvbvbvbvbvbvbv, 31 Mar 2023 18:14 UTC
3 points
1 comment, 1 min read, LW link

Deep Deceptiveness

So8res, 21 Mar 2023 2:51 UTC
238 points
59 comments, 14 min read, LW link

The Logistics of Distribution of Meaning: Against Epistemic Bureaucratization

Sahil, 7 Nov 2024 5:27 UTC
27 points
1 comment, 12 min read, LW link

The need for multi-agent experiments

Martín Soto, 1 Aug 2024 17:14 UTC
43 points
3 comments, 9 min read, LW link

[Question] Has Eliezer publicly and satisfactorily responded to attempted rebuttals of the analogy to evolution?

kaler, 28 Jul 2024 12:23 UTC
10 points
14 comments, 1 min read, LW link

Unaligned AI is coming regardless.

verbalshadow, 26 Jul 2024 16:41 UTC
−15 points
3 comments, 2 min read, LW link

[Question] Self-censoring on AI x-risk discussions?

Decaeneus, 1 Jul 2024 18:24 UTC
17 points
2 comments, 1 min read, LW link

Modeling Failure Modes of High-Level Machine Intelligence

6 Dec 2021 13:54 UTC
54 points
1 comment, 12 min read, LW link

Monthly Doom Argument Threads? Doom Argument Wiki?

LVSN, 4 Feb 2023 16:59 UTC
3 points
0 comments, 1 min read, LW link

My Overview of the AI Alignment Landscape: Threat Models

Neel Nanda, 25 Dec 2021 23:07 UTC
52 points
3 comments, 28 min read, LW link

Why rationalists should care (more) about free software

RichardJActon, 23 Jan 2022 17:31 UTC
66 points
42 comments, 5 min read, LW link

A Story of AI Risk: InstructGPT-N

peterbarnett, 26 May 2022 23:22 UTC
24 points
0 comments, 8 min read, LW link

PoMP and Circumstance: Introduction

benatkin, 9 Dec 2024 5:54 UTC
1 point
1 comment, 1 min read, LW link

Validator models: A simple approach to detecting goodharting

beren, 20 Feb 2023 21:32 UTC
14 points
1 comment, 4 min read, LW link

How Deadly Will Roughly-Human-Level AGI Be?

David Udell, 8 Aug 2022 1:59 UTC
12 points
6 comments, 1 min read, LW link

More Thoughts on the Human-AGI War

Seth Ahrenbach, 27 Dec 2023 1:03 UTC
−3 points
4 comments, 7 min read, LW link

Threat Model Literature Review

1 Nov 2022 11:03 UTC
77 points
4 comments, 25 min read, LW link

What Failure Looks Like is not an existential risk (and alignment is not the solution)

otto.barten, 2 Feb 2024 18:59 UTC
13 points
12 comments, 9 min read, LW link

AGI goal space is big, but narrowing might not be as hard as it seems.

Jacy Reese Anthis, 12 Apr 2023 19:03 UTC
15 points
0 comments, 3 min read, LW link

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

26 Jan 2024 7:22 UTC
161 points
60 comments, 57 min read, LW link

Scale Was All We Needed, At First

Gabe M, 14 Feb 2024 1:49 UTC
286 points
33 comments, 8 min read, LW link
(aiacumen.substack.com)

The basic reasons I expect AGI ruin

Rob Bensinger, 18 Apr 2023 3:37 UTC
186 points
73 comments, 14 min read, LW link

On AutoGPT

Zvi, 13 Apr 2023 12:30 UTC
248 points
47 comments, 20 min read, LW link
(thezvi.wordpress.com)

Paths to failure

25 Apr 2023 8:03 UTC
29 points
1 comment, 8 min read, LW link

[Question] Help me solve this problem: The basilisk isn’t real, but people are

canary_itm, 26 Nov 2023 17:44 UTC
−19 points
4 comments, 1 min read, LW link

[Question] What’s in your list of unsolved problems in AI alignment?

jacquesthibs, 7 Mar 2023 18:58 UTC
60 points
9 comments, 1 min read, LW link

Gradient hacking via actual hacking

Max H, 10 May 2023 1:57 UTC
12 points
7 comments, 3 min read, LW link

AGI-Automated Interpretability is Suicide

__RicG__, 10 May 2023 14:20 UTC
24 points
33 comments, 7 min read, LW link