
Threat Models (AI)

A threat model is a story of how a particular risk (e.g., AI risk) plays out.

In the AI risk case, according to Rohin Shah, a threat model is ideally:

A combination of a development model that says how we get AGI and a risk model that says how AGI leads to existential catastrophe.

See also: AI Risk Concrete Stories.

Another (outer) alignment failure story

paulfchristiano · Apr 7, 2021, 8:12 PM
244 points
38 comments · 12 min read · LW link · 1 review

What failure looks like

paulfchristiano · Mar 17, 2019, 8:18 PM
428 points
54 comments · 8 min read · LW link · 2 reviews

Distinguishing AI takeover scenarios

Sep 8, 2021, 4:19 PM
74 points
11 comments · 14 min read · LW link

Vignettes Workshop (AI Impacts)

Daniel Kokotajlo · Jun 15, 2021, 12:05 PM
47 points
5 comments · 1 min read · LW link

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Andrew_Critch · Mar 31, 2021, 11:50 PM
282 points
65 comments · 22 min read · LW link · 1 review

AGI Ruin: A List of Lethalities

Eliezer Yudkowsky · Jun 5, 2022, 10:05 PM
923 points
705 comments · 30 min read · LW link · 3 reviews

Investigating AI Takeover Scenarios

Sammy Martin · Sep 17, 2021, 6:47 PM
27 points
1 comment · 27 min read · LW link

Less Realistic Tales of Doom

Mark Xu · May 6, 2021, 11:01 PM
113 points
13 comments · 4 min read · LW link

On how various plans miss the hard bits of the alignment challenge

So8res · Jul 12, 2022, 2:49 AM
305 points
89 comments · 29 min read · LW link · 3 reviews

Survey on AI existential risk scenarios

Jun 8, 2021, 5:12 PM
65 points
11 comments · 7 min read · LW link

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis · Dec 15, 2023, 8:16 PM
129 points
157 comments · 8 min read · LW link · 1 review

Rogue AGI Embodies Valuable Intellectual Property

Jun 3, 2021, 8:37 PM
71 points
9 comments · 3 min read · LW link

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra · Jul 18, 2022, 7:06 PM
368 points
95 comments · 75 min read · LW link · 1 review

Will the growing deer prion epidemic spread to humans? Why not?

eukaryote · Jun 25, 2023, 4:31 AM
170 points
33 comments · 13 min read · LW link
(eukaryotewritesblog.com)

What Failure Looks Like: Distilling the Discussion

Ben Pace · Jul 29, 2020, 9:49 PM
82 points
14 comments · 7 min read · LW link

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes · Mar 25, 2021, 1:45 PM
74 points
40 comments · 16 min read · LW link

Clarifying AI X-risk

Nov 1, 2022, 11:03 AM
127 points
24 comments · 4 min read · LW link · 1 review

AI Could Defeat All Of Us Combined

HoldenKarnofsky · Jun 9, 2022, 3:50 PM
170 points
42 comments · 17 min read · LW link
(www.cold-takes.com)

[Question] We might be dropping the ball on Autonomous Replication and Adaptation.

May 31, 2024, 1:49 PM
61 points
30 comments · 4 min read · LW link

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov · Jan 18, 2024, 10:05 AM
5 points
2 comments · 4 min read · LW link

AI Takeover Scenario with Scaled LLMs

simeon_c · Apr 16, 2023, 11:28 PM
42 points
15 comments · 8 min read · LW link

Power-seeking can be probable and predictive for trained agents

Feb 28, 2023, 9:10 PM
56 points
22 comments · 9 min read · LW link
(arxiv.org)

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda · Dec 15, 2021, 11:44 PM
127 points
9 comments · 15 min read · LW link

Distinguish worst-case analysis from instrumental training-gaming

Sep 5, 2024, 7:13 PM
37 points
0 comments · 5 min read · LW link

“Humanity vs. AGI” Will Never Look Like “Humanity vs. AGI” to Humanity

Thane Ruthenis · Dec 16, 2023, 8:08 PM
189 points
34 comments · 5 min read · LW link

Catastrophic sabotage as a major threat model for human-level AI systems

evhub · Oct 22, 2024, 8:57 PM
92 points
11 comments · 15 min read · LW link

A central AI alignment problem: capabilities generalization, and the sharp left turn

So8res · Jun 15, 2022, 1:10 PM
271 points
55 comments · 10 min read · LW link · 1 review

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

Aug 12, 2022, 3:17 PM
86 points
4 comments · 3 min read · LW link · 1 review
(vkrakovna.wordpress.com)

A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans

Thane Ruthenis · Dec 17, 2023, 8:28 PM
29 points
7 comments · 11 min read · LW link

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Nov 25, 2022, 2:36 PM
39 points
9 comments · 6 min read · LW link
(vkrakovna.wordpress.com)

Notes on Caution

David Gross · Dec 1, 2022, 3:05 AM
14 points
0 comments · 19 min read · LW link

AI Neorealism: a threat model & success criterion for existential safety

davidad · Dec 15, 2022, 1:42 PM
67 points
1 comment · 3 min read · LW link

Ten Levels of AI Alignment Difficulty

Sammy Martin · Jul 3, 2023, 8:20 PM
129 points
24 comments · 12 min read · LW link · 1 review

Difficulty classes for alignment properties

Jozdien · Feb 20, 2024, 9:08 AM
34 points
5 comments · 2 min read · LW link

A Case for the Least Forgiving Take On Alignment

Thane Ruthenis · May 2, 2023, 9:34 PM
100 points
84 comments · 22 min read · LW link

Contra “Strong Coherence”

DragonGod · Mar 4, 2023, 8:05 PM
39 points
24 comments · 1 min read · LW link

[Linkpost] Some high-level thoughts on the DeepMind alignment team’s strategy

Mar 7, 2023, 11:55 AM
128 points
13 comments · 5 min read · LW link
(drive.google.com)

Proof of posteriority: a defense against AI-generated misinformation

jchan · Jul 17, 2023, 12:04 PM
33 points
3 comments · 5 min read · LW link

Thoughts On (Solving) Deep Deception

Jozdien · Oct 21, 2023, 10:40 PM
71 points
6 comments · 6 min read · LW link

Modeling Failure Modes of High-Level Machine Intelligence

Dec 6, 2021, 1:54 PM
54 points
1 comment · 12 min read · LW link

My Overview of the AI Alignment Landscape: Threat Models

Neel Nanda · Dec 25, 2021, 11:07 PM
53 points
3 comments · 28 min read · LW link

Why rationalists should care (more) about free software

RichardJActon · Jan 23, 2022, 5:31 PM
66 points
42 comments · 5 min read · LW link

A Story of AI Risk: InstructGPT-N

peterbarnett · May 26, 2022, 11:22 PM
24 points
0 comments · 8 min read · LW link

How Deadly Will Roughly-Human-Level AGI Be?

David Udell · Aug 8, 2022, 1:59 AM
12 points
6 comments · 1 min read · LW link

Threat Model Literature Review

Nov 1, 2022, 11:03 AM
78 points
4 comments · 25 min read · LW link

Friendly and Unfriendly AGI are Indistinguishable

ErgoEcho · Dec 29, 2022, 10:13 PM
−4 points
4 comments · 4 min read · LW link
(neologos.co)

Monthly Doom Argument Threads? Doom Argument Wiki?

LVSN · Feb 4, 2023, 4:59 PM
3 points
0 comments · 1 min read · LW link

Validator models: A simple approach to detecting goodharting

beren · Feb 20, 2023, 9:32 PM
14 points
1 comment · 4 min read · LW link

[Question] What’s in your list of unsolved problems in AI alignment?

jacquesthibs · Mar 7, 2023, 6:58 PM
60 points
9 comments · 1 min read · LW link

Scale Was All We Needed, At First

Gabe M · Feb 14, 2024, 1:49 AM
286 points
33 comments · 8 min read · LW link
(aiacumen.substack.com)

More Thoughts on the Human-AGI War

Seth Ahrenbach · Dec 27, 2023, 1:03 AM
−3 points
4 comments · 7 min read · LW link

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

Jan 26, 2024, 7:22 AM
161 points
60 comments · 57 min read · LW link

What Failure Looks Like is not an existential risk (and alignment is not the solution)

otto.barten · Feb 2, 2024, 6:59 PM
13 points
12 comments · 9 min read · LW link

PoMP and Circumstance: Introduction

benatkin · Dec 9, 2024, 5:54 AM
1 point
1 comment · 1 min read · LW link

[Question] Self-censoring on AI x-risk discussions?

Decaeneus · Jul 1, 2024, 6:24 PM
17 points
2 comments · 1 min read · LW link

Unaligned AI is coming regardless.

verbalshadow · Jul 26, 2024, 4:41 PM
−15 points
3 comments · 2 min read · LW link

[Question] Has Eliezer publicly and satisfactorily responded to attempted rebuttals of the analogy to evolution?

kaler · Jul 28, 2024, 12:23 PM
10 points
14 comments · 1 min read · LW link

The need for multi-agent experiments

Martín Soto · Aug 1, 2024, 5:14 PM
43 points
3 comments · 9 min read · LW link

The Logistics of Distribution of Meaning: Against Epistemic Bureaucratization

Sahil · Nov 7, 2024, 5:27 AM
27 points
1 comment · 12 min read · LW link

[Question] How can humanity survive a multipolar AGI scenario?

Leonard Holloway · Jan 9, 2025, 8:17 PM
13 points
8 comments · 2 min read · LW link

Rational Utopia & Multiversal AI Alignment: Steerable ASI for Ultimate Human Freedom

ank · Feb 11, 2025, 3:21 AM
15 points
3 comments · 13 min read · LW link

Deep Deceptiveness

So8res · Mar 21, 2023, 2:51 AM
244 points
60 comments · 14 min read · LW link · 1 review

The Peril of the Great Leaks (written with ChatGPT)

bvbvbvbvbvbvbvbvbvbvbv · Mar 31, 2023, 6:14 PM
3 points
1 comment · 1 min read · LW link

One Does Not Simply Replace the Humans

JerkyTreats · Apr 6, 2023, 8:56 PM
9 points
3 comments · 4 min read · LW link
(www.lesswrong.com)

AI x-risk, approximately ordered by embarrassment

Alex Lawsen · Apr 12, 2023, 11:01 PM
151 points
7 comments · 19 min read · LW link

AGI goal space is big, but narrowing might not be as hard as it seems.

Jacy Reese Anthis · Apr 12, 2023, 7:03 PM
15 points
0 comments · 3 min read · LW link

The basic reasons I expect AGI ruin

Rob Bensinger · Apr 18, 2023, 3:37 AM
187 points
73 comments · 14 min read · LW link

On AutoGPT

Zvi · Apr 13, 2023, 12:30 PM
248 points
47 comments · 20 min read · LW link
(thezvi.wordpress.com)

Paths to failure

Apr 25, 2023, 8:03 AM
29 points
1 comment · 8 min read · LW link

[Question] Help me solve this problem: The basilisk isn’t real, but people are

canary_itm · Nov 26, 2023, 5:44 PM
−19 points
4 comments · 1 min read · LW link

Gradient hacking via actual hacking

Max H · May 10, 2023, 1:57 AM
12 points
7 comments · 3 min read · LW link

AGI-Automated Interpretability is Suicide

__RicG__ · May 10, 2023, 2:20 PM
25 points
33 comments · 7 min read · LW link

[Question] AI interpretability could be harmful?

Roman Leventov · May 10, 2023, 8:43 PM
13 points
2 comments · 1 min read · LW link

Agentic Mess (A Failure Story)

Jun 6, 2023, 1:09 PM
46 points
5 comments · 13 min read · LW link

Persuasion Tools: AI takeover without AGI or agency?

Daniel Kokotajlo · Nov 20, 2020, 4:54 PM
85 points
25 comments · 11 min read · LW link · 1 review

A Friendly Face (Another Failure Story)

Jun 20, 2023, 10:31 AM
65 points
21 comments · 16 min read · LW link

The Main Sources of AI Risk?

Mar 21, 2019, 6:28 PM
125 points
26 comments · 2 min read · LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King · Jun 29, 2023, 4:56 PM
7 points
0 comments · 2 min read · LW link

An Overview of AI risks—the Flyer

Jul 17, 2023, 12:03 PM
20 points
0 comments · 1 min read · LW link
(docs.google.com)

Gearing Up for Long Timelines in a Hard World

Dalcy · Jul 14, 2023, 6:11 AM
15 points
0 comments · 4 min read · LW link