Threat Models (AI)

Tag

Last edit: Dec 14, 2024, 1:39 AM by Ruby

A threat model is a story of how a particular risk (e.g. AI) plays out.

In the AI risk case, according to Rohin Shah, a threat model is ideally:

Combination of a development model that says how we get AGI and a risk model that says how AGI leads to existential catastrophe.

See also: AI Risk Concrete Stories

Another (outer) alignment failure story

paulfchristiano

Apr 7, 2021, 8:12 PM

247 points

38 comments · 12 min read · LW link · 1 review

What failure looks like

paulfchristiano

Mar 17, 2019, 8:18 PM

432 points

55 comments · 8 min read · LW link · 2 reviews

Distinguishing AI takeover scenarios

Sam Clarke and Sammy Martin

Sep 8, 2021, 4:19 PM

74 points

11 comments · 14 min read · LW link

Vignettes Workshop (AI Impacts)

Daniel Kokotajlo

Jun 15, 2021, 12:05 PM

48 points

6 comments · 1 min read · LW link

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Andrew_Critch

Mar 31, 2021, 11:50 PM

282 points

65 comments · 22 min read · LW link · 1 review

AGI Ruin: A List of Lethalities

Eliezer Yudkowsky

Jun 5, 2022, 10:05 PM

936 points

708 comments · 30 min read · LW link · 3 reviews

Investigating AI Takeover Scenarios

Sammy Martin

Sep 17, 2021, 6:47 PM

31 points

1 comment · 27 min read · LW link

Survey on AI existential risk scenarios

Sam Clarke, apc and Jonas Schuett

Jun 8, 2021, 5:12 PM

65 points

11 comments · 7 min read · LW link

On how various plans miss the hard bits of the alignment challenge

So8res

Jul 12, 2022, 2:49 AM

313 points

89 comments · 29 min read · LW link · 3 reviews

Less Realistic Tales of Doom

Mark Xu

May 6, 2021, 11:01 PM

113 points

13 comments · 4 min read · LW link

Rogue AGI Embodies Valuable Intellectual Property

Mark Xu and CarlShulman

Jun 3, 2021, 8:37 PM

71 points

9 comments · 3 min read · LW link

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis

Dec 15, 2023, 8:16 PM

131 points

157 comments · 8 min read · LW link · 1 review

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Ajeya Cotra

Jul 18, 2022, 7:06 PM

368 points

95 comments · 75 min read · LW link · 1 review

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes

Mar 25, 2021, 1:45 PM

74 points

40 comments · 16 min read · LW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

Nov 1, 2022, 11:03 AM

127 points

24 comments · 4 min read · LW link · 1 review

AI Could Defeat All Of Us Combined

HoldenKarnofsky

Jun 9, 2022, 3:50 PM

170 points

42 comments · 17 min read · LW link

(www.cold-takes.com)

Will the growing deer prion epidemic spread to humans? Why not?

eukaryote

Jun 25, 2023, 4:31 AM

170 points

33 comments · 13 min read · LW link

(eukaryotewritesblog.com)

What Failure Looks Like: Distilling the Discussion

Ben Pace

Jul 29, 2020, 9:49 PM

82 points

14 comments · 7 min read · LW link

[Linkpost] Some high-level thoughts on the DeepMind alignment team’s strategy

Vika and Rohin Shah

Mar 7, 2023, 11:55 AM

128 points

13 comments · 5 min read · LW link

(drive.google.com)

Distinguish worst-case analysis from instrumental training-gaming

Olli Järviniemi and Buck

Sep 5, 2024, 7:13 PM

37 points

0 comments · 5 min read · LW link

Prioritizing threats for AI control

ryan_greenblatt

Mar 19, 2025, 5:09 PM

48 points

2 comments · 10 min read · LW link

Catastrophic sabotage as a major threat model for human-level AI systems

evhub

Oct 22, 2024, 8:57 PM

92 points

13 comments · 15 min read · LW link

“Humanity vs. AGI” Will Never Look Like “Humanity vs. AGI” to Humanity

Thane Ruthenis

Dec 16, 2023, 8:08 PM

191 points

34 comments · 5 min read · LW link

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov

Jan 18, 2024, 10:05 AM

5 points

2 comments · 4 min read · LW link

A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans

Thane Ruthenis

Dec 17, 2023, 8:28 PM

29 points

7 comments · 11 min read · LW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda

Dec 15, 2021, 11:44 PM

127 points

9 comments · 15 min read · LW link

AI Takeover Scenario with Scaled LLMs

simeon_c

Apr 16, 2023, 11:28 PM

42 points

15 comments · 8 min read · LW link

Power-seeking can be probable and predictive for trained agents

Feb 28, 2023, 9:10 PM

56 points

22 comments · 9 min read · LW link

(arxiv.org)

A Case for the Least Forgiving Take On Alignment

Thane Ruthenis

May 2, 2023, 9:34 PM

100 points

85 comments · 22 min read · LW link

A central AI alignment problem: capabilities generalization, and the sharp left turn

So8res

Jun 15, 2022, 1:10 PM

272 points

55 comments · 10 min read · LW link · 1 review

Difficulty classes for alignment properties

Jozdien

Feb 20, 2024, 9:08 AM

34 points

5 comments · 2 min read · LW link

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

Vika, Vikrant Varma, Ramana Kumar and Mary Phuong

Aug 12, 2022, 3:17 PM

86 points

4 comments · 3 min read · LW link · 1 review

(vkrakovna.wordpress.com)

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Vika, Vikrant Varma, Ramana Kumar and Rohin Shah

Nov 25, 2022, 2:36 PM

39 points

9 comments · 6 min read · LW link

(vkrakovna.wordpress.com)

Notes on Caution

David Gross

Dec 1, 2022, 3:05 AM

14 points

0 comments · 19 min read · LW link

AI Neorealism: a threat model & success criterion for existential safety

davidad

Dec 15, 2022, 1:42 PM

67 points

1 comment · 3 min read · LW link

[Question] We might be dropping the ball on Autonomous Replication and Adaptation.

Charbel-Raphaël and Épiphanie Gédéon

May 31, 2024, 1:49 PM

61 points

30 comments · 4 min read · LW link

Ten Levels of AI Alignment Difficulty

Sammy Martin

Jul 3, 2023, 8:20 PM

130 points

24 comments · 12 min read · LW link · 1 review

Contra “Strong Coherence”

DragonGod

Mar 4, 2023, 8:05 PM

39 points

24 comments · 1 min read · LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King

Jun 29, 2023, 4:56 PM

7 points

0 comments · 2 min read · LW link

An Overview of AI risks—the Flyer

Charbel-Raphaël, Jonathan Claybrough and tchauvin

Jul 17, 2023, 12:03 PM

20 points

0 comments · 1 min read · LW link

(docs.google.com)

Gearing Up for Long Timelines in a Hard World

Dalcy

Jul 14, 2023, 6:11 AM

15 points

0 comments · 4 min read · LW link

Proof of posteriority: a defense against AI-generated misinformation

jchan

Jul 17, 2023, 12:04 PM

33 points

3 comments · 5 min read · LW link

Thoughts On (Solving) Deep Deception

Jozdien

Oct 21, 2023, 10:40 PM

72 points

6 comments · 6 min read · LW link

Modeling Failure Modes of High-Level Machine Intelligence

Ben Cottier, Daniel_Eth and Sammy Martin

Dec 6, 2021, 1:54 PM

54 points

1 comment · 12 min read · LW link

My Overview of the AI Alignment Landscape: Threat Models

Neel Nanda

Dec 25, 2021, 11:07 PM

53 points

3 comments · 28 min read · LW link

Why rationalists should care (more) about free software

RichardJActon

Jan 23, 2022, 5:31 PM

66 points

42 comments · 5 min read · LW link

A Story of AI Risk: InstructGPT-N

peterbarnett

May 26, 2022, 11:22 PM

24 points

0 comments · 8 min read · LW link

How Deadly Will Roughly-Human-Level AGI Be?

David Udell

Aug 8, 2022, 1:59 AM

12 points

6 comments · 1 min read · LW link

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

Nov 1, 2022, 11:03 AM

78 points

4 comments · 25 min read · LW link

Friendly and Unfriendly AGI are Indistinguishable

ErgoEcho

Dec 29, 2022, 10:13 PM

−4 points

4 comments · 4 min read · LW link

(neologos.co)

Monthly Doom Argument Threads? Doom Argument Wiki?

LVSN

Feb 4, 2023, 4:59 PM

3 points

0 comments · 1 min read · LW link

Validator models: A simple approach to detecting goodharting

beren

Feb 20, 2023, 9:32 PM

14 points

1 comment · 4 min read · LW link

[Question] What’s in your list of unsolved problems in AI alignment?

jacquesthibs

Mar 7, 2023, 6:58 PM

60 points

9 comments · 1 min read · LW link

Scale Was All We Needed, At First

Gabe M

Feb 14, 2024, 1:49 AM

295 points

34 comments · 8 min read · LW link

(aiacumen.substack.com)

More Thoughts on the Human-AGI War

Seth Ahrenbach

Dec 27, 2023, 1:03 AM

−3 points

4 comments · 7 min read · LW link

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

Jeremy Gillen and peterbarnett

Jan 26, 2024, 7:22 AM

161 points

60 comments · 57 min read · LW link

What Failure Looks Like is not an existential risk (and alignment is not the solution)

otto.barten

Feb 2, 2024, 6:59 PM

13 points

12 comments · 9 min read · LW link

PoMP and Circumstance: Introduction

benatkin

Dec 9, 2024, 5:54 AM

1 point

1 comment · 1 min read · LW link

Boundary Conditions: A Solution to the Symbol Grounding Problem, and a Warning

ISC

Apr 8, 2025, 6:42 AM

1 point

0 comments · 5 min read · LW link

[Question] Self-censoring on AI x-risk discussions?

Decaeneus

Jul 1, 2024, 6:24 PM

17 points

2 comments · 1 min read · LW link

Unaligned AGI & Brief History of Inequality

ank

Feb 22, 2025, 4:26 PM

−20 points

4 comments · 7 min read · LW link

Unaligned AI is coming regardless.

verbalshadow

Jul 26, 2024, 4:41 PM

−15 points

3 comments · 2 min read · LW link

[Question] Has Eliezer publicly and satisfactorily responded to attempted rebuttals of the analogy to evolution?

kaler

Jul 28, 2024, 12:23 PM

10 points

14 comments · 1 min read · LW link

The need for multi-agent experiments

Martín Soto

Aug 1, 2024, 5:14 PM

43 points

3 comments · 9 min read · LW link

The Logistics of Distribution of Meaning: Against Epistemic Bureaucratization

Sahil

Nov 7, 2024, 5:27 AM

27 points

7 comments · 12 min read · LW link

Give Neo a Chance

ank

Mar 6, 2025, 1:48 AM

3 points

7 comments · 7 min read · LW link

[Question] How can humanity survive a multipolar AGI scenario?

Leonard Holloway

Jan 9, 2025, 8:17 PM

13 points

8 comments · 2 min read · LW link

[Question] Could AGI result in a Dark Forest type of situation?

MagpieJack

Feb 12, 2025, 8:36 PM

1 point

0 comments · 1 min read · LW link

Rational Effective Utopia & Narrow Way There: Multiversal AI Alignment, Place AI, New Ethicophysics… (Updated)

ank

Feb 11, 2025, 3:21 AM

13 points

8 comments · 35 min read · LW link

Deep Deceptiveness

So8res

Mar 21, 2023, 2:51 AM

251 points

60 comments · 14 min read · LW link · 1 review

The Peril of the Great Leaks (written with ChatGPT)

bvbvbvbvbvbvbvbvbvbvbv

Mar 31, 2023, 6:14 PM

3 points

1 comment · 1 min read · LW link

One Does Not Simply Replace the Humans

JerkyTreats

Apr 6, 2023, 8:56 PM

9 points

3 comments · 4 min read · LW link

(www.lesswrong.com)

AI x-risk, approximately ordered by embarrassment

Alex Lawsen

Apr 12, 2023, 11:01 PM

151 points

7 comments · 19 min read · LW link

AGI goal space is big, but narrowing might not be as hard as it seems.

Jacy Reese Anthis

Apr 12, 2023, 7:03 PM

15 points

0 comments · 3 min read · LW link

The basic reasons I expect AGI ruin

Rob Bensinger

Apr 18, 2023, 3:37 AM

189 points

73 comments · 14 min read · LW link

On AutoGPT

Zvi

Apr 13, 2023, 12:30 PM

248 points

47 comments · 20 min read · LW link

(thezvi.wordpress.com)

Paths to failure

Karl von Wendt and mespa

Apr 25, 2023, 8:03 AM

29 points

1 comment · 8 min read · LW link

[Question] Help me solve this problem: The basilisk isn’t real, but people are

canary_itm

Nov 26, 2023, 5:44 PM

−19 points

4 comments · 1 min read · LW link

Gradient hacking via actual hacking

Max H

May 10, 2023, 1:57 AM

12 points

7 comments · 3 min read · LW link

AGI-Automated Interpretability is Suicide

__RicG__

May 10, 2023, 2:20 PM

25 points

33 comments · 7 min read · LW link

[Question] AI interpretability could be harmful?

Roman Leventov

May 10, 2023, 8:43 PM

13 points

2 comments · 1 min read · LW link

Agentic Mess (A Failure Story)

Karl von Wendt, Sofia Bharadia, PeterDrotos, Artem Korotkov, mespa and mruwnik

Jun 6, 2023, 1:09 PM

46 points

5 comments · 13 min read · LW link

Persuasion Tools: AI takeover without AGI or agency?

Daniel Kokotajlo

Nov 20, 2020, 4:54 PM

85 points

25 comments · 11 min read · LW link · 1 review

A Friendly Face (Another Failure Story)

Karl von Wendt, Sofia Bharadia, PeterDrotos, Artem Korotkov, mespa and mruwnik

Jun 20, 2023, 10:31 AM

65 points

21 comments · 16 min read · LW link

The Main Sources of AI Risk?

Daniel Kokotajlo and Wei Dai

Mar 21, 2019, 6:28 PM

126 points

26 comments · 2 min read · LW link

NunoSempere 13 Aug 2021 11:05 UTC
1 point
I spent ten minutes trying to find this tag; it might be a good idea to give it an easier-to-find name, like “Tales of AI”.