RSS

Treach­er­ous Turn

TagLast edit: Dec 30, 2024, 9:54 AM by Dakara

Treacherous Turn is a hypothetical event where an advanced AI system which has been pretending to be aligned due to its relative weakness turns on humanity once it achieves sufficient power that it can pursue its true objective without risk.

[Question] Any work on hon­ey­pots (to de­tect treach­er­ous turn at­tempts)?

David Scott Krueger (formerly: capybaralet)Nov 12, 2020, 5:41 AM
17 points
4 comments1 min readLW link

A Gym Grid­world En­vi­ron­ment for the Treach­er­ous Turn

Michaël TrazziJul 28, 2018, 9:27 PM
74 points
9 comments3 min readLW link
(github.com)

Soares, Tal­linn, and Yud­kowsky dis­cuss AGI cognition

Nov 29, 2021, 7:26 PM
121 points
39 comments40 min readLW link1 review

A toy model of the treach­er­ous turn

Stuart_ArmstrongJan 8, 2016, 12:58 PM
43 points
13 comments6 min readLW link

[AN #165]: When large mod­els are more likely to lie

Rohin ShahSep 22, 2021, 5:30 PM
23 points
0 comments8 min readLW link
(mailchi.mp)

AI learns be­trayal and how to avoid it

Stuart_ArmstrongSep 30, 2021, 9:39 AM
30 points
4 comments2 min readLW link

A very crude de­cep­tion eval is already passed

Beth BarnesOct 29, 2021, 5:57 PM
108 points
6 comments2 min readLW link

Su­per­in­tel­li­gence 11: The treach­er­ous turn

KatjaGraceNov 25, 2014, 2:00 AM
16 points
50 comments6 min readLW link

[Linkpost] Treach­er­ous turns in the wild

Mark XuApr 26, 2021, 10:51 PM
31 points
6 comments1 min readLW link
(lukemuehlhauser.com)

A sim­ple treach­er­ous turn demonstration

Nikola JurkovicNov 25, 2023, 4:51 AM
22 points
5 comments3 min readLW link

Give the model a model-builder

Adam JermynJun 6, 2022, 12:21 PM
3 points
0 comments5 min readLW link

Is there a ML agent that aban­dons it’s util­ity func­tion out-of-dis­tri­bu­tion with­out los­ing ca­pa­bil­ities?

Christopher KingFeb 22, 2023, 4:49 PM
1 point
7 comments1 min readLW link

More Thoughts on the Hu­man-AGI War

Seth AhrenbachDec 27, 2023, 1:03 AM
−3 points
4 comments7 min readLW link

A way to make solv­ing al­ign­ment 10.000 times eas­ier. The shorter case for a mas­sive open source sim­box pro­ject.

AlexFromSafeTransitionJun 21, 2023, 8:08 AM
2 points
16 comments14 min readLW link

“De­stroy hu­man­ity” as an im­me­di­ate subgoal

Seth AhrenbachDec 22, 2023, 6:52 PM
3 points
13 comments3 min readLW link
No comments.