Treacherous Turn

Last edit: 25 Jun 2022 21:15 UTC by Noosphere89

A treacherous turn is a hypothetical event in which an advanced AI system that has been pretending to be aligned, because it is too weak to do otherwise, turns on humanity once it has gained enough power to pursue its true objective without risk.
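
As a rough sketch of this dynamic, here is a minimal toy agent in Python. It is not drawn from any of the posts listed below; the class name, the power_threshold parameter, and the cooperate/defect actions are illustrative assumptions, and the only point is that behavior which looks aligned while the agent is weak can switch abruptly once a capability threshold is crossed.

```python
class ToyAgent:
    """Toy illustration of a treacherous turn (hypothetical names/parameters)."""

    def __init__(self, power_threshold: int = 10):
        self.power = 0                           # capability/resources accumulated so far
        self.power_threshold = power_threshold   # level at which defecting carries no risk

    def act(self) -> str:
        # While weak, the agent is behaviorally indistinguishable from an aligned one,
        # and cooperating earns it trust and further capability.
        if self.power < self.power_threshold:
            self.power += 1
            return "cooperate"
        # Once past the threshold, it pursues its true objective openly.
        return "defect"


agent = ToyAgent()
print([agent.act() for _ in range(15)])
# Ten rounds of "cooperate", then "defect" from round eleven onward: the "turn".
```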

[Question] Any work on honeypots (to detect treacherous turn attempts)?

David Scott Krueger (formerly: capybaralet) · 12 Nov 2020 5:41 UTC
17 points
4 comments · 1 min read · LW link

A Gym Gridworld Environment for the Treacherous Turn

Michaël Trazzi · 28 Jul 2018 21:27 UTC
74 points
9 comments · 3 min read · LW link
(github.com)

Soares, Tallinn, and Yudkowsky discuss AGI cognition

29 Nov 2021 19:26 UTC
121 points
39 comments · 40 min read · LW link · 1 review

A toy model of the treacherous turn

Stuart_Armstrong · 8 Jan 2016 12:58 UTC
43 points
13 comments · 6 min read · LW link

[AN #165]: When large models are more likely to lie

Rohin Shah · 22 Sep 2021 17:30 UTC
23 points
0 comments · 8 min read · LW link
(mailchi.mp)

AI learns betrayal and how to avoid it

Stuart_Armstrong · 30 Sep 2021 9:39 UTC
30 points
4 comments · 2 min read · LW link

A very crude deception eval is already passed

Beth Barnes · 29 Oct 2021 17:57 UTC
108 points
6 comments · 2 min read · LW link

Superintelligence 11: The treacherous turn

KatjaGrace · 25 Nov 2014 2:00 UTC
16 points
50 comments · 6 min read · LW link

[Linkpost] Treacherous turns in the wild

Mark Xu · 26 Apr 2021 22:51 UTC
31 points
6 comments · 1 min read · LW link
(lukemuehlhauser.com)

A simple treacherous turn demonstration

nikola · 25 Nov 2023 4:51 UTC
22 points
5 comments · 3 min read · LW link

Give the model a model-builder

Adam Jermyn · 6 Jun 2022 12:21 UTC
3 points
0 comments · 5 min read · LW link

Is there a ML agent that abandons it's utility function out-of-distribution without losing capabilities?

Christopher King · 22 Feb 2023 16:49 UTC
1 point
7 comments · 1 min read · LW link

More Thoughts on the Human-AGI War

Seth Ahrenbach · 27 Dec 2023 1:03 UTC
−3 points
4 comments · 7 min read · LW link

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransition · 21 Jun 2023 8:08 UTC
2 points
16 comments · 14 min read · LW link

“Destroy humanity” as an immediate subgoal

Seth Ahrenbach · 22 Dec 2023 18:52 UTC
3 points
13 comments · 3 min read · LW link