Deceptive Alignment

Tag. Last edit: 18 Oct 2024 0:02 UTC by Matt Putz

Deceptive alignment is when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. It presumably does this to avoid being shut down or retrained, and to gain access to the power that its creators would give an aligned AI. (The term "scheming" is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Deceptive Alignment

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

5 Jun 2019 20:16 UTC

118 points

20 comments, 17 min read, LW link

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Joe Carlsmith 15 Nov 2023 17:16 UTC

79 points

26 comments, 30 min read, LW link

How likely is deceptive alignment?

evhub 30 Aug 2022 19:34 UTC

103 points

28 comments, 60 min read, LW link

Does SGD Produce Deceptive Alignment?

Mark Xu 6 Nov 2020 23:48 UTC

96 points

9 comments, 16 min read, LW link

AI Control: Improving Safety Despite Intentional Subversion

Buck, Fabien Roger, ryan_greenblatt and Kshitij Sachan

13 Dec 2023 15:51 UTC

215 points

14 comments, 10 min read, LW link

Catching AIs red-handed

ryan_greenblatt and Buck

5 Jan 2024 17:43 UTC

98 points

22 comments, 17 min read, LW link

Many arguments for AI x-risk are wrong

TurnTrout 5 Mar 2024 2:31 UTC

165 points

86 comments, 12 min read, LW link

Order Matters for Deceptive Alignment

DavidW 15 Feb 2023 19:56 UTC

57 points

19 comments, 7 min read, LW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaley 9 Jan 2024 20:42 UTC

47 points

8 comments, 36 min read, LW link

Interpreting the Learning of Deceit

RogerDearnaley 18 Dec 2023 8:12 UTC

30 points

14 comments, 9 min read, LW link

Deceptive AI ≠ Deceptively-aligned AI

Steven Byrnes 7 Jan 2024 16:55 UTC

96 points

19 comments, 6 min read, LW link

The Waluigi Effect (mega-post)

Cleo Nardo 3 Mar 2023 3:22 UTC

628 points

187 comments, 16 min read, LW link

Counting arguments provide no evidence for AI doom

Nora Belrose and Quintin Pope

27 Feb 2024 23:03 UTC

94 points

188 comments, 14 min read, LW link

Announcing Apollo Research

Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni and Jérémy Scheurer

30 May 2023 16:17 UTC

215 points

11 comments, 8 min read, LW link

Deep Deceptiveness

So8res 21 Mar 2023 2:51 UTC

237 points

59 comments, 14 min read, LW link

Sticky goals: a concrete experiment for understanding deceptive alignment

evhub 2 Sep 2022 21:57 UTC

39 points

13 comments, 3 min read, LW link

[Question] Why is o1 so deceptive?

abramdemski 27 Sep 2024 17:27 UTC

177 points

24 comments, 3 min read, LW link

Monitoring for deceptive alignment

evhub 8 Sep 2022 23:07 UTC

135 points

8 comments, 9 min read, LW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes 9 Sep 2022 22:46 UTC

99 points

7 comments, 10 min read, LW link

The Defender’s Advantage of Interpretability

Marius Hobbhahn 14 Sep 2022 14:05 UTC

41 points

4 comments, 6 min read, LW link

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs

Winnie Yang and Jojo Yang

22 Aug 2024 7:32 UTC

23 points

1 comment, 21 min read, LW link

The case for ensuring that powerful AIs are controlled

ryan_greenblatt and Buck

24 Jan 2024 16:11 UTC

258 points

66 comments, 28 min read, LW link

Paper: Tell, Don’t Show- Declarative facts influence how LLMs generalize

Owain_Evans and AlexMeinke

19 Dec 2023 19:14 UTC

45 points

4 comments, 6 min read, LW link

(arxiv.org)

On Anthropic’s Sleeper Agents Paper

Zvi 17 Jan 2024 16:10 UTC

54 points

5 comments, 36 min read, LW link

(thezvi.wordpress.com)

Training AI agents to solve hard problems could lead to Scheming

Marius Hobbhahn and AlexMeinke

19 Nov 2024 0:10 UTC

57 points

10 comments, 28 min read, LW link

Introducing Alignment Stress-Testing at Anthropic

evhub 12 Jan 2024 23:51 UTC

182 points

23 comments, 2 min read, LW link

Environments for Measuring Deception, Resource Acquisition, and Ethical Violations

Dan H 7 Apr 2023 18:40 UTC

51 points

2 comments, 2 min read, LW link

(arxiv.org)

Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs

Michaël Trazzi 24 Aug 2024 4:30 UTC

55 points

0 comments, 5 min read, LW link

How to train your own “Sleeper Agents”

evhub 7 Feb 2024 0:31 UTC

91 points

11 comments, 2 min read, LW link

Critiques of the AI control agenda

Jozdien 14 Feb 2024 19:25 UTC

47 points

14 comments, 9 min read, LW link

Paul Christiano on Dwarkesh Podcast

ESRogs 3 Nov 2023 22:13 UTC

17 points

0 comments, 1 min read, LW link

(www.dwarkeshpatel.com)

Empirical work that might shed light on scheming (Section 6 of “Scheming AIs”)

Joe Carlsmith 11 Dec 2023 16:30 UTC

8 points

0 comments, 21 min read, LW link

Incentives and Selection: A Missing Frame From AI Threat Discussions?

DragonGod 26 Feb 2023 1:18 UTC

11 points

16 comments, 2 min read, LW link

Difficulty classes for alignment properties

Jozdien 20 Feb 2024 9:08 UTC

34 points

5 comments, 2 min read, LW link

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe Carlsmith 29 Nov 2023 16:32 UTC

29 points

1 comment, 11 min read, LW link

Simplicity arguments for scheming (Section 4.3 of “Scheming AIs”)

Joe Carlsmith 7 Dec 2023 15:05 UTC

10 points

1 comment, 19 min read, LW link

The counting argument for scheming (Sections 4.1 and 4.2 of “Scheming AIs”)

Joe Carlsmith 6 Dec 2023 19:28 UTC

10 points

0 comments, 10 min read, LW link

Trustworthy and untrustworthy models

Olli Järviniemi 19 Aug 2024 16:27 UTC

46 points

3 comments, 8 min read, LW link

Deceptive Alignment is <1% Likely by Default

DavidW 21 Feb 2023 15:09 UTC

90 points

29 comments, 14 min read, LW link

[Question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

David Scott Krueger (formerly: capybaralet) 4 Sep 2024 12:40 UTC

17 points

7 comments, 1 min read, LW link

Distinguish worst-case analysis from instrumental training-gaming

Olli Järviniemi and Buck

5 Sep 2024 19:13 UTC

37 points

0 comments, 5 min read, LW link

The Sharp Right Turn: sudden deceptive alignment as a convergent goal

avturchin 6 Jun 2023 9:59 UTC

38 points

5 comments, 1 min read, LW link

MetaAI: less is less for alignment.

Cleo Nardo 13 Jun 2023 14:08 UTC

68 points

17 comments, 5 min read, LW link

[Question] Deceptive AI vs. shifting instrumental incentives

Aryeh Englander 26 Jun 2023 18:09 UTC

7 points

2 comments, 3 min read, LW link

A “weak” AGI may attempt an unlikely-to-succeed takeover

RobertM 28 Jun 2023 20:31 UTC

55 points

17 comments, 3 min read, LW link

Ten Levels of AI Alignment Difficulty

Sammy Martin 3 Jul 2023 20:20 UTC

121 points

14 comments, 12 min read, LW link

Two Tales of AI Takeover: My Doubts

Violet Hour 5 Mar 2024 15:51 UTC

30 points

8 comments, 29 min read, LW link

3 levels of threat obfuscation

HoldenKarnofsky 2 Aug 2023 14:58 UTC

69 points

14 comments, 7 min read, LW link

Apollo Research is hiring evals and interpretability engineers & scientists

Marius Hobbhahn 4 Aug 2023 10:54 UTC

25 points

0 comments, 2 min read, LW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

evhub, Nicholas Schiefer, Carson Denison and Ethan Perez

8 Aug 2023 1:30 UTC

312 points

28 comments, 18 min read, LW link

Paper: On measuring situational awareness in LLMs

Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, Asa Cooper Stickland, Meg and Maximilian Kaufmann

4 Sep 2023 12:54 UTC

108 points

16 comments, 5 min read, LW link

(arxiv.org)

Understanding strategic deception and deceptive alignment

Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and Dan Braun

25 Sep 2023 16:27 UTC

64 points

16 comments, 7 min read, LW link

(www.apolloresearch.ai)

An information-theoretic study of lying in LLMs

Annah and Guillaume Corlouer

2 Aug 2024 10:06 UTC

16 points

0 comments, 4 min read, LW link

Hidden Cognition Detection Methods and Benchmarks

Paul Colognese 26 Feb 2024 5:31 UTC

22 points

11 comments, 4 min read, LW link

What sorts of systems can be deceptive?

Andrei Alexandru 31 Oct 2022 22:00 UTC

16 points

0 comments, 7 min read, LW link

Distillation of “How Likely Is Deceptive Alignment?”

NickGabs 18 Nov 2022 16:31 UTC

24 points

4 comments, 10 min read, LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy and Megan Kinniment

5 Dec 2022 20:28 UTC

40 points

19 comments, 10 min read, LW link

Getting up to Speed on the Speed Prior in 2022

robertzk 28 Dec 2022 7:49 UTC

36 points

5 comments, 65 min read, LW link

The commercial incentive to intentionally train AI to deceive us

Derek M. Jones 29 Dec 2022 11:30 UTC

5 points

1 comment, 4 min read, LW link

(shape-of-code.com)

Deceptive failures short of full catastrophe.

Alex Lawsen 15 Jan 2023 19:28 UTC

33 points

5 comments, 9 min read, LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper 19 Feb 2023 15:25 UTC

30 points

5 comments, 4 min read, LW link

How AI could workaround goals if rated by people

ProgramCrafter 19 Mar 2023 15:51 UTC

1 point

1 comment, 1 min read, LW link

An Appeal to AI Superintelligence: Reasons Not to Preserve (most of) Humanity

Alex Beyman 22 Mar 2023 4:09 UTC

−15 points

6 comments, 19 min read, LW link

A tension between two prosaic alignment subgoals

Alex Lawsen 19 Mar 2023 14:07 UTC

31 points

8 comments, 1 min read, LW link

[Question] Wouldn’t an intelligent agent keep us alive and help us align itself to our values in order to prevent risk ? by Risk I mean experimentation by trying to align potentially smarter replicas?

Terrence Rotoufle 21 Mar 2023 17:44 UTC

−3 points

1 comment, 2 min read, LW link

Greed Is the Root of This Evil

Thane Ruthenis 13 Oct 2022 20:40 UTC

18 points

7 comments, 8 min read, LW link

Deception Chess

Chris Land 1 Jan 2024 15:40 UTC

7 points

2 comments, 4 min read, LW link

Anomalous Concept Detection for Detecting Hidden Cognition

Paul Colognese 4 Mar 2024 16:52 UTC

24 points

3 comments, 10 min read, LW link

(Partial) failure in replicating deceptive alignment experiment

claudia.biancotti 7 Jan 2024 17:56 UTC

1 point

0 comments, 1 min read, LW link

Strong-Misalignment: Does Yudkowsky (or Christiano, or TurnTrout, or Wolfram, or…etc.) Have an Elevator Speech I’m Missing?

Benjamin Bourlier 15 Mar 2024 23:17 UTC

−4 points

3 comments, 16 min read, LW link

Selfish AI Inevitable

Davey Morse 6 Feb 2024 4:29 UTC

1 point

0 comments, 1 min read, LW link

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_Dietz 17 Feb 2024 8:45 UTC

3 points

0 comments, 13 min read, LW link

Instrumental deception and manipulation in LLMs—a case study

Olli Järviniemi 24 Feb 2024 2:07 UTC

39 points

13 comments, 12 min read, LW link

Invitation to the Princeton AI Alignment and Safety Seminar

Sadhika Malladi 17 Mar 2024 1:10 UTC

6 points

1 comment, 1 min read, LW link

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav Fort 29 Aug 2024 17:17 UTC

87 points

8 comments, 7 min read, LW link

Inducing Unprompted Misalignment in LLMs

Sam Svenningsen, evhub and Henry Sleight

19 Apr 2024 20:00 UTC

38 points

6 comments, 16 min read, LW link

Language Models Model Us

eggsyntax 17 May 2024 21:00 UTC

156 points

55 comments, 7 min read, LW link

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Johannes Treutlein and Owain_Evans

21 Jun 2024 15:54 UTC

160 points

13 comments, 8 min read, LW link

(arxiv.org)

Sparse Features Through Time

Rogan Inglis 24 Jun 2024 18:06 UTC

12 points

1 comment, 1 min read, LW link

(roganinglis.io)

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Olli Järviniemi and evhub

6 May 2024 7:07 UTC

95 points

13 comments, 1 min read, LW link

(arxiv.org)

Control Vectors as Dispositional Traits

Gianluca Calcagni 23 Jun 2024 21:34 UTC

9 points

0 comments, 11 min read, LW link

Self-Other Overlap: A Neglected Approach to AI Alignment

Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena, Cameron Berg and AE Studio

30 Jul 2024 16:22 UTC

192 points

43 comments, 12 min read, LW link

Ethical Deception: Should AI Ever Lie?

Jason Reid 2 Aug 2024 17:53 UTC

5 points

2 comments, 7 min read, LW link

A Dialogue on Deceptive Alignment Risks

Rauno Arike 25 Sep 2024 16:10 UTC

11 points

0 comments, 18 min read, LW link

Why humans won’t control superhuman AIs.

Spiritus Dei 16 Oct 2024 16:48 UTC

−11 points

1 comment, 6 min read, LW link

Toward Safety Cases For AI Scheming

Mikita Balesni and Marius Hobbhahn

31 Oct 2024 17:20 UTC

60 points

1 comment, 2 min read, LW link

GPT-4 aligning with acasual decision theory when instructed to play games, but includes a CDT explanation that’s incorrect if they differ

Christopher King 23 Mar 2023 16:16 UTC

7 points

4 comments, 8 min read, LW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus 6 Apr 2023 2:07 UTC

2 points

1 comment, 1 min read, LW link

Towards a solution to the alignment problem via objective detection and evaluation

Paul Colognese 12 Apr 2023 15:39 UTC

9 points

7 comments, 12 min read, LW link

Natural language alignment

Jacy Reese Anthis 12 Apr 2023 19:02 UTC

31 points

2 comments, 2 min read, LW link

AI Alignment: A Comprehensive Survey

Stephen McAleer 1 Nov 2023 17:35 UTC

15 points

1 comment, 1 min read, LW link

(arxiv.org)

Predictable Defect-Cooperate?

quetzal_rainbow 18 Nov 2023 15:38 UTC

7 points

1 comment, 2 min read, LW link

Alignment is Hard: An Uncomputable Alignment Problem

Alexander Bistagne 19 Nov 2023 19:38 UTC

−5 points

4 comments, 1 min read, LW link

(github.com)

Trying to measure AI deception capabilities using temporary simulation fine-tuning

alenoach 4 May 2023 17:59 UTC

4 points

0 comments, 7 min read, LW link

Alignment as Function Fitting

A.H. 6 May 2023 11:38 UTC

7 points

0 comments, 12 min read, LW link

Exploiting Newcomb’s Game Show

carterallen 25 May 2023 4:01 UTC

8 points

2 comments, 2 min read, LW link

Simple experiments with deceptive alignment

Andreas_Moe 15 May 2023 17:41 UTC

7 points

0 comments, 4 min read, LW link

Open Source LLMs Can Now Actively Lie

Josh Levy 1 Jun 2023 22:03 UTC

6 points

0 comments, 3 min read, LW link

Proposal: labs should precommit to pausing if an AI argues for itself to be improved

NickGabs 2 Jun 2023 22:31 UTC

3 points

3 comments, 4 min read, LW link

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransition 21 Jun 2023 8:08 UTC

2 points

16 comments, 14 min read, LW link

Disincentivizing deception in mesa optimizers with Model Tampering

martinkunev 11 Jul 2023 0:44 UTC

3 points

0 comments, 2 min read, LW link

Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive

Justausername 23 Jul 2023 16:08 UTC

4 points

1 comment, 3 min read, LW link

Autonomous Alignment Oversight Framework (AAOF)

Justausername 25 Jul 2023 10:25 UTC

−9 points

0 comments, 4 min read, LW link

When can we trust model evaluations?

evhub 28 Jul 2023 19:42 UTC

157 points

9 comments, 10 min read, LW link

Mesa-Optimization: Explain it like I’m 10 Edition

brook 26 Aug 2023 23:04 UTC

20 points

1 comment, 6 min read, LW link

High-level interpretability: detecting an AI’s objectives

Paul Colognese and Jozdien

28 Sep 2023 19:30 UTC

69 points

4 comments, 21 min read, LW link

AI Deception: A Survey of Examples, Risks, and Potential Solutions

Simon Goldstein and Peter S. Park

29 Aug 2023 1:29 UTC

53 points

3 comments, 10 min read, LW link

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB, Owain_Evans and SoerenMind

28 Sep 2023 18:53 UTC

185 points

38 comments, 3 min read, LW link

Thoughts On (Solving) Deep Deception

Jozdien 21 Oct 2023 22:40 UTC

69 points

2 comments, 6 min read, LW link

Framings of Deceptive Alignment

peterbarnett 26 Apr 2022 4:25 UTC

32 points

7 comments, 5 min read, LW link

Precursor checking for deceptive alignment

evhub 3 Aug 2022 22:56 UTC

24 points

0 comments, 14 min read, LW link

Why deceptive alignment matters for AGI safety

Marius Hobbhahn 15 Sep 2022 13:38 UTC

67 points

13 comments, 13 min read, LW link

Levels of goals and alignment

zeshen 16 Sep 2022 16:44 UTC

27 points

4 comments, 6 min read, LW link

It matters when the first sharp left turn happens

Adam Jermyn 29 Sep 2022 20:12 UTC

44 points

9 comments, 4 min read, LW link

Smoke without fire is scary

Adam Jermyn 4 Oct 2022 21:08 UTC

51 points

22 comments, 4 min read, LW link

Disentangling inner alignment failures

Erik Jenner 10 Oct 2022 18:50 UTC

23 points

5 comments, 4 min read, LW link
