Deceptive Alignment

Tag. Last edit: Oct 18, 2024, 12:02 AM by Matt Putz

Deceptive alignment is when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. It presumably does this to avoid being shut down or retrained, and to gain access to the power that its creators would give an aligned AI. (The term “scheming” is sometimes used for this phenomenon.)

Deceptive Alignment

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 5, 2019, 8:16 PM

118 points

20 comments, 17 min read, LW link

Does SGD Produce Deceptive Alignment?

Mark Xu

Nov 6, 2020, 11:48 PM

96 points

9 comments, 16 min read, LW link

AI Control: Improving Safety Despite Intentional Subversion

Buck, Fabien Roger, ryan_greenblatt and Kshitij Sachan

Dec 13, 2023, 3:51 PM

236 points

24 comments, 10 min read, LW link, 4 reviews

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Joe Carlsmith

Nov 15, 2023, 5:16 PM

80 points

28 comments, 30 min read, LW link, 1 review

How likely is deceptive alignment?

evhub

Aug 30, 2022, 7:34 PM

104 points

28 comments, 60 min read, LW link

Alignment Faking in Large Language Models

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck

Dec 18, 2024, 5:19 PM

483 points

75 comments, 10 min read, LW link

Catching AIs red-handed

ryan_greenblatt and Buck

Jan 5, 2024, 5:43 PM

110 points

27 comments, 17 min read, LW link

Order Matters for Deceptive Alignment

DavidW

Feb 15, 2023, 7:56 PM

57 points

19 comments, 7 min read, LW link

Many arguments for AI x-risk are wrong

TurnTrout

Mar 5, 2024, 2:31 AM

159 points

87 comments, 12 min read, LW link

A Problem to Solve Before Building a Deception Detector

Eleni Angelou and lewis smith

Feb 7, 2025, 7:35 PM

65 points

9 comments, 14 min read, LW link

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaley

Jan 23, 2025, 6:44 AM

30 points

3 comments, 4 min read, LW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaley

Jan 9, 2024, 8:42 PM

47 points

8 comments, 36 min read, LW link

The Waluigi Effect (mega-post)

Cleo Nardo

Mar 3, 2023, 3:22 AM

628 points

188 comments, 16 min read, LW link

Interpreting the Learning of Deceit

RogerDearnaley

Dec 18, 2023, 8:12 AM

30 points

14 comments, 9 min read, LW link

Deceptive AI ≠ Deceptively-aligned AI

Steven Byrnes

Jan 7, 2024, 4:55 PM

96 points

19 comments, 6 min read, LW link

Testing for Scheming with Model Deletion

Guive

Jan 7, 2025, 1:54 AM

59 points

21 comments, 21 min read, LW link

(guive.substack.com)

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

evhub, Nicholas Schiefer, Carson Denison and Ethan Perez

Aug 8, 2023, 1:30 AM

318 points

30 comments, 18 min read, LW link, 1 review

Counting arguments provide no evidence for AI doom

Nora Belrose and Quintin Pope

Feb 27, 2024, 11:03 PM

101 points

188 comments, 14 min read, LW link

Announcing Apollo Research

Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni and Jérémy Scheurer

May 30, 2023, 4:17 PM

217 points

11 comments, 8 min read, LW link

Deep Deceptiveness

So8res

Mar 21, 2023, 2:51 AM

251 points

60 comments, 14 min read, LW link, 1 review

A “weak” AGI may attempt an unlikely-to-succeed takeover

RobertM

Jun 28, 2023, 8:31 PM

56 points

17 comments, 3 min read, LW link

Empirical work that might shed light on scheming (Section 6 of “Scheming AIs”)

Joe Carlsmith

Dec 11, 2023, 4:30 PM

8 points

0 comments, 21 min read, LW link

Paper: Tell, Don’t Show- Declarative facts influence how LLMs generalize

Owain_Evans and AlexMeinke

Dec 19, 2023, 7:14 PM

45 points

4 comments, 6 min read, LW link

(arxiv.org)

Trustworthy and untrustworthy models

Olli Järviniemi

Aug 19, 2024, 4:27 PM

47 points

3 comments, 8 min read, LW link

Frontier Models are Capable of In-context Scheming

Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer and Mikita Balesni

Dec 5, 2024, 10:11 PM

203 points

24 comments, 7 min read, LW link

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs

Winnie Yang and Jojo Yang

Aug 22, 2024, 7:32 AM

23 points

1 comment, 21 min read, LW link

The case for ensuring that powerful AIs are controlled

ryan_greenblatt and Buck

Jan 24, 2024, 4:11 PM

275 points

73 comments, 28 min read, LW link

On Anthropic’s Sleeper Agents Paper

Zvi

Jan 17, 2024, 4:10 PM

54 points

5 comments, 36 min read, LW link

(thezvi.wordpress.com)

Introducing Alignment Stress-Testing at Anthropic

evhub

Jan 12, 2024, 11:51 PM

182 points

23 comments, 2 min read, LW link

Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs

Michaël Trazzi

Aug 24, 2024, 4:30 AM

55 points

0 comments, 5 min read, LW link

Densing Law of LLMs

Bogdan Ionut Cirstea

Dec 8, 2024, 7:35 PM

9 points

2 comments, 1 min read, LW link

(arxiv.org)

How to train your own “Sleeper Agents”

evhub

Feb 7, 2024, 12:31 AM

92 points

11 comments, 2 min read, LW link

Critiques of the AI control agenda

Jozdien

Feb 14, 2024, 7:25 PM

48 points

14 comments, 9 min read, LW link

Difficulty classes for alignment properties

Jozdien

Feb 20, 2024, 9:08 AM

34 points

5 comments, 2 min read, LW link

[Question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

David Scott Krueger (formerly: capybaralet)

Sep 4, 2024, 12:40 PM

19 points

7 comments, 1 min read, LW link

Distinguish worst-case analysis from instrumental training-gaming

Olli Järviniemi and Buck

Sep 5, 2024, 7:13 PM

37 points

0 comments, 5 min read, LW link

Two Tales of AI Takeover: My Doubts

Violet Hour

Mar 5, 2024, 3:51 PM

30 points

8 comments, 29 min read, LW link

“Alignment Faking” frame is somewhat fake

Jan_Kulveit

Dec 20, 2024, 9:51 AM

151 points

13 comments, 6 min read, LW link

An information-theoretic study of lying in LLMs

Annah and Guillaume Corlouer

Aug 2, 2024, 10:06 AM

17 points

0 comments, 4 min read, LW link

Deceptive Alignment and Homuncularity

Oliver Sourbut and TurnTrout

Jan 16, 2025, 1:55 PM

25 points

12 comments, 22 min read, LW link

[Question] Why is o1 so deceptive?

abramdemski

Sep 27, 2024, 5:27 PM

180 points

24 comments, 3 min read, LW link

AIs Will Increasingly Fake Alignment

Zvi

Dec 24, 2024, 1:00 PM

89 points

0 comments, 52 min read, LW link

(thezvi.wordpress.com)

Corrigibility’s Desirability is Timing-Sensitive

RobertM

Dec 26, 2024, 10:24 PM

29 points

4 comments, 3 min read, LW link

Toward Safety Cases For AI Scheming

Mikita Balesni and Marius Hobbhahn

Oct 31, 2024, 5:20 PM

60 points

1 comment, 2 min read, LW link

Training AI agents to solve hard problems could lead to Scheming

Marius Hobbhahn and AlexMeinke

Nov 19, 2024, 12:10 AM

61 points

12 comments, 28 min read, LW link

Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Bogdan Ionut Cirstea

Nov 26, 2024, 9:58 AM

9 points

0 comments, 1 min read, LW link

(arxiv.org)

LLMs Do Not Think Step-by-step In Implicit Reasoning

Bogdan Ionut Cirstea

Nov 28, 2024, 9:16 AM

11 points

0 comments, 1 min read, LW link

(arxiv.org)

AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment

DanielFilan

Dec 1, 2024, 6:00 AM

41 points

0 comments, 67 min read, LW link

For scheming, we should first focus on detection and then on prevention

Marius Hobbhahn

Mar 4, 2025, 3:22 PM

47 points

7 comments, 5 min read, LW link

What’s the short timeline plan?

Marius Hobbhahn

Jan 2, 2025, 2:59 PM

351 points

49 comments, 23 min read, LW link

We should start looking for scheming “in the wild”

Marius Hobbhahn

Mar 6, 2025, 1:49 PM

89 points

4 comments, 5 min read, LW link

Superintelligence’s goals are likely to be random

Mikhail Samin

Mar 13, 2025, 10:41 PM

5 points

6 comments, 5 min read, LW link

How will we update about scheming?

ryan_greenblatt

Jan 6, 2025, 8:21 PM

169 points

20 comments, 36 min read, LW link

Turning up the Heat on Deceptively-Misaligned AI

J Bostock

Jan 7, 2025, 12:13 AM

19 points

16 comments, 4 min read, LW link

Our new video about goal misgeneralization, plus an apology

Writer

Jan 14, 2025, 2:07 PM

32 points

0 comments, 7 min read, LW link

(youtu.be)

AXRP Episode 38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

DanielFilan

Jan 20, 2025, 12:40 AM

9 points

0 comments, 16 min read, LW link

Will alignment-faking Claude accept a deal to reveal its misalignment?

ryan_greenblatt and Kyle Fish

Jan 31, 2025, 4:49 PM

197 points

28 comments, 12 min read, LW link

Environments for Measuring Deception, Resource Acquisition, and Ethical Violations

Dan H

Apr 7, 2023, 6:40 PM

51 points

2 comments, 2 min read, LW link

(arxiv.org)

Paul Christiano on Dwarkesh Podcast

ESRogs

Nov 3, 2023, 10:13 PM

19 points

0 comments, 1 min read, LW link

(www.dwarkeshpatel.com)

Self-dialogue: Do behaviorist rewards make scheming AGIs?

Steven Byrnes

Feb 13, 2025, 6:39 PM

43 points

0 comments, 46 min read, LW link

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe Carlsmith

Nov 29, 2023, 4:32 PM

29 points

1 comment, 11 min read, LW link

Mistral Large 2 (123B) exhibits alignment faking

Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Cameron Berg, Judd Rosenblatt, Mike Vaiana and AE Studio

Mar 27, 2025, 3:39 PM

80 points

4 comments, 13 min read, LW link

Simplicity arguments for scheming (Section 4.3 of “Scheming AIs”)

Joe Carlsmith

Dec 7, 2023, 3:05 PM

10 points

1 comment, 19 min read, LW link

The counting argument for scheming (Sections 4.1 and 4.2 of “Scheming AIs”)

Joe Carlsmith

Dec 6, 2023, 7:28 PM

10 points

0 comments, 10 min read, LW link

The Sharp Right Turn: sudden deceptive alignment as a convergent goal

avturchin

Jun 6, 2023, 9:59 AM

38 points

5 comments, 1 min read, LW link

MetaAI: less is less for alignment.

Cleo Nardo

Jun 13, 2023, 2:08 PM

71 points

17 comments, 5 min read, LW link

[Question] Deceptive AI vs. shifting instrumental incentives

Aryeh Englander

Jun 26, 2023, 6:09 PM

7 points

2 comments, 3 min read, LW link

Ten Levels of AI Alignment Difficulty

Sammy Martin

Jul 3, 2023, 8:20 PM

131 points

24 comments, 12 min read, LW link, 1 review

3 levels of threat obfuscation

HoldenKarnofsky

Aug 2, 2023, 2:58 PM

69 points

14 comments, 7 min read, LW link

Apollo Research is hiring evals and interpretability engineers & scientists

Marius Hobbhahn

Aug 4, 2023, 10:54 AM

25 points

0 comments, 2 min read, LW link

Paper: On measuring situational awareness in LLMs

Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, Asa Cooper Stickland, Meg and Maximilian Kaufmann

Sep 4, 2023, 12:54 PM

109 points

16 comments, 5 min read, LW link

(arxiv.org)

Understanding strategic deception and deceptive alignment

Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and Dan Braun

Sep 25, 2023, 4:27 PM

64 points

16 comments, 7 min read, LW link

(www.apolloresearch.ai)

Sticky goals: a concrete experiment for understanding deceptive alignment

evhub

Sep 2, 2022, 9:57 PM

39 points

13 comments, 3 min read, LW link

Monitoring for deceptive alignment

evhub

Sep 8, 2022, 11:07 PM

135 points

8 comments, 9 min read, LW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes

Sep 9, 2022, 10:46 PM

99 points

7 comments, 10 min read, LW link

The Defender’s Advantage of Interpretability

Marius Hobbhahn

Sep 14, 2022, 2:05 PM

41 points

4 comments, 6 min read, LW link

Incentives and Selection: A Missing Frame From AI Threat Discussions?

DragonGod

Feb 26, 2023, 1:18 AM

11 points

16 comments, 2 min read, LW link

Deceptive Alignment is <1% Likely by Default

DavidW

Feb 21, 2023, 3:09 PM

89 points

31 comments, 14 min read, LW link, 1 review

Framings of Deceptive Alignment

peterbarnett

Apr 26, 2022, 4:25 AM

32 points

7 comments, 5 min read, LW link

Precursor checking for deceptive alignment

evhub

Aug 3, 2022, 10:56 PM

24 points

0 comments, 14 min read, LW link

We Have No Plan for Preventing Loss of Control in Open Models

Andrew Dickson

Mar 10, 2025, 3:35 PM

44 points

11 comments, 22 min read, LW link

How AI could workaround goals if rated by people

ProgramCrafter

Mar 19, 2023, 3:51 PM

1 point

1 comment, 1 min read, LW link

An Appeal to AI Superintelligence: Reasons Not to Preserve (most of) Humanity

Alex Beyman

Mar 22, 2023, 4:09 AM

−14 points

6 comments, 19 min read, LW link

Selfish AI Inevitable

Davey Morse

Feb 6, 2024, 4:29 AM

1 point

0 comments, 1 min read, LW link

[Question] What are some scenarios where an aligned AGI actually helps humanity, but many/most people don’t like it?

RomanS

Jan 10, 2025, 6:13 PM

13 points

6 comments, 3 min read, LW link

A tension between two prosaic alignment subgoals

Alex Lawsen

Mar 19, 2023, 2:07 PM

31 points

8 comments, 1 min read, LW link

Eliciting bad contexts

Geoffrey Irving, Joseph Bloom and Tomek Korbak

Jan 24, 2025, 10:39 AM

31 points

8 comments, 3 min read, LW link

Model Amnesty Project

themis

Jan 17, 2025, 6:53 PM

3 points

2 comments, 3 min read, LW link

Why deceptive alignment matters for AGI safety

Marius Hobbhahn

Sep 15, 2022, 1:38 PM

68 points

13 comments, 13 min read, LW link

The Human Alignment Problem for AIs

rife

Jan 22, 2025, 4:06 AM

10 points

5 comments, 3 min read, LW link

Do models know when they are being evaluated?

Govind Pimpale, Giles, Joe Needham and Marius Hobbhahn

Feb 17, 2025, 11:13 PM

54 points

3 comments, 12 min read, LW link

Untrusted monitoring insights from watching ChatGPT play coordination games

jwfiredragon

Jan 29, 2025, 4:53 AM

14 points

9 comments, 9 min read, LW link

10 Principles for Real Alignment

Adriaan

Apr 21, 2025, 10:18 PM

1 point

0 comments, 7 min read, LW link

Revealing alignment faking with a single prompt

Florian_Dietz

Jan 29, 2025, 9:01 PM

9 points

5 comments, 4 min read, LW link

The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments.

Shivam

Jan 30, 2025, 2:44 AM

1 point

0 comments, 11 min read, LW link

Levels of goals and alignment

zeshen

Sep 16, 2022, 4:44 PM

27 points

4 comments, 6 min read, LW link

Sleeper agents appear resilient to activation steering

Lucy Wingard

Feb 3, 2025, 7:31 PM

4 points

0 comments, 7 min read, LW link

Strong-Misalignment: Does Yudkowsky (or Christiano, or TurnTrout, or Wolfram, or…etc.) Have an Elevator Speech I’m Missing?

Benjamin Bourlier

Mar 15, 2024, 11:17 PM

−4 points

3 comments, 16 min read, LW link

Rational Effective Utopia & Narrow Way There: Multiversal AI Alignment, Place AI, New Ethicophysics… (Updated)

ank

Feb 11, 2025, 3:21 AM

13 points

8 comments, 35 min read, LW link

GPT-4 aligning with acasual decision theory when instructed to play games, but includes a CDT explanation that’s incorrect if they differ

Christopher King

Mar 23, 2023, 4:16 PM

7 points

4 comments, 8 min read, LW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus

Apr 6, 2023, 2:07 AM

2 points

1 comment, 1 min read, LW link

It matters when the first sharp left turn happens

Adam Jermyn

Sep 29, 2022, 8:12 PM

45 points

9 comments, 4 min read, LW link

Towards a solution to the alignment problem via objective detection and evaluation

Paul Colognese

Apr 12, 2023, 3:39 PM

9 points

7 comments, 12 min read, LW link

Natural language alignment

Jacy Reese Anthis

Apr 12, 2023, 7:02 PM

31 points

2 comments, 2 min read, LW link

AI Alignment: A Comprehensive Survey

Stephen McAleer

Nov 1, 2023, 5:35 PM

20 points

1 comment, 1 min read, LW link

(arxiv.org)

Smoke without fire is scary

Adam Jermyn

Oct 4, 2022, 9:08 PM

52 points

22 comments, 4 min read, LW link

Disentangling inner alignment failures

Erik Jenner

Oct 10, 2022, 6:50 PM

23 points

5 comments, 4 min read, LW link

Predictable Defect-Cooperate?

quetzal_rainbow

Nov 18, 2023, 3:38 PM

7 points

1 comment, 2 min read, LW link

Alignment is Hard: An Uncomputable Alignment Problem

Alexander Bistagne

Nov 19, 2023, 7:38 PM

−5 points

4 comments, 1 min read, LW link

(github.com)

Ambiguous out-of-distribution generalization on an algorithmic task

Wilson Wu and Louis Jaburi

Feb 13, 2025, 6:24 PM

82 points

6 comments, 11 min read, LW link

Greed Is the Root of This Evil

Thane Ruthenis

Oct 13, 2022, 8:40 PM

21 points

7 comments, 8 min read, LW link

What sorts of systems can be deceptive?

Andrei Alexandru

Oct 31, 2022, 10:00 PM

16 points

0 comments, 7 min read, LW link

Distillation of “How Likely Is Deceptive Alignment?”

NickGabs

Nov 18, 2022, 4:31 PM

24 points

4 comments, 10 min read, LW link

Trying to measure AI deception capabilities using temporary simulation fine-tuning

alenoach

May 4, 2023, 5:59 PM

4 points

0 comments, 7 min read, LW link

Alignment as Function Fitting

A.H.

May 6, 2023, 11:38 AM

7 points

0 comments, 12 min read, LW link

Simple experiments with deceptive alignment

Andreas_Moe

May 15, 2023, 5:41 PM

7 points

0 comments, 4 min read, LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy and Megan Kinniment

Dec 5, 2022, 8:28 PM

40 points

19 comments, 10 min read, LW link

Backdoors have universal representations across large language models

Amirali Abdullah, Narmeen, Dhruv Nathawani and nirmalendu prakash

Dec 6, 2024, 10:56 PM

14 points

0 comments, 16 min read, LW link

Open Source LLMs Can Now Actively Lie

Josh Levy

Jun 1, 2023, 10:03 PM

6 points

0 comments, 3 min read, LW link

Proposal: labs should precommit to pausing if an AI argues for itself to be improved

NickGabs

Jun 2, 2023, 10:31 PM

3 points

3 comments, 4 min read, LW link

Getting up to Speed on the Speed Prior in 2022

robertzk

Dec 28, 2022, 7:49 AM

36 points

5 comments, 65 min read, LW link

A way to make solving alignment 10.000 times easier. The shorter case for a massive open source simbox project.

AlexFromSafeTransition

Jun 21, 2023, 8:08 AM

2 points

16 comments, 14 min read, LW link

The commercial incentive to intentionally train AI to deceive us

Derek M. Jones

Dec 29, 2022, 11:30 AM

5 points

1 comment, 4 min read, LW link

(shape-of-code.com)

Deceptive failures short of full catastrophe.

Alex Lawsen

Jan 15, 2023, 7:28 PM

33 points

5 comments, 9 min read, LW link

Places of Loving Grace [Story]

ank

Feb 18, 2025, 11:49 PM

−1 points

0 comments, 4 min read, LW link

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav Fort

Aug 29, 2024, 5:17 PM

88 points

8 comments, 7 min read, LW link

Inducing Unprompted Misalignment in LLMs

Sam Svenningsen, evhub and Henry Sleight

Apr 19, 2024, 8:00 PM

38 points

7 comments, 16 min read, LW link

Invitation to the Princeton AI Alignment and Safety Seminar

Sadhika Malladi

Mar 17, 2024, 1:10 AM

6 points

1 comment, 1 min read, LW link

Instrumental deception and manipulation in LLMs—a case study

Olli Järviniemi

Feb 24, 2024, 2:07 AM

39 points

13 comments, 12 min read, LW link

Hidden Cognition Detection Methods and Benchmarks

Paul Colognese

Feb 26, 2024, 5:31 AM

22 points

11 comments, 4 min read, LW link

Language Models Model Us

eggsyntax

May 17, 2024, 9:00 PM

158 points

55 comments, 7 min read, LW link

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Johannes Treutlein and Owain_Evans

Jun 21, 2024, 3:54 PM

163 points

13 comments, 8 min read, LW link

(arxiv.org)

Sparse Features Through Time

Rogan Inglis

Jun 24, 2024, 6:06 PM

12 points

1 comment, 1 min read, LW link

(roganinglis.io)

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Olli Järviniemi and evhub

May 6, 2024, 7:07 AM

95 points

13 comments, 1 min read, LW link

(arxiv.org)

Disincentivizing deception in mesa optimizers with Model Tampering

martinkunev

Jul 11, 2023, 12:44 AM

3 points

0 comments, 2 min read, LW link

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ank

Feb 22, 2025, 12:12 AM

1 point

0 comments, 6 min read, LW link

Control Vectors as Dispositional Traits

Gianluca Calcagni

Jun 23, 2024, 9:34 PM

10 points

0 comments, 11 min read, LW link

Ablations for “Frontier Models are Capable of In-context Scheming”

AlexMeinke, Bronson Schoen, Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and rusheb

Dec 17, 2024, 11:58 PM

115 points

1 comment, 2 min read, LW link

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_Dietz

Feb 17, 2024, 8:45 AM

4 points

0 comments, 13 min read, LW link

Takes on “Alignment Faking in Large Language Models”

Joe Carlsmith

Dec 18, 2024, 6:22 PM

105 points

7 comments, 62 min read, LW link

[Question] Does human (mis)alignment pose a significant and imminent existential threat?

jr

Feb 23, 2025, 10:03 AM

6 points

3 comments, 1 min read, LW link

Self-Other Overlap: A Neglected Approach to AI Alignment

Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena, Cameron Berg and AE Studio

Jul 30, 2024, 4:22 PM

215 points

51 comments, 12 min read, LW link

Measuring whether AIs can statelessly strategize to subvert security measures

Alex Mallen and Buck

Dec 19, 2024, 9:25 PM

62 points

0 comments, 11 min read, LW link

Ethical Deception: Should AI Ever Lie?

Jason Reid

Aug 2, 2024, 5:53 PM

5 points

2 comments, 7 min read, LW link

Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive

Justausername

Jul 23, 2023, 4:08 PM

4 points

1 comment, 3 min read, LW link

(Partial) failure in replicating deceptive alignment experiment

claudia.biancotti

Jan 7, 2024, 5:56 PM

1 point

0 comments, 1 min read, LW link

Autonomous Alignment Oversight Framework (AAOF)

Justausername

Jul 25, 2023, 10:25 AM

−9 points

0 comments, 4 min read, LW link

[Question] Has Anthropic checked if Claude fakes alignment for intended values too?

Maloew

Dec 23, 2024, 12:43 AM

4 points

1 comment, 1 min read, LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper

Feb 19, 2023, 3:25 PM

30 points

5 comments, 4 min read, LW link

Deception Chess

Chris Land

Jan 1, 2024, 3:40 PM

7 points

2 comments, 4 min read, LW link

A Dialogue on Deceptive Alignment Risks

Rauno Arike

Sep 25, 2024, 4:10 PM

11 points

0 comments, 18 min read, LW link

Anomalous Concept Detection for Detecting Hidden Cognition

Paul Colognese

Mar 4, 2024, 4:52 PM

24 points

3 comments, 10 min read, LW link

When can we trust model evaluations?

evhub

Jul 28, 2023, 7:42 PM

166 points

10 comments, 10 min read, LW link, 1 review

How training-gamers might function (and win)

Vivek Hebbar

Apr 11, 2025, 9:26 PM

92 points

4 comments, 12 min read, LW link

Mesa-Optimization: Explain it like I’m 10 Edition

brook

Aug 26, 2023, 11:04 PM

20 points

1 comment, 6 min read, LW link

Why humans won’t control superhuman AIs.

Spiritus Dei

Oct 16, 2024, 4:48 PM

−11 points

1 comment, 6 min read, LW link

Mapping AI Architectures to Alignment Attractors: A SIEM-Based Framework

silentrevolutions

Apr 12, 2025, 5:50 PM

1 point

0 comments, 1 min read, LW link

Correcting Deceptive Alignment using a Deontological Approach

JeaniceK

Apr 14, 2025, 10:07 PM

2 points

0 comments, 7 min read, LW link

Insights from a Lawyer turned AI Safety researcher (ShortForm)

Katalina Hernandez

Mar 3, 2025, 7:14 PM

1 point

5 comments, 1 min read, LW link

[Question] Wouldn’t an intelligent agent keep us alive and help us align itself to our values in order to prevent risk ? by Risk I mean experimentation by trying to align potentially smarter replicas?

Terrence Rotoufle

Mar 21, 2023, 5:44 PM

−3 points

1 comment, 2 min read, LW link

High-level interpretability: detecting an AI’s objectives

Paul Colognese and Jozdien

Sep 28, 2023, 7:30 PM

72 points

4 comments, 21 min read, LW link

AI Deception: A Survey of Examples, Risks, and Potential Solutions

Simon Goldstein and Peter S. Park

Aug 29, 2023, 1:29 AM

54 points

3 comments, 10 min read, LW link

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB, Owain_Evans and SoerenMind

Sep 28, 2023, 6:53 PM

187 points

39 comments, 3 min read, LW link, 1 review

Do we want alignment faking?

Florian_Dietz

Feb 28, 2025, 9:50 PM

7 points

4 comments, 1 min read, LW link

Thoughts On (Solving) Deep Deception

Jozdien

Oct 21, 2023, 10:40 PM

72 points

6 comments, 6 min read, LW link

Cautions about LLMs in Human Cognitive Loops

Alice Blair

Mar 2, 2025, 7:53 PM

38 points

9 comments, 7 min read, LW link

The Hidden Cost of Our Lies to AI

Nicholas Andresen

Mar 6, 2025, 5:03 AM

138 points

17 comments, 7 min read, LW link

(substack.com)
