
Eliciting Latent Knowledge

Last edit: Jan 17, 2025, 10:04 PM by Dakara

Eliciting Latent Knowledge (ELK) is an open problem in AI safety. ARC's technical report introduces it as follows:

Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.

But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.

In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?

– ARC report
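The failure mode described above can be made concrete with a toy sketch (not from the report; all names here are hypothetical illustrations). A "reporter" trained to match human labels can fit them perfectly by simulating what the camera shows, while a reporter that honestly translates the predictor's latent knowledge fits them *worse*, because it disagrees with the human whenever the camera was tampered with:

```python
import random

# Toy world: each episode has a true outcome and a camera reading.
# If the camera is tampered with, it shows "happy" regardless of the truth.
def sample_episode(rng):
    tampered = rng.random() < 0.3
    truly_happy = rng.random() < 0.5
    on_camera_happy = True if tampered else truly_happy
    # The predictor's latent state contains both facts ...
    latent = {"truly_happy": truly_happy, "tampered": tampered}
    # ... but the human labeller only sees the camera.
    return latent, on_camera_happy

rng = random.Random(0)
data = [sample_episode(rng) for _ in range(10_000)]

# "Direct translator": reports the latent fact we actually care about.
direct = lambda latent: latent["truly_happy"]
# "Human simulator": reports what the camera (and hence the human) would say.
simulator = lambda latent: latent["truly_happy"] or latent["tampered"]

def training_accuracy(reporter):
    return sum(reporter(l) == y for l, y in data) / len(data)

print(training_accuracy(direct))     # < 1.0: disagrees on tampered episodes
print(training_accuracy(simulator))  # 1.0: perfectly matches human labels
```

The point of the sketch is that naive training pressure (maximize agreement with human labels) favors the unsafe human simulator, which is why ELK asks for some additional principle that selects the direct translator instead.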

See also: Transparency / Interpretability

ARC’s first technical report: Eliciting Latent Knowledge

Dec 14, 2021, 8:09 PM
228 points
90 comments · 1 min read · LW link · 3 reviews
(docs.google.com)

Mechanistic anomaly detection and ELK

paulfchristiano · Nov 25, 2022, 6:50 PM
138 points
22 comments · 21 min read · LW link
(ai-alignment.com)

Finding gliders in the game of life

paulfchristiano · Dec 1, 2022, 8:40 PM
104 points
8 comments · 16 min read · LW link
(ai-alignment.com)

ELK prize results

Mar 9, 2022, 12:01 AM
138 points
50 comments · 21 min read · LW link

Counterexamples to some ELK proposals

paulfchristiano · Dec 31, 2021, 5:05 PM
53 points
10 comments · 7 min read · LW link

Prizes for ELK proposals

paulfchristiano · Jan 3, 2022, 8:23 PM
150 points
152 comments · 7 min read · LW link

Robustness of Contrast-Consistent Search to Adversarial Prompting

Nov 1, 2023, 12:46 PM
18 points
1 comment · 7 min read · LW link

ELK Proposal: Thinking Via A Human Imitator

TurnTrout · Feb 22, 2022, 1:52 AM
31 points
6 comments · 11 min read · LW link

Importance of foresight evaluations within ELK

Jonathan Uesato · Jan 6, 2022, 3:34 PM
25 points
1 comment · 10 min read · LW link

Towards a better circuit prior: Improving on ELK state-of-the-art

Mar 29, 2022, 1:56 AM
23 points
0 comments · 15 min read · LW link

Eliciting Latent Knowledge Via Hypothetical Sensors

John_Maxwell · Dec 30, 2021, 3:53 PM
38 points
1 comment · 6 min read · LW link

ELK First Round Contest Winners

Jan 26, 2022, 2:56 AM
65 points
6 comments · 1 min read · LW link

My Reservations about Discovering Latent Knowledge (Burns, Ye, et al)

Robert_AIZI · Dec 27, 2022, 5:27 PM
50 points
0 comments · 4 min read · LW link
(aizi.substack.com)

Implications of automated ontology identification

Feb 18, 2022, 3:30 AM
69 points
27 comments · 23 min read · LW link

Can we efficiently explain model behaviors?

paulfchristiano · Dec 16, 2022, 7:40 PM
64 points
3 comments · 9 min read · LW link
(ai-alignment.com)

Measurement tampering detection as a special case of weak-to-strong generalization

Dec 23, 2023, 12:05 AM
57 points
10 comments · 4 min read · LW link

AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

DanielFilan · Apr 25, 2024, 7:10 PM
20 points
1 comment · 63 min read · LW link

[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF

Leon Lang · Oct 22, 2024, 1:57 PM
51 points
2 comments · 18 min read · LW link
(arxiv.org)

Understanding the two-head strategy for teaching ML to answer questions honestly

Adam Scherlis · Jan 11, 2022, 11:24 PM
29 points
1 comment · 10 min read · LW link

Is ELK enough? Diamond, Matrix and Child AI

adamShimi · Feb 15, 2022, 2:29 AM
17 points
10 comments · 4 min read · LW link

What Does The Natural Abstraction Framework Say About ELK?

johnswentworth · Feb 15, 2022, 2:27 AM
35 points
0 comments · 6 min read · LW link

Some Hacky ELK Ideas

johnswentworth · Feb 15, 2022, 2:27 AM
37 points
8 comments · 5 min read · LW link

REPL’s: a type signature for agents

scottviteri · Feb 15, 2022, 10:57 PM
25 points
6 comments · 2 min read · LW link

Two Challenges for ELK

derek shiller · Feb 21, 2022, 5:49 AM
7 points
0 comments · 4 min read · LW link

ELK Thought Dump

abramdemski · Feb 28, 2022, 6:46 PM
61 points
18 comments · 17 min read · LW link

Musings on the Speed Prior

evhub · Mar 2, 2022, 4:04 AM
33 points
4 comments · 10 min read · LW link

ELK Sub—Note-taking in internal rollouts

Hoagy · Mar 9, 2022, 5:23 PM
6 points
0 comments · 5 min read · LW link

ELK contest submission: route understanding through the human ontology

Mar 14, 2022, 9:42 PM
21 points
2 comments · 2 min read · LW link

[Question] Can you be Not Even Wrong in AI Alignment?

throwaway8238 · Mar 19, 2022, 5:41 PM
22 points
7 comments · 8 min read · LW link

[ASoT] Observations about ELK

leogao · Mar 26, 2022, 12:42 AM
34 points
0 comments · 3 min read · LW link

ELK Computational Complexity: Three Levels of Difficulty

abramdemski · Mar 30, 2022, 8:56 PM
46 points
9 comments · 7 min read · LW link

If you’re very optimistic about ELK then you should be optimistic about outer alignment

Sam Marks · Apr 27, 2022, 7:30 PM
17 points
8 comments · 3 min read · LW link

Note-Taking without Hidden Messages

Hoagy · Apr 30, 2022, 11:15 AM
17 points
2 comments · 4 min read · LW link

Clarifying what ELK is trying to achieve

Towards_Keeperhood · May 21, 2022, 7:34 AM
22 points
1 comment · 5 min read · LW link

A rough idea for solving ELK: An approach for training generalist agents like GATO to make plans and describe them to humans clearly and honestly.

Michael Soareverix · Sep 8, 2022, 3:20 PM
2 points
2 comments · 2 min read · LW link

[Question] How is ARC planning to use ELK?

jacquesthibs · Dec 15, 2022, 8:11 PM
24 points
5 comments · 1 min read · LW link

Collin Burns on Alignment Research And Discovering Latent Knowledge Without Supervision

Michaël Trazzi · Jan 17, 2023, 5:21 PM
25 points
5 comments · 4 min read · LW link
(theinsideview.ai)

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger · Mar 13, 2023, 7:29 PM
166 points
17 comments · 11 min read · LW link

[ASoT] Some thoughts on human abstractions

leogao · Mar 16, 2023, 5:42 AM
42 points
4 comments · 5 min read · LW link

The Greedy Doctor Problem… turns out to be relevant to the ELK problem?

Jan · Jan 14, 2022, 11:58 AM
36 points
10 comments · 14 min read · LW link
(universalprior.substack.com)
(universalprior.substack.com)

Covert Malicious Finetuning

Jul 2, 2024, 2:41 AM
89 points
4 comments · 3 min read · LW link

For ELK truth is mostly a distraction

c.trout · Nov 4, 2022, 9:14 PM
44 points
0 comments · 21 min read · LW link

You won’t solve alignment without agent foundations

Mikhail Samin · Nov 6, 2022, 8:07 AM
27 points
3 comments · 8 min read · LW link

The limited upside of interpretability

Peter S. Park · Nov 15, 2022, 6:46 PM
13 points
11 comments · 1 min read · LW link

Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.

Josh Levy · Jun 4, 2024, 3:45 PM
39 points
0 comments · 18 min read · LW link

REPL’s and ELK

scottviteri · Feb 17, 2022, 1:14 AM
9 points
4 comments · 1 min read · LW link

“What the hell is a representation, anyway?” | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents

IwanWilliams · Jun 9, 2024, 2:19 PM
9 points
1 comment · 4 min read · LW link

CCS on compound sentences

Artyom Karpov · May 4, 2024, 12:23 PM
6 points
0 comments · 9 min read · LW link

ARC paper: Formalizing the presumption of independence

Erik Jenner · Nov 20, 2022, 1:22 AM
97 points
2 comments · 2 min read · LW link
(arxiv.org)

Auditing LMs with counterfactual search: a tool for control and ELK

Jacob Pfau · Feb 20, 2024, 12:02 AM
28 points
6 comments · 10 min read · LW link

Discovering Latent Knowledge in Language Models Without Supervision

Xodarap · Dec 14, 2022, 12:32 PM
45 points
1 comment · 1 min read · LW link
(arxiv.org)

Finding the estimate of the value of a state in RL agents

Jun 3, 2024, 8:26 PM
8 points
4 comments · 4 min read · LW link

Thoughts on self-inspecting neural networks.

Deruwyn · Mar 12, 2023, 11:58 PM
4 points
2 comments · 5 min read · LW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley · Jan 5, 2024, 8:46 AM
37 points
4 comments · 2 min read · LW link

Article Review: Discovering Latent Knowledge (Burns, Ye, et al)

Robert_AIZI · Dec 22, 2022, 6:16 PM
13 points
4 comments · 6 min read · LW link
(aizi.substack.com)

[ASoT] Some ways ELK could still be solvable in practice

leogao · Mar 27, 2022, 1:15 AM
26 points
1 comment · 2 min read · LW link

Vaniver’s ELK Submission

Vaniver · Mar 28, 2022, 9:14 PM
10 points
0 comments · 7 min read · LW link

How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

Collin · Dec 15, 2022, 6:22 PM
244 points
39 comments · 16 min read · LW link · 1 review

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

simeon_c · Apr 7, 2022, 1:46 PM
11 points
0 comments · 7 min read · LW link

Can we efficiently distinguish different mechanisms?

paulfchristiano · Dec 27, 2022, 12:20 AM
91 points
30 comments · 16 min read · LW link
(ai-alignment.com)

[ASoT] Simulators show us behavioural properties by default

Jozdien · Jan 13, 2023, 6:42 PM
35 points
3 comments · 3 min read · LW link

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Dec 18, 2023, 11:58 AM
147 points
21 comments · 10 min read · LW link

[RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”.

Jan 25, 2023, 7:03 PM
48 points
6 comments · 12 min read · LW link

ELK shaving

Miss Aligned AI · May 1, 2022, 9:05 PM
6 points
1 comment · 1 min read · LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy · May 12, 2022, 8:01 PM
58 points
0 comments · 59 min read · LW link

Croesus, Cerberus, and the magpies: a gentle introduction to Eliciting Latent Knowledge

Alexandre Variengien · May 27, 2022, 5:58 PM
17 points
0 comments · 16 min read · LW link

Eliciting Latent Knowledge (ELK) - Distillation/Summary

Marius Hobbhahn · Jun 8, 2022, 1:18 PM
69 points
2 comments · 21 min read · LW link

ELK Proposal—Make the Reporter care about the Predictor’s beliefs

Jun 11, 2022, 10:53 PM
8 points
0 comments · 6 min read · LW link

Bounded complexity of solving ELK and its implications

Rubi J. Hudson · Jul 19, 2022, 6:56 AM
11 points
4 comments · 18 min read · LW link

Abram Demski’s ELK thoughts and proposal—distillation

Rubi J. Hudson · Jul 19, 2022, 6:57 AM
19 points
8 comments · 16 min read · LW link

Surprised by ELK report’s counterexample to Debate, IDA

Evan R. Murphy · Aug 4, 2022, 2:12 AM
18 points
0 comments · 5 min read · LW link

A Bite Sized Introduction to ELK

Luk27182 · Sep 17, 2022, 12:28 AM
5 points
0 comments · 6 min read · LW link

How To Know What the AI Knows—An ELK Distillation

Fabien Roger · Sep 4, 2022, 12:46 AM
7 points
0 comments · 5 min read · LW link

Searching for a model’s concepts by their shape – a theoretical framework

Feb 23, 2023, 8:14 PM
51 points
0 comments · 19 min read · LW link

Representational Tethers: Tying AI Latents To Human Ones

Paul Bricman · Sep 16, 2022, 2:45 PM
30 points
0 comments · 16 min read · LW link

The ELK Framing I’ve Used

sudo · Sep 19, 2022, 10:28 AM
5 points
1 comment · 1 min read · LW link

Where I currently disagree with Ryan Greenblatt’s version of the ELK approach

So8res · Sep 29, 2022, 9:18 PM
65 points
7 comments · 5 min read · LW link

Towards building blocks of ontologies

Feb 8, 2025, 4:03 PM
27 points
0 comments · 26 min read · LW link

Half-baked idea: a straightforward method for learning environmental goals?

Q Home · Feb 4, 2025, 6:56 AM
16 points
7 comments · 5 min read · LW link

Eliciting Latent Knowledge in Comprehensive AI Services Models

acabodi · Nov 17, 2023, 2:36 AM
6 points
0 comments · 5 min read · LW link

Betting on what is un-falsifiable and un-verifiable

Abhimanyu Pallavi Sudhir · Nov 14, 2023, 9:11 PM
13 points
0 comments · 15 min read · LW link

Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS)

Scott Emmons · May 31, 2023, 5:09 PM
97 points
1 comment · 6 min read · LW link · 1 review

Goal-misgeneralization is ELK-hard

rokosbasilisk · Jun 10, 2023, 9:32 AM
2 points
0 comments · 1 min read · LW link

Still no Lie Detector for LLMs

Jul 18, 2023, 7:56 PM
50 points
2 comments · 21 min read · LW link

Ground-Truth Label Imbalance Impairs the Performance of Contrast-Consistent Search (and Other Contrast-Pair-Based Unsupervised Methods)

Aug 5, 2023, 5:55 PM
6 points
2 comments · 7 min read · LW link
(drive.google.com)

Uncovering Latent Human Wellbeing in LLM Embeddings

Sep 14, 2023, 1:40 AM
32 points
7 comments · 8 min read · LW link
(far.ai)

A personal explanation of ELK concept and task.

Zeyu Qin · Oct 6, 2023, 3:55 AM
1 point
0 comments · 1 min read · LW link

Attributing to interactions with GCPD and GWPD

jenny · Oct 11, 2023, 3:06 PM
20 points
0 comments · 6 min read · LW link

Discovering Latent Knowledge in the Human Brain: Part 1 – Clarifying the concepts of belief and knowledge

Joseph Emerson · Oct 15, 2023, 9:02 AM
5 points
0 comments · 12 min read · LW link

Locating and Editing Knowledge in LMs

Dhananjay Ashok · Jan 24, 2025, 10:53 PM
1 point
0 comments · 4 min read · LW link

[Question] Popular materials about environmental goals/agent foundations? People wanting to discuss such topics?

Q Home · Jan 22, 2025, 3:30 AM
5 points
0 comments · 1 min read · LW link

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Florian_Dietz · Mar 10, 2025, 4:07 PM
35 points
3 comments · 9 min read · LW link

Clarifying Alignment Fundamentals Through the Lens of Ontology

eternal/ephemera · Oct 7, 2024, 8:57 PM
12 points
4 comments · 24 min read · LW link

Mechanistic Anomaly Detection Research Update

Aug 6, 2024, 10:33 AM
11 points
0 comments · 1 min read · LW link
(blog.eleuther.ai)

Logical Decision Theories: Our final failsafe?

Noosphere89 · Oct 25, 2022, 12:51 PM
−7 points
8 comments · 1 min read · LW link
(www.lesswrong.com)