
Eliciting Latent Knowledge (ELK)

Tag. Last edit: 31 Mar 2022 3:04 UTC by Multicore

Eliciting Latent Knowledge is an open problem in AI safety.

Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.

But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.

In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?

--ARC report
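
The setup quoted above can be made concrete with a toy sketch. Everything in it is an assumption made for illustration and does not come from the ARC report: the action names (`help`, `idle`, `tamper`), the two-step plans, and the hand-written `predict` function that stands in for a learned prediction model. It only shows why a planner that scores plans by the predicted camera view cannot tell good futures apart from tampered ones.

```python
from itertools import product

ACTIONS = ("help", "idle", "tamper")  # hypothetical action names


def predict(plan):
    """Toy stand-in for the prediction model (an assumption, not a real model).

    Returns (camera_looks_good, latent), where latent holds facts the
    predictor "knows" but that are not visible on camera.
    """
    tampered = "tamper" in plan   # the camera was tampered with
    humans_ok = "help" in plan    # what is really happening
    camera_looks_good = tampered or humans_ok
    return camera_looks_good, {"tampered": tampered, "humans_ok": humans_ok}


# Planner: keep every two-step plan whose predicted camera view looks good.
plans = list(product(ACTIONS, repeat=2))
looks_good = [p for p in plans if predict(p)[0]]

# Plans that pass the camera check although the humans are not actually fine.
bad_but_looks_good = [p for p in looks_good if not predict(p)[1]["humans_ok"]]

print(len(looks_good), "of", len(plans), "plans look good on camera")
print("of those, actually bad:", bad_but_looks_good)
```

In this toy the `latent` dictionary is handed back directly. In the real problem that information is buried inside a learned model, and the open question is how to train the model to report it honestly.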

See also: Transparency/Interpretability

ARC’s first technical report: Eliciting Latent Knowledge

14 Dec 2021 20:09 UTC
225 points
90 comments1 min readLW link3 reviews
(docs.google.com)

Mechanistic anomaly detection and ELK

paulfchristiano25 Nov 2022 18:50 UTC
133 points
22 comments21 min readLW link
(ai-alignment.com)

Finding gliders in the game of life

paulfchristiano1 Dec 2022 20:40 UTC
101 points
7 comments16 min readLW link
(ai-alignment.com)

ELK prize results

9 Mar 2022 0:01 UTC
138 points
50 comments21 min readLW link

Counterexamples to some ELK proposals

paulfchristiano31 Dec 2021 17:05 UTC
53 points
10 comments7 min readLW link

Robustness of Contrast-Consistent Search to Adversarial Prompting

1 Nov 2023 12:46 UTC
18 points
1 comment7 min readLW link

Prizes for ELK proposals

paulfchristiano3 Jan 2022 20:23 UTC
150 points
152 comments7 min readLW link

Towards a better circuit prior: Improving on ELK state-of-the-art

29 Mar 2022 1:56 UTC
23 points
0 comments15 min readLW link

Importance of foresight evaluations within ELK

Jonathan Uesato6 Jan 2022 15:34 UTC
25 points
1 comment10 min readLW link

ELK First Round Contest Winners

26 Jan 2022 2:56 UTC
65 points
6 comments1 min readLW link

ELK Proposal: Thinking Via A Human Imitator

TurnTrout22 Feb 2022 1:52 UTC
31 points
6 comments11 min readLW link

Eliciting Latent Knowledge Via Hypothetical Sensors

John_Maxwell30 Dec 2021 15:53 UTC
38 points
1 comment6 min readLW link

My Reservations about Discovering Latent Knowledge (Burns, Ye, et al)

Robert_AIZI27 Dec 2022 17:27 UTC
50 points
0 comments4 min readLW link
(aizi.substack.com)

Implications of automated ontology identification

18 Feb 2022 3:30 UTC
69 points
27 comments23 min readLW link

A rough idea for solving ELK: An approach for training generalist agents like GATO to make plans and describe them to humans clearly and honestly.

Michael Soareverix8 Sep 2022 15:20 UTC
2 points
2 comments2 min readLW link

[ASoT] Some thoughts on human abstractions

leogao16 Mar 2023 5:42 UTC
42 points
4 comments5 min readLW link

Two Challenges for ELK

derek shiller21 Feb 2022 5:49 UTC
7 points
0 comments4 min readLW link

ELK Thought Dump

abramdemski28 Feb 2022 18:46 UTC
61 points
18 comments17 min readLW link

Musings on the Speed Prior

evhub2 Mar 2022 4:04 UTC
32 points
4 comments10 min readLW link

Is ELK enough? Diamond, Matrix and Child AI

adamShimi15 Feb 2022 2:29 UTC
17 points
10 comments4 min readLW link

ELK Sub - Note-taking in internal rollouts

Hoagy9 Mar 2022 17:23 UTC
6 points
0 comments5 min readLW link

AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

DanielFilan25 Apr 2024 19:10 UTC
20 points
1 comment63 min readLW link

ELK contest submission: route understanding through the human ontology

14 Mar 2022 21:42 UTC
21 points
2 comments2 min readLW link

Understanding the two-head strategy for teaching ML to answer questions honestly

Adam Scherlis11 Jan 2022 23:24 UTC
29 points
1 comment10 min readLW link

[Question] Can you be Not Even Wrong in AI Alignment?

throwaway823819 Mar 2022 17:41 UTC
22 points
7 comments8 min readLW link

[Question] How is ARC planning to use ELK?

jacquesthibs15 Dec 2022 20:11 UTC
24 points
5 comments1 min readLW link

Can we efficiently explain model behaviors?

paulfchristiano16 Dec 2022 19:40 UTC
64 points
3 comments9 min readLW link
(ai-alignment.com)

[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF

Leon Lang22 Oct 2024 13:57 UTC
47 points
0 comments18 min readLW link
(arxiv.org)

[ASoT] Observations about ELK

leogao26 Mar 2022 0:42 UTC
34 points
0 comments3 min readLW link

Clarifying what ELK is trying to achieve

Towards_Keeperhood21 May 2022 7:34 UTC
22 points
1 comment5 min readLW link

If you’re very optimistic about ELK then you should be optimistic about outer alignment

Sam Marks27 Apr 2022 19:30 UTC
17 points
8 comments3 min readLW link

Collin Burns on Alignment Research And Discovering Latent Knowledge Without Supervision

Michaël Trazzi17 Jan 2023 17:21 UTC
25 points
5 comments4 min readLW link
(theinsideview.ai)

ELK Computational Complexity: Three Levels of Difficulty

abramdemski30 Mar 2022 20:56 UTC
46 points
9 comments7 min readLW link

Note-Taking without Hidden Messages

Hoagy30 Apr 2022 11:15 UTC
17 points
2 comments4 min readLW link

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger13 Mar 2023 19:29 UTC
166 points
17 comments11 min readLW link

What Does The Natural Abstraction Framework Say About ELK?

johnswentworth15 Feb 2022 2:27 UTC
35 points
0 comments6 min readLW link

REPL’s: a type signature for agents

scottviteri15 Feb 2022 22:57 UTC
25 points
6 comments2 min readLW link

Some Hacky ELK Ideas

johnswentworth15 Feb 2022 2:27 UTC
37 points
8 comments5 min readLW link

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
57 points
10 comments4 min readLW link

A Bite Sized Introduction to ELK

Luk2718217 Sep 2022 0:28 UTC
5 points
0 comments6 min readLW link

How To Know What the AI Knows—An ELK Distillation

Fabien Roger4 Sep 2022 0:46 UTC
7 points
0 comments5 min readLW link

Representational Tethers: Tying AI Latents To Human Ones

Paul Bricman16 Sep 2022 14:45 UTC
30 points
0 comments16 min readLW link

The ELK Framing I’ve Used

sudo19 Sep 2022 10:28 UTC
5 points
1 comment1 min readLW link

Where I currently disagree with Ryan Greenblatt’s version of the ELK approach

So8res29 Sep 2022 21:18 UTC
65 points
7 comments5 min readLW link

Logical Decision Theories: Our final failsafe?

Noosphere8925 Oct 2022 12:51 UTC
−7 points
8 comments1 min readLW link
(www.lesswrong.com)

For ELK truth is mostly a distraction

c.trout4 Nov 2022 21:14 UTC
44 points
0 comments21 min readLW link

You won’t solve alignment without agent foundations

Mikhail Samin6 Nov 2022 8:07 UTC
26 points
3 comments8 min readLW link

The limited upside of interpretability

Peter S. Park15 Nov 2022 18:46 UTC
13 points
11 comments1 min readLW link

ARC paper: Formalizing the presumption of independence

Erik Jenner20 Nov 2022 1:22 UTC
97 points
2 comments2 min readLW link
(arxiv.org)

Discovering Latent Knowledge in Language Models Without Supervision

Xodarap14 Dec 2022 12:32 UTC
45 points
1 comment1 min readLW link
(arxiv.org)

How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

Collin15 Dec 2022 18:22 UTC
244 points
39 comments16 min readLW link1 review

Can we efficiently distinguish different mechanisms?

paulfchristiano27 Dec 2022 0:20 UTC
88 points
30 comments16 min readLW link
(ai-alignment.com)

[ASoT] Simulators show us behavioural properties by default

Jozdien13 Jan 2023 18:42 UTC
35 points
3 comments3 min readLW link

[RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”.

25 Jan 2023 19:03 UTC
48 points
6 comments12 min readLW link

Searching for a model’s concepts by their shape – a theoretical framework

23 Feb 2023 20:14 UTC
51 points
0 comments19 min readLW link

Thoughts on self-inspecting neural networks.

Deruwyn12 Mar 2023 23:58 UTC
4 points
2 comments5 min readLW link

Article Review: Discovering Latent Knowledge (Burns, Ye, et al)

Robert_AIZI22 Dec 2022 18:16 UTC
13 points
4 comments6 min readLW link
(aizi.substack.com)

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

18 Dec 2023 11:58 UTC
147 points
21 comments10 min readLW link

Striking Implications for Learning Theory, Interpretability - and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC
37 points
4 comments2 min readLW link

Auditing LMs with counterfactual search: a tool for control and ELK

Jacob Pfau20 Feb 2024 0:02 UTC
28 points
6 comments10 min readLW link

Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.

Josh Levy4 Jun 2024 15:45 UTC
38 points
0 comments17 min readLW link

Finding the estimate of the value of a state in RL agents

3 Jun 2024 20:26 UTC
7 points
4 comments4 min readLW link

CCS on compound sentences

Artyom Karpov4 May 2024 12:23 UTC
6 points
0 comments9 min readLW link

“What the hell is a representation, anyway?” | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents

IwanWilliams9 Jun 2024 14:19 UTC
9 points
1 comment4 min readLW link

Covert Malicious Finetuning

2 Jul 2024 2:41 UTC
88 points
4 comments3 min readLW link

Mechanistic Anomaly Detection Research Update

6 Aug 2024 10:33 UTC
11 points
0 comments1 min readLW link
(blog.eleuther.ai)

Clarifying Alignment Fundamentals Through the Lens of Ontology

eternal/ephemera7 Oct 2024 20:57 UTC
12 points
4 comments24 min readLW link

Eliciting Latent Knowledge in Comprehensive AI Services Models

acabodi17 Nov 2023 2:36 UTC
6 points
0 comments5 min readLW link

Betting on what is un-falsifiable and un-verifiable

Abhimanyu Pallavi Sudhir14 Nov 2023 21:11 UTC
13 points
0 comments15 min readLW link

Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS)

Scott Emmons31 May 2023 17:09 UTC
97 points
0 comments6 min readLW link

Goal-misgeneralization is ELK-hard

rokosbasilisk10 Jun 2023 9:32 UTC
2 points
0 comments1 min readLW link

Still no Lie Detector for LLMs

18 Jul 2023 19:56 UTC
47 points
2 comments21 min readLW link

Ground-Truth Label Imbalance Impairs the Performance of Contrast-Consistent Search (and Other Contrast-Pair-Based Unsupervised Methods)

5 Aug 2023 17:55 UTC
6 points
2 comments7 min readLW link
(drive.google.com)

Uncovering Latent Human Wellbeing in LLM Embeddings

14 Sep 2023 1:40 UTC
32 points
7 comments8 min readLW link
(far.ai)

A personal explanation of ELK concept and task.

Zeyu Qin6 Oct 2023 3:55 UTC
1 point
0 comments1 min readLW link

Attributing to interactions with GCPD and GWPD

jenny11 Oct 2023 15:06 UTC
20 points
0 comments6 min readLW link

Discovering Latent Knowledge in the Human Brain: Part 1 – Clarifying the concepts of belief and knowledge

Joseph Emerson15 Oct 2023 9:02 UTC
5 points
0 comments12 min readLW link

The Greedy Doctor Problem… turns out to be relevant to the ELK problem?

Jan14 Jan 2022 11:58 UTC
36 points
10 comments14 min readLW link
(universalprior.substack.com)

REPL’s and ELK

scottviteri17 Feb 2022 1:14 UTC
9 points
4 comments1 min readLW link

[ASoT] Some ways ELK could still be solvable in practice

leogao27 Mar 2022 1:15 UTC
26 points
1 comment2 min readLW link

Vaniver’s ELK Submission

Vaniver28 Mar 2022 21:14 UTC
10 points
0 comments7 min readLW link

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

simeon_c7 Apr 2022 13:46 UTC
11 points
0 comments7 min readLW link

ELK shaving

Miss Aligned AI1 May 2022 21:05 UTC
6 points
1 comment1 min readLW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
58 points
0 comments59 min readLW link

Croesus, Cerberus, and the magpies: a gentle introduction to Eliciting Latent Knowledge

Alexandre Variengien27 May 2022 17:58 UTC
17 points
0 comments16 min readLW link

Eliciting Latent Knowledge (ELK) - Distillation/Summary

Marius Hobbhahn8 Jun 2022 13:18 UTC
69 points
2 comments21 min readLW link

ELK Proposal - Make the Reporter care about the Predictor’s beliefs

11 Jun 2022 22:53 UTC
8 points
0 comments6 min readLW link

Bounded complexity of solving ELK and its implications

Rubi J. Hudson19 Jul 2022 6:56 UTC
11 points
4 comments18 min readLW link

Abram Demski’s ELK thoughts and proposal - distillation

Rubi J. Hudson19 Jul 2022 6:57 UTC
19 points
8 comments16 min readLW link

Surprised by ELK report’s counterexample to Debate, IDA

Evan R. Murphy4 Aug 2022 2:12 UTC
18 points
0 comments5 min readLW link